[jira] [Commented] (SPARK-20818) Tab-to-autocomplete in IPython + Python3 results in job execution

2017-05-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018719#comment-16018719
 ] 

Sean Owen commented on SPARK-20818:
---

Isn't this a jupyter issue? I'm not clear what Spark can do about it. 

> Tab-to-autocomplete in IPython + Python3 results in job execution
> -
>
> Key: SPARK-20818
> URL: https://issues.apache.org/jira/browse/SPARK-20818
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: Peter Parente
>  Labels: ipython, jupyter
>
> Using Spark in a Jupyter Notebook 5.0 with an IPython 6.0 kernel and Python 
> 3, when I press Tab to autocomplete the function names on a DataFrame or RDD, 
> Spark executes the job graph constructed thus far. This only appears to 
> happen for certain autocompletions, namely completions that resolve to 
> functions having the @ignore_unicode_prefix decorator applied to them, and 
> only in Python 3 (never in Python 2). 
> https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L146
> Indeed, this function does have a special case for Python 3 in which it 
> rewrites function docstrings. Why IPython autocompletion has a bad 
> interaction with this logic is unknown (to me at least) at the moment.
> I reproduced this bug on Spark 2.0.2. The code in the decorator hasn't 
> changed in 2.1.x, so the bug likely impacts that version as well.
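For context, the decorator referenced above rewrites a function's docstring on Python 3 so that doctest output written with u'' prefixes still matches. A simplified sketch of that idea (based on the linked rdd.py, but not a verbatim copy of the pyspark code) looks roughly like this:

{code}
import re
import sys

def ignore_unicode_prefix(f):
    """Rough sketch: strip u'' prefixes from a docstring on Python 3."""
    if sys.version_info[0] >= 3 and f.__doc__:
        # Python 3 reprs carry no u'' prefix, so rewrite the docstring in place
        # so the embedded doctests still match their expected output.
        literal_re = re.compile(r"(\W|^)[uU](['])", re.UNICODE)
        f.__doc__ = literal_re.sub(r"\1\2", f.__doc__)
    return f
{code}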



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20819) Enhance ColumnVector to keep UnsafeArrayData for other types

2017-05-21 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-20819:


 Summary: Enhance ColumnVector to keep UnsafeArrayData for other 
types
 Key: SPARK-20819
 URL: https://issues.apache.org/jira/browse/SPARK-20819
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20819) Enhance ColumnVector to keep UnsafeArrayData for other types

2017-05-21 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018732#comment-16018732
 ] 

Kazuaki Ishizaki commented on SPARK-20819:
--

Follow-on of SPARK-20783 to support other type such as Map and struct

> Enhance ColumnVector to keep UnsafeArrayData for other types
> 
>
> Key: SPARK-20819
> URL: https://issues.apache.org/jira/browse/SPARK-20819
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20819) Enhance ColumnVector to keep UnsafeArrayData for other types

2017-05-21 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018732#comment-16018732
 ] 

Kazuaki Ishizaki edited comment on SPARK-20819 at 5/21/17 7:58 AM:
---

Follow-on of SPARK-20783 to support other types such as Map and struct


was (Author: kiszk):
Follow-on of SPARK-20783 to support other type such as Map and struct

> Enhance ColumnVector to keep UnsafeArrayData for other types
> 
>
> Key: SPARK-20819
> URL: https://issues.apache.org/jira/browse/SPARK-20819
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20820) Add compression/decompression of data to ColumnVector for other compression schemes

2017-05-21 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-20820:


 Summary: Add compression/decompression of data to ColumnVector for 
other compression schemes
 Key: SPARK-20820
 URL: https://issues.apache.org/jira/browse/SPARK-20820
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20821) Add compression/decompression of data to ColumnVector for other data types

2017-05-21 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-20821:


 Summary: Add compression/decompression of data to ColumnVector for 
other data types
 Key: SPARK-20821
 URL: https://issues.apache.org/jira/browse/SPARK-20821
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20822) Generate code for table cache using ColumnarBatch

2017-05-21 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-20822:


 Summary: Generate code for table cache using ColumnarBatch
 Key: SPARK-20822
 URL: https://issues.apache.org/jira/browse/SPARK-20822
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20823) Generate code for table cache using ColumnarBatch for other types

2017-05-21 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-20823:


 Summary: Generate code for table cache using ColumnarBatch for 
other types
 Key: SPARK-20823
 URL: https://issues.apache.org/jira/browse/SPARK-20823
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20824) Generate code that directly gets a value from each column in CachedBatch for table cache

2017-05-21 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-20824:
-
Summary: Generate code that directly gets a value from each column in 
CachedBatch for table cache  (was: Generate code that a value from each column 
in CachedBatch for table cache)

> Generate code that directly gets a value from each column in CachedBatch for 
> table cache
> 
>
> Key: SPARK-20824
> URL: https://issues.apache.org/jira/browse/SPARK-20824
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20824) Generate code that a value from each column in CachedBatch for table cache

2017-05-21 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-20824:


 Summary: Generate code that a value from each column in 
CachedBatch for table cache
 Key: SPARK-20824
 URL: https://issues.apache.org/jira/browse/SPARK-20824
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20825) Generate code that directly gets a value from each column in CachedBatch for table cache for other data types

2017-05-21 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-20825:


 Summary: Generate code that directly gets a value from each column 
in CachedBatch for table cache for other data types
 Key: SPARK-20825
 URL: https://issues.apache.org/jira/browse/SPARK-20825
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20826) Support compression/decompression of ColumnVector in generated code

2017-05-21 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-20826:


 Summary: Support compression/decompression of ColumnVector in 
generated code
 Key: SPARK-20826
 URL: https://issues.apache.org/jira/browse/SPARK-20826
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20820) Add compression/decompression of data to ColumnVector for other compression schemes

2017-05-21 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018736#comment-16018736
 ] 

Kazuaki Ishizaki commented on SPARK-20820:
--

Follow-on of SPARK-20807. Support additional compression schemes such as 
{{intDelta}} and {{longDelta}}.
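For illustration only (plain Python, not the Spark implementation, and assuming {{intDelta}}/{{longDelta}} denote delta encodings for int and long columns): a delta scheme keeps the first value and then stores each value as the difference from its predecessor, which compresses well for slowly changing columns.

{code}
def delta_encode(values):
    # Keep the first value, then store each value as a delta from its predecessor.
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]

def delta_decode(deltas):
    # Rebuild the original values by accumulating the deltas.
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

assert delta_decode(delta_encode([100, 101, 103, 106])) == [100, 101, 103, 106]
{code}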


> Add compression/decompression of data to ColumnVector for other compression 
> schemes
> ---
>
> Key: SPARK-20820
> URL: https://issues.apache.org/jira/browse/SPARK-20820
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20821) Add compression/decompression of data to ColumnVector for other data types

2017-05-21 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018737#comment-16018737
 ] 

Kazuaki Ishizaki commented on SPARK-20821:
--

Follow-on of SPARK-20807. Support other data types such as float, double, and 
array.

> Add compression/decompression of data to ColumnVector for other data types
> --
>
> Key: SPARK-20821
> URL: https://issues.apache.org/jira/browse/SPARK-20821
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20823) Generate code for table cache using ColumnarBatch for other types

2017-05-21 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018742#comment-16018742
 ] 

Kazuaki Ishizaki commented on SPARK-20823:
--

Follow-on of SPARK-20822.
Generate Java code for table cache using ColumnarBatch for other data types 
such as long and array.

> Generate code for table cache using ColumnarBatch for other types
> -
>
> Key: SPARK-20823
> URL: https://issues.apache.org/jira/browse/SPARK-20823
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20822) Generate code for table cache using ColumnarBatch

2017-05-21 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018738#comment-16018738
 ] 

Kazuaki Ishizaki commented on SPARK-20822:
--

Waiting for merge of SPARK-20770.
Generate Java code for table cache using ColumnarBatch. At first, support int 
and double data types.

> Generate code for table cache using ColumnarBatch
> -
>
> Key: SPARK-20822
> URL: https://issues.apache.org/jira/browse/SPARK-20822
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20824) Generate code that directly gets a value from each column in CachedBatch for table cache

2017-05-21 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018744#comment-16018744
 ] 

Kazuaki Ishizaki edited comment on SPARK-20824 at 5/21/17 8:18 AM:
---

Follow-on of SPARK-20822.
Generate Java code that directly gets a value from each column in CachedBatch 
for table cache. It will replace the current while-loop with a for-loop 
generated by whole-stage codegen.
At first, support int and double data types.


was (Author: kiszk):
Follow-on of SPARK-20822.
Generate Java code that directly gets a value from each column in CachedBatch 
for table cache. It will replace the current while-loop with a for-loop 
generated by whole-stage codegen.

> Generate code that directly gets a value from each column in CachedBatch for 
> table cache
> 
>
> Key: SPARK-20824
> URL: https://issues.apache.org/jira/browse/SPARK-20824
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20824) Generate code that directly gets a value from each column in CachedBatch for table cache

2017-05-21 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018744#comment-16018744
 ] 

Kazuaki Ishizaki commented on SPARK-20824:
--

Follow-on of SPARK-20822.
Generate Java code that directly gets a value from each column in CachedBatch 
for table cache. It will replace the current while-loop with a for-loop 
generated by whole-stage codegen.

> Generate code that directly gets a value from each column in CachedBatch for 
> table cache
> 
>
> Key: SPARK-20824
> URL: https://issues.apache.org/jira/browse/SPARK-20824
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20825) Generate code that directly gets a value from each column in CachedBatch for table cache for other data types

2017-05-21 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018746#comment-16018746
 ] 

Kazuaki Ishizaki commented on SPARK-20825:
--

Follow-on of SPARK-20824.
Generate Java code that directly gets a value from each column in CachedBatch 
for table cache. Support other data types such as long and array.

> Generate code that directly gets a value from each column in CachedBatch for 
> table cache for other data types
> -
>
> Key: SPARK-20825
> URL: https://issues.apache.org/jira/browse/SPARK-20825
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20826) Support compression/decompression of ColumnVector in generated code

2017-05-21 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018748#comment-16018748
 ] 

Kazuaki Ishizaki commented on SPARK-20826:
--

Waiting for the merge of SPARK-20820, SPARK-20822, and SPARK-20824.
Support compression and decompression of ColumnVector in generated code that 
uses ColumnarBatch for table cache.

> Support compression/decompression of ColumnVector in generated code
> ---
>
> Key: SPARK-20826
> URL: https://issues.apache.org/jira/browse/SPARK-20826
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20754) Add Function Alias For MOD/TRUNCT/POSITION

2017-05-21 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018757#comment-16018757
 ] 

Yuming Wang commented on SPARK-20754:
-

I am working on this.

> Add Function Alias For MOD/TRUNCT/POSITION
> --
>
> Key: SPARK-20754
> URL: https://issues.apache.org/jira/browse/SPARK-20754
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>  Labels: starter
>
> We already have implementations of the following functions. We can add 
> function aliases to be consistent with ANSI. 
> {noformat} 
> MOD(m, n)
> {noformat} 
> Returns the remainder of m divided by n. Returns m if n is 0.
> {noformat} 
> TRUNC(x, D)
> {noformat} 
> Returns the number x, truncated to D decimals. If D is 0, the result will 
> have no decimal point or fractional part. If D is negative, the number is 
> zeroed out.
> {noformat} 
> POSITION(char IN source)
> {noformat} 
> Returns the position of the char in the source string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20399) Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser

2017-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018796#comment-16018796
 ] 

Apache Spark commented on SPARK-20399:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/18048

> Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string 
> in parser
> --
>
> Key: SPARK-20399
> URL: https://issues.apache.org/jira/browse/SPARK-20399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.0
>
>
> The new SQL parser was introduced in Spark 2.0. It seems to bring an issue 
> regarding regex pattern strings.
> The following code can reproduce it:
> {code}
> val data = Seq("\u0020\u0021\u0023", "abc")
> val df = data.toDF()
> // 1st usage: works in 1.6
> // Let parser parse pattern string
> val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
> // 2nd usage: works in 1.6, 2.x
> // Call Column.rlike so the pattern string is a literal which doesn't go 
> // through the parser
> val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))
> // In 2.x, we need to add backslashes so the regex pattern is parsed correctly
> val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'")
> {code}
> Because the parser unescapes the SQL string, the first usage, which works in 
> 1.6, no longer works in 2.0. To make it work, we need to add additional backslashes.
> It is quite odd that we can't use the same regex pattern string in the two 
> usages. I think we should not unescape the regex pattern string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20827) cannot express HAVING without a GROUP BY clause

2017-05-21 Thread N Campbell (JIRA)
N Campbell created SPARK-20827:
--

 Summary: cannot express HAVING without a GROUP BY clause
 Key: SPARK-20827
 URL: https://issues.apache.org/jira/browse/SPARK-20827
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: N Campbell


Spark SQL does not support a HAVING clause without a GROUP BY, which is valid 
SQL and supported by other engines (Oracle, DB2, ...).

SELECT
'' AS `C1`
FROM
`cert`.`tparts`
 HAVING 
COUNT(`pno`) > 0

SQL state: java.lang.UnsupportedOperationException: Cannot evaluate expression: 
count(input[0, string, true]), Query: SELECT
'' AS `C1`
FROM
`cert`.`tparts`
 HAVING 
COUNT(`pno`) > 0.
SQLState:  HY000
ErrorCode: 500051
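A possible workaround sketch (an assumption, not a confirmed fix; it presumes an active SparkSession named spark): compute the aggregate in a subquery and filter it with WHERE instead of HAVING.

{code}
# Hypothetical rewrite of the failing query: the aggregate moves into a
# subquery so the filter becomes an ordinary WHERE clause.
workaround = spark.sql("""
    SELECT '' AS C1
    FROM (SELECT COUNT(pno) AS cnt FROM cert.tparts) agg
    WHERE agg.cnt > 0
""")
workaround.show()
{code}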



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20828) Concatenated grouping sets scenario not supported

2017-05-21 Thread N Campbell (JIRA)
N Campbell created SPARK-20828:
--

 Summary: Concatenated grouping sets scenario not supported 
 Key: SPARK-20828
 URL: https://issues.apache.org/jira/browse/SPARK-20828
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: N Campbell


The following scenario, supported by other vendors (e.g. Oracle, DB2, ...), is 
not supported by Spark SQL:

 WITH 
SQL1 AS 
(
SELECT
sno AS C1, 
pno AS C2, 
SUM(qty) AS C3
FROM
cert.tsupply 
GROUP BY 
ROLLUP(sno), 
CUBE(pno)
)
SELECT
SQL1.C1 AS C1, 
SQL1.C2 AS C2, 
SQL1.C3 AS C3
FROM
SQL1

Error: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error 
Code: ERROR_STATE, SQL state: org.apache.spark.sql.AnalysisException: 
expression 'tsupply.`sno`' is neither present in the group by, nor is it an 
aggregate function. Add to group by or wrap in first() (or first_value) if you 
don't care which value you get.;;
'Project ['SQL1.C1 AS C1#1517671, 'SQL1.C2 AS C2#1517672, 'SQL1.C3 AS 
C3#1517673]
+- 'SubqueryAlias SQL1
   +- 'Aggregate [rollup(sno#1517678), cube(pno#1517679)], [sno#1517678 AS 
C1#1517674, pno#1517679 AS C2#1517675, sum(cast(qty#1517681 as bigint)) AS 
C3#1517676L]
  +- MetastoreRelation cert, tsupply
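One possible rewrite sketch (an assumption, not a verified equivalent): spell out the grouping sets that the concatenated ROLLUP/CUBE denotes, since ROLLUP(sno) combined with CUBE(pno) expands to {(sno, pno), (sno), (pno), ()}, and explicit GROUPING SETS are accepted by Spark SQL. spark is assumed to be an active SparkSession with access to the cert database.

{code}
# Hypothetical equivalent of SQL1 using explicit GROUPING SETS.
sql1 = spark.sql("""
    SELECT sno AS C1, pno AS C2, SUM(qty) AS C3
    FROM cert.tsupply
    GROUP BY sno, pno
    GROUPING SETS ((sno, pno), (sno), (pno), ())
""")
sql1.show()
{code}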




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2017-05-21 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018840#comment-16018840
 ] 

Barry Becker commented on SPARK-16845:
--

I checked out the v2.1.1 tag of Spark from GitHub, but when I build and try 
to run all the unit tests, it fails on this test:

- GenerateOrdering with FloatType
- GenerateOrdering with ShortType
- SPARK-16845: GeneratedClass$SpecificOrdering grows beyond 64 KB *** FAILED ***
  com.google.common.util.concurrent.ExecutionError: java.lang.StackOverflowError
  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
  at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
  at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
  at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:905)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:188)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:43)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:889)
  at 
org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply$mcV$sp(OrderingSuite.scala:138)
  at 
org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply(OrderingSuite.scala:131)
  ...
  Cause: java.lang.StackOverflowError:
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  ...

The command I used is
 ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.5 package
Did I do something wrong?

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: hejie
>Assignee: Liwei Lin
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
> Attachments: error.txt.zip
>
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20827) cannot express HAVING without a GROUP BY clause

2017-05-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20827:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> cannot express HAVING without a GROUP BY clause
> ---
>
> Key: SPARK-20827
> URL: https://issues.apache.org/jira/browse/SPARK-20827
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: N Campbell
>Priority: Minor
>
> Spark SQL does not support a HAVING clause without a GROUP BY, which is valid 
> SQL and supported by other engines (Oracle, DB2, ...).
> SELECT
> '' AS `C1`
> FROM
> `cert`.`tparts`
>  HAVING 
> COUNT(`pno`) > 0
> SQL state: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: count(input[0, string, true]), Query: SELECT
> '' AS `C1`
> FROM
> `cert`.`tparts`
>  HAVING 
> COUNT(`pno`) > 0.
> SQLState:  HY000
> ErrorCode: 500051



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20829) var_sampe returns Nan while other vendors return a null value

2017-05-21 Thread N Campbell (JIRA)
N Campbell created SPARK-20829:
--

 Summary: var_sampe returns Nan while other vendors return a null 
value
 Key: SPARK-20829
 URL: https://issues.apache.org/jira/browse/SPARK-20829
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: N Campbell
Priority: Minor


SELECT
sno AS SNO, 
pno AS PNO, 
VAR_SAMP(qty) AS C1
FROM
tsupply 
GROUP BY 
sno, 
pno


create table  if not exists TSUPPLY (RNUM int  , SNO string, PNO string, JNO 
string, QTY int  )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' 
 STORED AS textfile ;
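A possible consumer-side workaround sketch (illustrative only; it assumes spark is an active SparkSession and the tsupply table has been loaded): map NaN back to null with nanvl after the aggregation.

{code}
from pyspark.sql import functions as F

df = spark.table("tsupply")
# Replace NaN (produced for single-row groups) with a null double.
result = (df.groupBy("sno", "pno")
            .agg(F.nanvl(F.var_samp("qty"), F.lit(None).cast("double")).alias("C1")))
result.show()
{code}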






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20829) var_samp returns Nan while other vendors return a null value

2017-05-21 Thread N Campbell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

N Campbell updated SPARK-20829:
---
Summary: var_samp returns Nan while other vendors return a null value  
(was: var_sampe returns Nan while other vendors return a null value)

> var_samp returns Nan while other vendors return a null value
> 
>
> Key: SPARK-20829
> URL: https://issues.apache.org/jira/browse/SPARK-20829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: N Campbell
>Priority: Minor
>
> SELECT
> sno AS SNO, 
> pno AS PNO, 
> VAR_SAMP(qty) AS C1
> FROM
> tsupply 
> GROUP BY 
> sno, 
> pno
> create table  if not exists TSUPPLY (RNUM int  , SNO string, PNO string, JNO 
> string, QTY int  )
>  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' 
>  STORED AS textfile ;



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20829) var_samp returns Nan while other vendors return a null value

2017-05-21 Thread N Campbell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

N Campbell updated SPARK-20829:
---
Attachment: TSUPPLY

> var_samp returns Nan while other vendors return a null value
> 
>
> Key: SPARK-20829
> URL: https://issues.apache.org/jira/browse/SPARK-20829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: N Campbell
>Priority: Minor
> Attachments: TSUPPLY
>
>
> SELECT
> sno AS SNO, 
> pno AS PNO, 
> VAR_SAMP(qty) AS C1
> FROM
> tsupply 
> GROUP BY 
> sno, 
> pno
> create table  if not exists TSUPPLY (RNUM int  , SNO string, PNO string, JNO 
> string, QTY int  )
>  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' 
>  STORED AS textfile ;



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20830) PySpark wrappers for explode_outer and posexplode_outer

2017-05-21 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-20830:
--

 Summary: PySpark wrappers for explode_outer and posexplode_outer
 Key: SPARK-20830
 URL: https://issues.apache.org/jira/browse/SPARK-20830
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.2.0
Reporter: Maciej Szymkiewicz


Implement Python wrappers for {{o.a.s.sql.functions.explode_outer}} and 
{{o.a.s.sql.functions.posexplode_outer}}
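A minimal sketch of what such a wrapper might look like, following the pattern of the existing wrappers in pyspark.sql.functions (this is an assumption, not the merged implementation):

{code}
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def explode_outer(col):
    """Return a new row for each element in the given array or map column;
    unlike explode, rows with null or empty input are kept (sketch only)."""
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.explode_outer(_to_java_column(col)))
{code}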



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20830) PySpark wrappers for explode_outer and posexplode_outer

2017-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018893#comment-16018893
 ] 

Apache Spark commented on SPARK-20830:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/18049

> PySpark wrappers for explode_outer and posexplode_outer
> ---
>
> Key: SPARK-20830
> URL: https://issues.apache.org/jira/browse/SPARK-20830
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> Implement Python wrappers for {{o.a.s.sql.functions.explode_outer}} and 
> {{o.a.s.sql.functions.posexplode_outer}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20830) PySpark wrappers for explode_outer and posexplode_outer

2017-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20830:


Assignee: (was: Apache Spark)

> PySpark wrappers for explode_outer and posexplode_outer
> ---
>
> Key: SPARK-20830
> URL: https://issues.apache.org/jira/browse/SPARK-20830
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> Implement Python wrappers for {{o.a.s.sql.functions.explode_outer}} and 
> {{o.a.s.sql.functions.posexplode_outer}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20830) PySpark wrappers for explode_outer and posexplode_outer

2017-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20830:


Assignee: Apache Spark

> PySpark wrappers for explode_outer and posexplode_outer
> ---
>
> Key: SPARK-20830
> URL: https://issues.apache.org/jira/browse/SPARK-20830
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>
> Implement Python wrappers for {{o.a.s.sql.functions.explode_outer}} and 
> {{o.a.s.sql.functions.posexplode_outer}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20073) Unexpected Cartesian product when using eqNullSafe in join with a derived table

2017-05-21 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018907#comment-16018907
 ] 

koert kuipers commented on SPARK-20073:
---

A left join of a dataframe x with the results of an aggregation on that same 
dataframe x is one of the most common join patterns I know of. As a result 
of bugs like this I will always defensively rename or alias columns in joins, 
negating the benefit of Spark supporting duplicate column names in dataframes 
(after all, one of the main rationales for having duplicate column names was to 
support joins). 
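For example, the defensive renaming described above might look like this in PySpark (illustrative only; vc_name is a made-up alias and the dataframes stand in for PySpark counterparts of the ones in the report below):

{code}
# Rename the derived table's join key up front so the join condition and the
# resulting columns stay unambiguous, whichever equality operator is used.
vc = variantCounts.withColumnRenamed("name", "vc_name")
joined = vc.join(people, vc["vc_name"] == people["name"], "inner")
joined.show()
{code}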


> Unexpected Cartesian product when using eqNullSafe in join with a derived 
> table
> ---
>
> Key: SPARK-20073
> URL: https://issues.apache.org/jira/browse/SPARK-20073
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Everett Anderson
>  Labels: correctness
>
> It appears that if you try to join tables A and B when B is derived from A 
> and you use the eqNullSafe / <=> operator for the join condition, Spark 
> performs a Cartesian product.
> However, if you perform the join on tables of the same data when they don't 
> have a relationship, the expected non-Cartesian product join occurs.
> {noformat}
> // Create some fake data.
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Dataset
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions
> val peopleRowsRDD = sc.parallelize(Seq(
> Row("Fred", 8, 1),
> Row("Fred", 8, 2),
> Row(null, 10, 3),
> Row(null, 10, 4),
> Row("Amy", 12, 5),
> Row("Amy", 12, 6)))
> 
> val peopleSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("group", IntegerType, nullable = true),
> StructField("data", IntegerType, nullable = true)))
> 
> val people = spark.createDataFrame(peopleRowsRDD, peopleSchema)
> people.createOrReplaceTempView("people")
> scala> people.show
> +----+-----+----+
> |name|group|data|
> +----+-----+----+
> |Fred|    8|   1|
> |Fred|    8|   2|
> |null|   10|   3|
> |null|   10|   4|
> | Amy|   12|   5|
> | Amy|   12|   6|
> +----+-----+----+
> // Now create a derived table from that table. It doesn't matter much what.
> val variantCounts = spark.sql("select name, count(distinct(name, group, 
> data)) as variant_count from people group by name having variant_count > 1")
> variantCounts.show
> +----+-------------+
> |name|variant_count|
> +----+-------------+
> |Fred|            2|
> |null|            2|
> | Amy|            2|
> +----+-------------+
> // Now try an inner join using the regular equalTo that drops nulls. This 
> // works fine.
> val innerJoinEqualTo = variantCounts.join(people, 
> variantCounts("name").equalTo(people("name")))
> innerJoinEqualTo.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+
> // Okay now lets switch to the <=> operator
> //
> // If you haven't set spark.sql.crossJoin.enabled=true, you'll get an error 
> // like
> // "Cartesian joins could be prohibitively expensive and are disabled by 
> // default. To explicitly enable them, please set spark.sql.crossJoin.enabled = 
> // true;"
> //
> // If you have enabled them, you'll get the table below.
> //
> // However, we really don't want or expect a Cartesian product!
> val innerJoinSqlNullSafeEqOp = variantCounts.join(people, 
> variantCounts("name")<=>(people("name")))
> innerJoinSqlNullSafeEqOp.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> |Fred|            2|null|   10|   3|
> |Fred|            2|null|   10|   4|
> |Fred|            2| Amy|   12|   5|
> |Fred|            2| Amy|   12|   6|
> |null|            2|Fred|    8|   1|
> |null|            2|Fred|    8|   2|
> |null|            2|null|   10|   3|
> |null|            2|null|   10|   4|
> |null|            2| Amy|   12|   5|
> |null|            2| Amy|   12|   6|
> | Amy|            2|Fred|    8|   1|
> | Amy|            2|Fred|    8|   2|
> | Amy|            2|null|   10|   3|
> | Amy|            2|null|   10|   4|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+
> // Okay, let's try to construct the exact same va

[jira] [Commented] (SPARK-20818) Tab-to-autocomplete in IPython + Python3 results in job execution

2017-05-21 Thread Peter Parente (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018920#comment-16018920
 ] 

Peter Parente commented on SPARK-20818:
---

I opened an issue against IPython at the same time, 
https://github.com/ipython/ipython/issues/10580, as I wasn't sure yet if it was 
also evaluating in other situations. It's looking like a jedi issue, so I'll 
close the defect here.

> Tab-to-autocomplete in IPython + Python3 results in job execution
> -
>
> Key: SPARK-20818
> URL: https://issues.apache.org/jira/browse/SPARK-20818
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: Peter Parente
>  Labels: ipython, jupyter
>
> Using Spark in a Jupyter Notebook 5.0 with an IPython 6.0 kernel and Python 
> 3, when I press Tab to autocomplete the function names on a DataFrame or RDD, 
> Spark executes the job graph constructed thus far. This only appears to 
> happen for certain autocompletions, namely completions that resolve to 
> functions having the @ignore_unicode_prefix decorator applied to them, and 
> only in Python 3 (never in Python 2). 
> https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L146
> Indeed, this function does have a special case for Python 3 in which it 
> rewrites function docstrings. Why IPython autocompletion has a bad 
> interaction with this logic is unknown (to me at least) at the moment.
> I reproduced this bug on Spark 2.0.2. The code in the decorator hasn't 
> changed in 2.1.x, so the bug likely impacts that version as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20818) Tab-to-autocomplete in IPython + Python3 results in job execution

2017-05-21 Thread Peter Parente (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Parente closed SPARK-20818.
-
Resolution: Not A Problem

> Tab-to-autocomplete in IPython + Python3 results in job execution
> -
>
> Key: SPARK-20818
> URL: https://issues.apache.org/jira/browse/SPARK-20818
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: Peter Parente
>  Labels: ipython, jupyter
>
> Using Spark in a Jupyter Notebook 5.0 with an IPython 6.0 kernel and Python 
> 3, when I press Tab to autocomplete the function names on a DataFrame or RDD, 
> Spark executes the job graph constructed thus far. This only appears to 
> happen for certain autocompletions, namely completions that resolve to 
> functions having the @ignore_unicode_prefix decorator applied to them, and 
> only in Python 3 (never in Python 2). 
> https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L146
> Indeed, this function does have a special case for Python 3 in which it 
> rewrites function docstrings. Why IPython autocompletion has a bad 
> interaction with this logic is unknown (to me at least) at the moment.
> I reproduced this bug on Spark 2.0.2. The code in the decorator hasn't 
> changed in 2.1.x, so the bug likely impacts that version as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20831) Unresolved operator when INSERT OVERWRITE data source tables with IF NOT EXISTS

2017-05-21 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20831:
---

 Summary: Unresolved operator when INSERT OVERWRITE data source 
tables with IF NOT EXISTS
 Key: SPARK-20831
 URL: https://issues.apache.org/jira/browse/SPARK-20831
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1, 2.2.0
Reporter: Xiao Li
Assignee: Xiao Li


Currently, we have a bug when we specify `IF NOT EXISTS` in `INSERT OVERWRITE` 
data source tables. For example, given a query:
{noformat}
INSERT OVERWRITE TABLE $tableName partition (b=2, c=3) IF NOT EXISTS SELECT 9, 
10
{noformat}
we will get the following error:
{noformat}
unresolved operator 'InsertIntoTable Relation[a#425,d#426,b#427,c#428] parquet, 
Map(b -> Some(2), c -> Some(3)), true, true;;
'InsertIntoTable Relation[a#425,d#426,b#427,c#428] parquet, Map(b -> Some(2), c 
-> Some(3)), true, true
+- Project [cast(9#423 as int) AS a#429, cast(10#424 as int) AS d#430]
   +- Project [9 AS 9#423, 10 AS 10#424]
  +- OneRowRelation$
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20831) Unresolved operator when INSERT OVERWRITE data source tables with IF NOT EXISTS

2017-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018942#comment-16018942
 ] 

Apache Spark commented on SPARK-20831:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/18050

> Unresolved operator when INSERT OVERWRITE data source tables with IF NOT 
> EXISTS
> ---
>
> Key: SPARK-20831
> URL: https://issues.apache.org/jira/browse/SPARK-20831
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Currently, we have a bug when we specify `IF NOT EXISTS` in `INSERT 
> OVERWRITE` data source tables. For example, given a query:
> {noformat}
> INSERT OVERWRITE TABLE $tableName partition (b=2, c=3) IF NOT EXISTS SELECT 
> 9, 10
> {noformat}
> we will get the following error:
> {noformat}
> unresolved operator 'InsertIntoTable Relation[a#425,d#426,b#427,c#428] 
> parquet, Map(b -> Some(2), c -> Some(3)), true, true;;
> 'InsertIntoTable Relation[a#425,d#426,b#427,c#428] parquet, Map(b -> Some(2), 
> c -> Some(3)), true, true
> +- Project [cast(9#423 as int) AS a#429, cast(10#424 as int) AS d#430]
>+- Project [9 AS 9#423, 10 AS 10#424]
>   +- OneRowRelation$
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20831) Unresolved operator when INSERT OVERWRITE data source tables with IF NOT EXISTS

2017-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20831:


Assignee: Apache Spark  (was: Xiao Li)

> Unresolved operator when INSERT OVERWRITE data source tables with IF NOT 
> EXISTS
> ---
>
> Key: SPARK-20831
> URL: https://issues.apache.org/jira/browse/SPARK-20831
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Currently, we have a bug when we specify `IF NOT EXISTS` in `INSERT 
> OVERWRITE` data source tables. For example, given a query:
> {noformat}
> INSERT OVERWRITE TABLE $tableName partition (b=2, c=3) IF NOT EXISTS SELECT 
> 9, 10
> {noformat}
> we will get the following error:
> {noformat}
> unresolved operator 'InsertIntoTable Relation[a#425,d#426,b#427,c#428] 
> parquet, Map(b -> Some(2), c -> Some(3)), true, true;;
> 'InsertIntoTable Relation[a#425,d#426,b#427,c#428] parquet, Map(b -> Some(2), 
> c -> Some(3)), true, true
> +- Project [cast(9#423 as int) AS a#429, cast(10#424 as int) AS d#430]
>+- Project [9 AS 9#423, 10 AS 10#424]
>   +- OneRowRelation$
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20831) Unresolved operator when INSERT OVERWRITE data source tables with IF NOT EXISTS

2017-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20831:


Assignee: Xiao Li  (was: Apache Spark)

> Unresolved operator when INSERT OVERWRITE data source tables with IF NOT 
> EXISTS
> ---
>
> Key: SPARK-20831
> URL: https://issues.apache.org/jira/browse/SPARK-20831
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Currently, we have a bug when we specify `IF NOT EXISTS` in `INSERT 
> OVERWRITE` data source tables. For example, given a query:
> {noformat}
> INSERT OVERWRITE TABLE $tableName partition (b=2, c=3) IF NOT EXISTS SELECT 
> 9, 10
> {noformat}
> we will get the following error:
> {noformat}
> unresolved operator 'InsertIntoTable Relation[a#425,d#426,b#427,c#428] 
> parquet, Map(b -> Some(2), c -> Some(3)), true, true;;
> 'InsertIntoTable Relation[a#425,d#426,b#427,c#428] parquet, Map(b -> Some(2), 
> c -> Some(3)), true, true
> +- Project [cast(9#423 as int) AS a#429, cast(10#424 as int) AS d#430]
>+- Project [9 AS 9#423, 10 AS 10#424]
>   +- OneRowRelation$
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18825) Eliminate duplicate links in SparkR API doc index

2017-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018945#comment-16018945
 ] 

Apache Spark commented on SPARK-18825:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/18051

> Eliminate duplicate links in SparkR API doc index
> -
>
> Key: SPARK-18825
> URL: https://issues.apache.org/jira/browse/SPARK-18825
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> The SparkR API docs contain many duplicate links with suffixes {{-method}} or 
> {{-class}} in the index . E.g., {{atan}} and {{atan-method}} link to the same 
> doc.
> Copying from [~felixcheung] in [SPARK-18332]:
> {quote}
> They are because of the
> {{@ aliases}}
> tags. I think we are adding them because CRAN checks require them to match 
> the specific format - [~shivaram] would you know?
> I am pretty sure they are double-listed because in addition to aliases we 
> also have
> {{@ rdname}}
> which automatically generate the links as well.
> I suspect if we change all the rdname to match the string in aliases then 
> there will be one link. I can take a shot at this to test this out, but 
> changes will be very extensive - is this something we could get into 2.1 
> still?
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18825) Eliminate duplicate links in SparkR API doc index

2017-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18825:


Assignee: Apache Spark

> Eliminate duplicate links in SparkR API doc index
> -
>
> Key: SPARK-18825
> URL: https://issues.apache.org/jira/browse/SPARK-18825
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> The SparkR API docs contain many duplicate links with suffixes {{-method}} or 
> {{-class}} in the index . E.g., {{atan}} and {{atan-method}} link to the same 
> doc.
> Copying from [~felixcheung] in [SPARK-18332]:
> {quote}
> They are because of the
> {{@ aliases}}
> tags. I think we are adding them because CRAN checks require them to match 
> the specific format - [~shivaram] would you know?
> I am pretty sure they are double-listed because in addition to aliases we 
> also have
> {{@ rdname}}
> which automatically generate the links as well.
> I suspect if we change all the rdname to match the string in aliases then 
> there will be one link. I can take a shot at this to test this out, but 
> changes will be very extensive - is this something we could get into 2.1 
> still?
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18825) Eliminate duplicate links in SparkR API doc index

2017-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18825:


Assignee: (was: Apache Spark)

> Eliminate duplicate links in SparkR API doc index
> -
>
> Key: SPARK-18825
> URL: https://issues.apache.org/jira/browse/SPARK-18825
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> The SparkR API docs contain many duplicate links with suffixes {{-method}} or 
> {{-class}} in the index . E.g., {{atan}} and {{atan-method}} link to the same 
> doc.
> Copying from [~felixcheung] in [SPARK-18332]:
> {quote}
> They are because of the
> {{@ aliases}}
> tags. I think we are adding them because CRAN checks require them to match 
> the specific format - [~shivaram] would you know?
> I am pretty sure they are double-listed because in addition to aliases we 
> also have
> {{@ rdname}}
> which automatically generate the links as well.
> I suspect if we change all the rdname to match the string in aliases then 
> there will be one link. I can take a shot at this to test this out, but 
> changes will be very extensive - is this something we could get into 2.1 
> still?
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18825) Eliminate duplicate links in SparkR API doc index

2017-05-21 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018948#comment-16018948
 ] 

Maciej Szymkiewicz commented on SPARK-18825:


By all means. I created a PR with one possible solution to this problem. 

> Eliminate duplicate links in SparkR API doc index
> -
>
> Key: SPARK-18825
> URL: https://issues.apache.org/jira/browse/SPARK-18825
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> The SparkR API docs contain many duplicate links with suffixes {{-method}} or 
> {{-class}} in the index . E.g., {{atan}} and {{atan-method}} link to the same 
> doc.
> Copying from [~felixcheung] in [SPARK-18332]:
> {quote}
> They are because of the
> {{@ aliases}}
> tags. I think we are adding them because CRAN checks require them to match 
> the specific format - [~shivaram] would you know?
> I am pretty sure they are double-listed because in addition to aliases we 
> also have
> {{@ rdname}}
> which automatically generate the links as well.
> I suspect if we change all the rdname to match the string in aliases then 
> there will be one link. I can take a shot at this to test this out, but 
> changes will be very extensive - is this something we could get into 2.1 
> still?
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20792) Support same timeout operations in mapGroupsWithState function in batch queries as in streaming queries

2017-05-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20792.
--
Resolution: Fixed

> Support same timeout operations in mapGroupsWithState function in batch 
> queries as in streaming queries
> ---
>
> Key: SPARK-20792
> URL: https://issues.apache.org/jira/browse/SPARK-20792
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> Currently, in the batch queries, timeout is disabled (i.e. 
> GroupStateTimeout.NoTimeout) which means any GroupState.setTimeout*** 
> operation would throw UnsupportedOperationException. This makes it weird when 
> converting a streaming query into a batch query by changing the input DF from 
> streaming to a batch DF. If the timeout was enabled and used, then the batch 
> query will start throwing UnsupportedOperationException.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20799) Unable to infer schema for ORC on reading ORC from S3

2017-05-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20799:
-
Component/s: (was: Spark Core)
 SQL

> Unable to infer schema for ORC on reading ORC from S3
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Jork Zijlstra
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
> It must be specified manually.{code}
> Combining the following factors will cause it:
> - Use S3
> - Use format ORC
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in PartitioningAwareFileIndex.allFiles():
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key, while 
> the qualifiedPath contains the path WITH credentials.
> As a result, leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, 
> so no data is read and the schema cannot be inferred.
> Spark does log the warning "S3xLoginHelper:90 - The Filesystem URI contains 
> login details. This is insecure and may be unsupported in future.", but that 
> warning should not mean that reading stops working.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}
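
A minimal sketch of that workaround in context (the bucket and path below are placeholders, and reading the credentials from environment variables is just for illustration):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-from-s3")
  .config("spark.hadoop.fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

// The path no longer embeds credentials, so the qualified path matches the
// file index key and schema inference succeeds.
val df = spark.read.orc("s3n://some-bucket/path/to/orc-data")
df.printSchema()
{code}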



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20347) Provide AsyncRDDActions in Python

2017-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018985#comment-16018985
 ] 

Apache Spark commented on SPARK-20347:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/18052

> Provide AsyncRDDActions in Python
> -
>
> Key: SPARK-20347
> URL: https://issues.apache.org/jira/browse/SPARK-20347
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: holdenk
>Priority: Minor
>
> In core Spark, AsyncRDDActions allows people to perform non-blocking RDD 
> actions. In Python, where threading is a bit more involved, there could be 
> value in exposing this; the easiest way might involve using the Py4J callback 
> server on the driver.
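
For reference, a small sketch of the existing Scala-side AsyncRDDActions that this issue would mirror in Python (assumes an active SparkContext `sc`):

{code}
import scala.concurrent.Await
import scala.concurrent.duration.Duration

// countAsync returns a FutureAction instead of blocking until the job finishes.
val pending = sc.parallelize(1 to 1000000).countAsync()

// The driver thread is free to do other work while the job runs, and blocks
// only when the result is actually needed.
val total = Await.result(pending, Duration.Inf)
println(s"count = $total")
{code}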



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20347) Provide AsyncRDDActions in Python

2017-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20347:


Assignee: (was: Apache Spark)

> Provide AsyncRDDActions in Python
> -
>
> Key: SPARK-20347
> URL: https://issues.apache.org/jira/browse/SPARK-20347
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: holdenk
>Priority: Minor
>
> In core Spark, AsyncRDDActions allows people to perform non-blocking RDD 
> actions. In Python, where threading is a bit more involved, there could be 
> value in exposing this; the easiest way might involve using the Py4J callback 
> server on the driver.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20347) Provide AsyncRDDActions in Python

2017-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20347:


Assignee: Apache Spark

> Provide AsyncRDDActions in Python
> -
>
> Key: SPARK-20347
> URL: https://issues.apache.org/jira/browse/SPARK-20347
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Minor
>
> In core Spark, AsyncRDDActions allows people to perform non-blocking RDD 
> actions. In Python, where threading is a bit more involved, there could be 
> value in exposing this; the easiest way might involve using the Py4J callback 
> server on the driver.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20736) PySpark StringIndexer supports StringOrderType

2017-05-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20736.
--
  Resolution: Fixed
Assignee: Wayne Zhang
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> PySpark StringIndexer supports StringOrderType
> --
>
> Key: SPARK-20736
> URL: https://issues.apache.org/jira/browse/SPARK-20736
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
> Fix For: 2.3.0
>
>
> Port new support of StringOrderType to PySpark StringIndexer. 
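
For context, a hedged sketch of the Scala-side behavior being ported (assumes a DataFrame `df` with a string column named `category`; the order values in the comment are the ones documented for the Scala stringOrderType param):

{code}
import org.apache.spark.ml.feature.StringIndexer

// stringOrderType controls how labels are ordered before indices are assigned:
// frequencyDesc (default), frequencyAsc, alphabetDesc, alphabetAsc.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setStringOrderType("alphabetAsc")

val indexed = indexer.fit(df).transform(df)
indexed.show()
{code}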



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20763) The `month` and `day` functions return a value which is not what we expected

2017-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019061#comment-16019061
 ] 

Apache Spark commented on SPARK-20763:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/18053

> The `month` and `day` functions return a value which is not what we expected
> --
>
> Key: SPARK-20763
> URL: https://issues.apache.org/jira/browse/SPARK-20763
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.2.0
>
>
> spark-sql>select month("1582-09-28");
> spark-sql>10
> For this case, the expected result is 9, but it is 10.
> spark-sql>select day("1582-04-18");
> spark-sql>28
> For this case, the expected result is 18, but it is 28.
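
For comparison, a small sketch of the expected values using java.time, which interprets these dates on a proleptic Gregorian calendar:

{code}
import java.time.LocalDate

// java.time parses the literals on a proleptic Gregorian calendar, so the
// components come back exactly as written.
println(LocalDate.parse("1582-09-28").getMonthValue)  // 9
println(LocalDate.parse("1582-04-18").getDayOfMonth)  // 18
{code}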



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20764) Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version

2017-05-21 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019096#comment-16019096
 ] 

Peng Meng commented on SPARK-20764:
---

Hi [~mlnick], are you working on this? If not, I can work on it.

> Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and 
> GLR - Python version
> 
>
> Key: SPARK-20764
> URL: https://issues.apache.org/jira/browse/SPARK-20764
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> SPARK-20097 exposed {{degreesOfFreedom}} in {{LinearRegressionSummary}} and 
> {{numInstances}} in {{GeneralizedLinearRegressionSummary}}. Python API should 
> be updated to reflect these changes.
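
For reference, a hedged Scala sketch of the two properties being mirrored (`training` is a placeholder DataFrame with the usual `label` and `features` columns):

{code}
import org.apache.spark.ml.regression.{GeneralizedLinearRegression, LinearRegression}

// degreesOfFreedom on LinearRegressionSummary (exposed by SPARK-20097).
val lrModel = new LinearRegression().fit(training)
println(lrModel.summary.degreesOfFreedom)

// numInstances on GeneralizedLinearRegressionSummary (exposed by SPARK-20097).
val glrModel = new GeneralizedLinearRegression().setFamily("gaussian").fit(training)
println(glrModel.summary.numInstances)
{code}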



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19256) Hive bucketing support

2017-05-21 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019109#comment-16019109
 ] 

Tejas Patil commented on SPARK-19256:
-

[~cloud_fan]: After SPARK-18243, `InsertIntoHiveTable` is a `RunnableCommand` 
and not a physical operator node. This hinders the ability to add constraints 
about the partitioning and sortedness of the insert operator's child node. I see 
SPARK-20703 is about refactoring that part of the code (but it is restricted to 
getting more metrics). Would it be OK if I just change `InsertIntoHiveTable` to 
include the changes I need and move on, OR should I wait for SPARK-20703 to land 
(which might take time)? Since you would be reviewing my PRs, I'd like to get 
your opinion on this.

> Hive bucketing support
> --
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Priority: Minor
>
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20832) Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs

2017-05-21 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-20832:
--

 Summary: Standalone master should explicitly inform drivers of 
worker deaths and invalidate external shuffle service outputs
 Key: SPARK-20832
 URL: https://issues.apache.org/jira/browse/SPARK-20832
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.0.0
Reporter: Josh Rosen


In SPARK-17370 (a patch authored by [~ekhliang] and reviewed by me), we added 
logic to the DAGScheduler to mark external shuffle service instances as 
unavailable upon task failure when the task failure reason was "SlaveLost" and 
this was known to be caused by worker death. If the Spark Master discovered 
that a worker was dead then it would notify any drivers with executors on those 
workers to mark those executors as dead. The linked patch simply piggybacked on 
this logic to have the executor death notification also imply worker death and 
to have worker-death-caused-executor-death imply shuffle file loss.

However, there are modes of external shuffle service loss which this mechanism 
does not detect, leaving the system prone to race conditions. Consider the 
following:

* Spark standalone is configured to run an external shuffle service embedded in 
the Worker.
* Application has shuffle outputs and executors on Worker A.
* Stage depending on outputs of tasks that ran on Worker A starts.
* All executors on worker A are removed due to dying with exceptions, 
scaling-down via the dynamic allocation APIs, but _not_ due to worker death. 
Worker A is still healthy at this point.
* At this point the MapOutputTracker still records map output locations on 
Worker A's shuffle service. This is expected behavior. 
* Worker A dies at an instant where the application has no executors running on 
it.
* The Master knows that Worker A died but does not inform the driver (which had 
no executors on that worker at the time of its death).
* Some task from the running stage attempts to fetch map outputs from Worker A 
but these requests time out because Worker A's shuffle service isn't available.
* Due to other logic in the scheduler, these preventable FetchFailures don't 
wind up invalidating the now-unavailable map output locations (this is a 
distinct bug / behavior which I'll discuss in a separate JIRA ticket).
* This behavior leads to several unsuccessful stage reattempts and ultimately 
to a job failure.

A simple way to address this would be to have the Master explicitly notify 
drivers of all Worker deaths, even if those drivers don't currently have 
executors. The Spark Standalone scheduler backend can receive the explicit 
WorkerLost message and can bubble up the right calls to the task scheduler and 
DAGScheduler to invalidate map output locations from the now-dead external 
shuffle service.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20832) Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs

2017-05-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-20832:
---
Component/s: Deploy

> Standalone master should explicitly inform drivers of worker deaths and 
> invalidate external shuffle service outputs
> ---
>
> Key: SPARK-20832
> URL: https://issues.apache.org/jira/browse/SPARK-20832
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Scheduler
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>
> In SPARK-17370 (a patch authored by [~ekhliang] and reviewed by me), we added 
> logic to the DAGScheduler to mark external shuffle service instances as 
> unavailable upon task failure when the task failure reason was "SlaveLost" 
> and this was known to be caused by worker death. If the Spark Master 
> discovered that a worker was dead then it would notify any drivers with 
> executors on those workers to mark those executors as dead. The linked patch 
> simply piggybacked on this logic to have the executor death notification also 
> imply worker death and to have worker-death-caused-executor-death imply 
> shuffle file loss.
> However, there are modes of external shuffle service loss which this 
> mechanism does not detect, leaving the system prone to race conditions. 
> Consider the following:
> * Spark standalone is configured to run an external shuffle service embedded 
> in the Worker.
> * Application has shuffle outputs and executors on Worker A.
> * Stage depending on outputs of tasks that ran on Worker A starts.
> * All executors on worker A are removed due to dying with exceptions, 
> scaling-down via the dynamic allocation APIs, but _not_ due to worker death. 
> Worker A is still healthy at this point.
> * At this point the MapOutputTracker still records map output locations on 
> Worker A's shuffle service. This is expected behavior. 
> * Worker A dies at an instant where the application has no executors running 
> on it.
> * The Master knows that Worker A died but does not inform the driver (which 
> had no executors on that worker at the time of its death).
> * Some task from the running stage attempts to fetch map outputs from Worker 
> A but these requests time out because Worker A's shuffle service isn't 
> available.
> * Due to other logic in the scheduler, these preventable FetchFailures don't 
> wind up invalidating the now-unavailable map output locations (this is 
> a distinct bug / behavior which I'll discuss in a separate JIRA ticket).
> * This behavior leads to several unsuccessful stage reattempts and ultimately 
> to a job failure.
> A simple way to address this would be to have the Master explicitly notify 
> drivers of all Worker deaths, even if those drivers don't currently have 
> executors. The Spark Standalone scheduler backend can receive the explicit 
> WorkerLost message and can bubble up the right calls to the task scheduler 
> and DAGScheduler to invalidate map output locations from the now-dead 
> external shuffle service.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20832) Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs

2017-05-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-20832:
---
Description: 
In SPARK-17370 (a patch authored by [~ekhliang] and reviewed by me), we added 
logic to the DAGScheduler to mark external shuffle service instances as 
unavailable upon task failure when the task failure reason was "SlaveLost" and 
this was known to be caused by worker death. If the Spark Master discovered 
that a worker was dead then it would notify any drivers with executors on those 
workers to mark those executors as dead. The linked patch simply piggybacked on 
this logic to have the executor death notification also imply worker death and 
to have worker-death-caused-executor-death imply shuffle file loss.

However, there are modes of external shuffle service loss which this mechanism 
does not detect, leaving the system prone to race conditions. Consider the 
following:

* Spark standalone is configured to run an external shuffle service embedded in 
the Worker.
* Application has shuffle outputs and executors on Worker A.
* Stage depending on outputs of tasks that ran on Worker A starts.
* All executors on worker A are removed due to dying with exceptions, 
scaling-down via the dynamic allocation APIs, but _not_ due to worker death. 
Worker A is still healthy at this point.
* At this point the MapOutputTracker still records map output locations on 
Worker A's shuffle service. This is expected behavior. 
* Worker A dies at an instant where the application has no executors running on 
it.
* The Master knows that Worker A died but does not inform the driver (which had 
no executors on that worker at the time of its death).
* Some task from the running stage attempts to fetch map outputs from Worker A 
but these requests time out because Worker A's shuffle service isn't available.
* Due to other logic in the scheduler, these preventable FetchFailures don't 
wind up invalidating the now-unavailable map output locations (this is a 
distinct bug / behavior which I'll discuss in a separate JIRA ticket).
* This behavior leads to several unsuccessful stage reattempts and ultimately 
to a job failure.

A simple way to address this would be to have the Master explicitly notify 
drivers of all Worker deaths, even if those drivers don't currently have 
executors. The Spark Standalone scheduler backend can receive the explicit 
WorkerLost message and can bubble up the right calls to the task scheduler and 
DAGScheduler to invalidate map output locations from the now-dead external 
shuffle service.

This relates to SPARK-20115 in the sense that both tickets aim to address 
issues where the external shuffle service is unavailable. The key difference is 
the mechanism for detection: SPARK-20115 marks the external shuffle service as 
unavailable whenever any fetch failure occurs from it, whereas the proposal 
here relies on more explicit signals. This JIRA ticket's proposal is scoped 
only to Spark Standalone mode. As a compromise, we might be able to consider 
"all of a single shuffle's outputs lost on a single external shuffle service" 
following a fetch failure (to be discussed in a separate JIRA). 

  was:
In SPARK-17370 (a patch authored by [~ekhliang] and reviewed by me), we added 
logic to the DAGScheduler to mark external shuffle service instances as 
unavailable upon task failure when the task failure reason was "SlaveLost" and 
this was known to be caused by worker death. If the Spark Master discovered 
that a worker was dead then it would notify any drivers with executors on those 
workers to mark those executors as dead. The linked patch simply piggybacked on 
this logic to have the executor death notification also imply worker death and 
to have worker-death-caused-executor-death imply shuffle file loss.

However, there are modes of external shuffle service loss which this mechanism 
does not detect, leaving the system prone to race conditions. Consider the 
following:

* Spark standalone is configured to run an external shuffle service embedded in 
the Worker.
* Application has shuffle outputs and executors on Worker A.
* Stage depending on outputs of tasks that ran on Worker A starts.
* All executors on worker A are removed due to dying with exceptions, 
scaling-down via the dynamic allocation APIs, but _not_ due to worker death. 
Worker A is still healthy at this point.
* At this point the MapOutputTracker still records map output locations on 
Worker A's shuffle service. This is expected behavior. 
* Worker A dies at an instant where the application has no executors running on 
it.
* The Master knows that Worker A died but does not inform the driver (which had 
no executors on that worker at the time of its death).
* Some task from the running stage attempts to fetch map outputs from Worker A 
but these requests time out because Worker A's shuffle service isn't available.
* Due to other logic in the scheduler, these preventable FetchFailures don't 
wind up invalidating the now-unavailable map output locations (this is a 
distinct bug / behavior which I'll discuss in a separate JIRA ticket).
* This behavior leads to several unsuccessful stage reattempts and ultimately 
to a job failure.

A simple way to address this would be to have the Master explicitly notify 
drivers of all Worker deaths, even if those drivers don't currently have 
executors. The Spark Standalone scheduler backend can receive the explicit 
WorkerLost message and can bubble up the right calls to the task scheduler and 
DAGScheduler to invalidate map output locations from the now-dead external 
shuffle service.

[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures

2017-05-21 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019123#comment-16019123
 ] 

Josh Rosen commented on SPARK-20178:


Looking over a few of the tickets linked to this fetch failure handling 
umbrella, I've noticed that there is a commonality in several linked JIRAs 
where folks are proposing to treat a single fetch failure from a node as though 
all outputs on that node were lost. While this is beneficial for avoiding the 
behavior where we keep repeatedly trying to refetch from a malfunctioning node 
or an external shuffle service which has disappeared, it may go too far in some 
situations and can cause unnecessary recomputations. For example, in a 
multi-user multi-job environment there could be a high cost to a false-positive 
where you mark a healthy block manager/shuffle service as unavailable following 
a single FetchFailure: this takes a failure which might be isolated to a single 
stage and promotes it into a wider failure that can impact other concurrently 
running stages (or which can destroy the ability to leverage the implicit 
caching of shuffle outputs across job runs).

To work around this problem, it looks like there are several proposals (but not 
PRs yet) for more complex approaches which attempt to infer whether a fetch 
failure indicates complete unavailability by keeping statistics on the number 
of fetch failures attributed to each node. The idea here is very similar to 
executor blacklisting, except applied to output locations. This is a good idea 
for the longer term because it can help to mitigate against nodes which 
silently corrupt most data written to disk (a failure mode we won't tolerate 
well today), but I don't think it's the right fix for the immediate issue being 
discussed in this ticket: these proposals will require significant amounts of 
new bookeeping logic to implement (which is hard to do efficiently and without 
causing memory leaks / perf. issues) and involve threshold-based detection 
logic which can require tuning to get correct.

As a compromise, I would like to propose a slightly weaker version of 
SPARK-20115 and SPARK-19753: when the DAGScheduler is notified of a 
FetchFailure from a node then mark _that shuffle's output locations on that 
node_ as unavailable (rather than all shuffles' outputs on that node). The 
rationale behind this is that the FetchFailure is already going to cause 
recomputation of that shuffle and the likelihood of the FetchFailure being a 
transient failure is relatively small: tasks already have internal retries when 
fetching (see both RetryingBlockFetcher and [~davies]'s patch for retrying 
within the task when small fetched shuffle blocks are determined to be 
corrupt), so if a task fails with a FetchFailure then it seems likely that the 
actual output that we tried to fetch is unavailable or corrupt. 

I think that this proposal should be simple to implement (and backport 
(optionally in a feature-flagged manner)) and hopefully won't be controversial 
because it's much more limited in the scope of the extra inferences it draws 
from FetchFailures. It also does not preclude the other proposals from being 
implemented later.

Feedback on this is very welcome. If there's support then I'd like to take a 
shot at implementing it.

> Improve Scheduler fetch failures
> 
>
> Key: SPARK-20178
> URL: https://issues.apache.org/jira/browse/SPARK-20178
> Project: Spark
>  Issue Type: Epic
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> We have been having a lot of discussions around improving the handling of 
> fetch failures. There are 4 JIRAs currently related to this.
> We should try to get a list of things we want to improve and come up with one 
> cohesive design.
> SPARK-20163,  SPARK-20091,  SPARK-14649 , and SPARK-19753
> I will put my initial thoughts in a follow on comment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20763) The `month` and `day` functions return a value which is not what we expected

2017-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019130#comment-16019130
 ] 

Apache Spark commented on SPARK-20763:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/18054

> The `month` and `day` functions return a value which is not what we expected
> --
>
> Key: SPARK-20763
> URL: https://issues.apache.org/jira/browse/SPARK-20763
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.2.0
>
>
> spark-sql>select month("1582-09-28");
> spark-sql>10
> For this case, the expected result is 9, but it is 10.
> spark-sql>select day("1582-04-18");
> spark-sql>28
> For this case, the expected result is 18, but it is 28.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20786) Improve how ceil and floor handle values which are not expected

2017-05-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-20786:
---

Assignee: caoxuewen

> Improve how ceil and floor handle values which are not expected
> -
>
> Key: SPARK-20786
> URL: https://issues.apache.org/jira/browse/SPARK-20786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Assignee: caoxuewen
> Fix For: 2.3.0
>
>
> spark-sql>SELECT ceil(1234567890123456);
> 1234567890123456
> spark-sql>SELECT ceil(12345678901234567);
> 12345678901234568
> spark-sql>SELECT ceil(123456789012345678);
> 123456789012345680
> When the length of the getText value is greater than 16 digits, the 
> long-to-double conversion loses precision.
> MySQL, however, handles these values correctly:
> mysql> SELECT ceil(1234567890123456);
> +------------------------+
> | ceil(1234567890123456) |
> +------------------------+
> |       1234567890123456 |
> +------------------------+
> 1 row in set (0.00 sec)
> mysql> SELECT ceil(12345678901234567);
> +-------------------------+
> | ceil(12345678901234567) |
> +-------------------------+
> |       12345678901234567 |
> +-------------------------+
> 1 row in set (0.00 sec)
> mysql> SELECT ceil(123456789012345678);
> +--------------------------+
> | ceil(123456789012345678) |
> +--------------------------+
> |       123456789012345678 |
> +--------------------------+
> 1 row in set (0.00 sec)
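
A small Scala sketch of the underlying issue: a Long with more than about 16 significant decimal digits no longer fits in the 53-bit mantissa of a Double, so the long-to-double round trip changes the value:

{code}
// 16 digits still fit in a Double mantissa and round-trip exactly:
println(math.ceil(1234567890123456L.toDouble).toLong)    // 1234567890123456
// 17 digits exceed the 53-bit mantissa, so the value is rounded:
println(math.ceil(12345678901234567L.toDouble).toLong)   // 12345678901234568
// 18 digits drift even further:
println(math.ceil(123456789012345678L.toDouble).toLong)  // 123456789012345680
{code}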



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20786) Improve how ceil and floor handle values which are not expected

2017-05-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20786.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Improve how ceil and floor handle values which are not expected
> -
>
> Key: SPARK-20786
> URL: https://issues.apache.org/jira/browse/SPARK-20786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Assignee: caoxuewen
> Fix For: 2.3.0
>
>
> spark-sql>SELECT ceil(1234567890123456);
> 1234567890123456
> spark-sql>SELECT ceil(12345678901234567);
> 12345678901234568
> spark-sql>SELECT ceil(123456789012345678);
> 123456789012345680
> When the length of the getText value is greater than 16 digits, the 
> long-to-double conversion loses precision.
> MySQL, however, handles these values correctly:
> mysql> SELECT ceil(1234567890123456);
> +------------------------+
> | ceil(1234567890123456) |
> +------------------------+
> |       1234567890123456 |
> +------------------------+
> 1 row in set (0.00 sec)
> mysql> SELECT ceil(12345678901234567);
> +-------------------------+
> | ceil(12345678901234567) |
> +-------------------------+
> |       12345678901234567 |
> +-------------------------+
> 1 row in set (0.00 sec)
> mysql> SELECT ceil(123456789012345678);
> +--------------------------+
> | ceil(123456789012345678) |
> +--------------------------+
> |       123456789012345678 |
> +--------------------------+
> 1 row in set (0.00 sec)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20833) it seems to be a part of Android TimSort class rather than a port of .. in the first line of TimSort.java comments

2017-05-21 Thread tuplemoon (JIRA)
tuplemoon created SPARK-20833:
-

 Summary: it seems to be a part of Android TimSort class rather 
than a port of .. in the first line of TimSort.java comments
 Key: SPARK-20833
 URL: https://issues.apache.org/jira/browse/SPARK-20833
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.1.1
Reporter: tuplemoon


it seems to be a part of Android TimSort class rather than a port of .. in the 
first line of TimSort.java comments





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20833) it seems to be a part of Android TimSort class rather than a port of .. in the first line of TimSort.java comments

2017-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019169#comment-16019169
 ] 

Apache Spark commented on SPARK-20833:
--

User 'tuplemoon' has created a pull request for this issue:
https://github.com/apache/spark/pull/18056

> it seems to be a part of Android TimSort class rather than a port of .. in 
> the first line of TimSort.java comments
> --
>
> Key: SPARK-20833
> URL: https://issues.apache.org/jira/browse/SPARK-20833
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: tuplemoon
>
> it seems to be a part of Android TimSort class rather than a port of .. in 
> the first line of TimSort.java comments



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20833) it seems to be a part of Android TimSort class rather than a port of .. in the first line of TimSort.java comments

2017-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20833:


Assignee: Apache Spark

> it seems to be a part of Android TimSort class rather than a port of .. in 
> the first line of TimSort.java comments
> --
>
> Key: SPARK-20833
> URL: https://issues.apache.org/jira/browse/SPARK-20833
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: tuplemoon
>Assignee: Apache Spark
>
> it seems to be a part of Android TimSort class rather than a port of .. in 
> the first line of TimSort.java comments



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20833) it seems to be a part of Android TimSort class rather than a port of .. in the first line of TimSort.java comments

2017-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20833:


Assignee: (was: Apache Spark)

> it seems to be a part of Android TimSort class rather than a port of .. in 
> the first line of TimSort.java comments
> --
>
> Key: SPARK-20833
> URL: https://issues.apache.org/jira/browse/SPARK-20833
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: tuplemoon
>
> it seems to be a part of Android TimSort class rather than a port of .. in 
> the first line of TimSort.java comments



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20764) Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version

2017-05-21 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019186#comment-16019186
 ] 

Nick Pentreath commented on SPARK-20764:


Please go ahead

> Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and 
> GLR - Python version
> 
>
> Key: SPARK-20764
> URL: https://issues.apache.org/jira/browse/SPARK-20764
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> SPARK-20097 exposed {{degreesOfFreedom}} in {{LinearRegressionSummary}} and 
> {{numInstances}} in {{GeneralizedLinearRegressionSummary}}. Python API should 
> be updated to reflect these changes.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org