[jira] [Created] (SPARK-24749) Cannot filter array with named_struct

2018-07-05 Thread pin_zhang (JIRA)
pin_zhang created SPARK-24749:
-

 Summary: Cannot filter array with named_struct
 Key: SPARK-24749
 URL: https://issues.apache.org/jira/browse/SPARK-24749
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: pin_zhang


1. Create Table

create table arr__int( arr array<struct<a:int>> ) stored as parquet;

2. Insert data

insert into arr__int values( array(named_struct('a', 1)));

3. Filter with struct data

select * from arr__int where array_contains (arr, named_struct('a', 1));
Error: org.apache.spark.sql.AnalysisException: cannot resolve 
'array_contains(arr__int.`arr`, named_struct('a', 1))' due to data type 
mismatch: Arguments must be an array followed by a value of same type as the 
array members; line 1 pos 29;
'Project [*]
+- 'Filter array_contains(arr#6, named_struct(a, 1))
 +- SubqueryAlias arr__int
 +- Relation[arr#6] parquet (state=,code=0)

Likely cause: the nullable flag in the schema produced by named_struct is always false, so it never matches the table's array element type.
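For illustration (not part of the original report), a spark-shell sketch on Spark 2.3.x that compares the two types involved and should expose the nullability difference:

{code}
// Sketch: compare the table's array element type with the type produced by named_struct.
import org.apache.spark.sql.types.ArrayType

val arrElemType = spark.table("arr__int").schema("arr").dataType
  .asInstanceOf[ArrayType].elementType
val structType = spark.range(1).selectExpr("named_struct('a', 1) AS s").schema("s").dataType

println(arrElemType) // expected: StructType(StructField(a,IntegerType,true))  -- nullable field
println(structType)  // expected: StructType(StructField(a,IntegerType,false)) -- non-nullable field
{code}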

 






[jira] [Commented] (SPARK-24746) AWS S3 301 Moved Permanently error message even after setting fs.s3a.endpoint for bucket in Mumbai region.

2018-07-05 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534443#comment-16534443
 ] 

Liang-Chi Hsieh commented on SPARK-24746:
-

Maybe related: 
https://stackoverflow.com/questions/44284451/301-redirect-when-trying-to-access-aws-mumbai-s3-server
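The linked answer points at the regional endpoint and V4-signing setup. A minimal sketch (not a confirmed fix; bucket and path are hypothetical) of setting the endpoint via spark.hadoop.* at session build time, so it is in the Hadoop configuration before any S3A filesystem instance is created:

{code}
// Sketch only: configure the ap-south-1 endpoint when building the session.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-ap-south-1-write")
  .config("spark.hadoop.fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")
  .getOrCreate()

spark.range(10).write.mode("overwrite").parquet("s3a://bucket/logs")  // hypothetical bucket
{code}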


> AWS S3 301 Moved Permanently error message even after setting fs.s3a.endpoint 
> for bucket in Mumbai region.
> --
>
> Key: SPARK-24746
> URL: https://issues.apache.org/jira/browse/SPARK-24746
> Project: Spark
>  Issue Type: Question
>  Components: Kubernetes, PySpark
>Affects Versions: 2.3.1
>Reporter: Kushagra Singh
>Priority: Major
>
> I am trying to write Parquet data to an S3 bucket in the ap-south-1 (Mumbai) region 
> but keep getting 301 errors even though I have specified the correct region.
> {code}
> sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", 
> "s3.ap-south-1.amazonaws.com")
> log.write.mode("overwrite").parquet("s3a://bucket/logs")
> {code}
> s3a related config in spark-defaults:
> {code:java}
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.validateOutputSpecs false
> spark.executor.extraJavaOptions -Dcom.amazonaws.services.s3.enableV4=true
> spark.driver.extraJavaOptions -Dcom.amazonaws.services.s3.enableV4=true
> spark.hadoop.fs.s3a.connection.maximum 100
> {code}
> Using _spark 2.3.1_ and _hadoop 2.7_ with _aws-java-sdk-1.7.4_ and 
> _hadoop-aws-2.7.3_
> Stacktrace:
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling o71.parquet.
> : org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:547)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 
> 301, AWS Service: Amazon S3, AWS Request ID: 0A48F0A6FD8AC8B5, AWS Error 
> Code: null, AWS Error Message: Moved Permanently, S3 Extended Request ID: 
> 

[jira] [Commented] (SPARK-24742) Field Metadata raises NullPointerException in hashCode method

2018-07-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534437#comment-16534437
 ] 

Apache Spark commented on SPARK-24742:
--

User 'kupferk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21722

> Field Metadata raises NullPointerException in hashCode method
> -
>
> Key: SPARK-24742
> URL: https://issues.apache.org/jira/browse/SPARK-24742
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Kaya Kupferschmidt
>Priority: Minor
>
> The class org.apache.spark.sql.types.Metadata has a method *putNull* for 
> storing null pointers as Metadata. Unfortunately the hashCode method throws a 
> NullPointerException when there are null values in a Metadata object, 
> rendering the *putNull* method useless.
> h2. How to Reproduce
> The following code will raise a NullPointerException
> {code}
> new MetadataBuilder().putNull("key").build().hashCode
> {code}
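For context, an illustration only (this is not the actual change in the pull request): any hashing over Metadata values needs a null-safe branch along these lines, otherwise entries created via putNull blow up hashCode:

{code}
// Illustrative helper, not the real patch: hash a metadata value without an NPE on null.
def nullSafeHash(value: Any): Int =
  if (value == null) 0 else value.hashCode()
{code}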






[jira] [Assigned] (SPARK-24742) Field Metadata raises NullPointerException in hashCode method

2018-07-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24742:


Assignee: Apache Spark

> Field Metadata raises NullPointerException in hashCode method
> -
>
> Key: SPARK-24742
> URL: https://issues.apache.org/jira/browse/SPARK-24742
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Kaya Kupferschmidt
>Assignee: Apache Spark
>Priority: Minor
>
> The class org.apache.spark.sql.types.Metadata has a method *putNull* for 
> storing null pointers as Metadata. Unfortunately the hashCode method throws a 
> NullPointerException when there are null values in a Metadata object, 
> rendering the *putNull* method useless.
> h2. How to Reproduce
> The following code will raise a NullPointerException
> {code}
> new MetadataBuilder().putNull("key").build().hashCode
> {code}






[jira] [Assigned] (SPARK-24742) Field Metadata raises NullPointerException in hashCode method

2018-07-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24742:


Assignee: (was: Apache Spark)

> Field Metadata raises NullPointerException in hashCode method
> -
>
> Key: SPARK-24742
> URL: https://issues.apache.org/jira/browse/SPARK-24742
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Kaya Kupferschmidt
>Priority: Minor
>
> The class org.apache.spark.sql.types.Metadata has a method *putNull* for 
> storing null pointers as Metadata. Unfortunately the hashCode method throws a 
> NullPointerException when there are null values in a Metadata object, 
> rendering the *putNull* method useless.
> h2. How to Reproduce
> The following code will raise a NullPointerException
> {code}
> new MetadataBuilder().putNull("key").build().hashCode
> {code}






[jira] [Comment Edited] (SPARK-24742) Field Metadata raises NullPointerException in hashCode method

2018-07-05 Thread Kaya Kupferschmidt (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533286#comment-16533286
 ] 

Kaya Kupferschmidt edited comment on SPARK-24742 at 7/6/18 5:15 AM:


A patch is provided as a pull request in 
https://github.com/apache/spark/pull/21722


was (Author: kupferk):
I will provide a patch later.

> Field Metadata raises NullPointerException in hashCode method
> -
>
> Key: SPARK-24742
> URL: https://issues.apache.org/jira/browse/SPARK-24742
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Kaya Kupferschmidt
>Priority: Minor
>
> The class org.apache.spark.sql.types.Metadata has a method *putNull* for 
> storing null pointers as Metadata. Unfortunately the hashCode method throws a 
> NullPointerException when there are null values in a Metadata object, 
> rendering the *putNull* method useless.
> h2. How to Reproduce
> The following code will raise a NullPointerException
> {code}
> new MetadataBuilder().putNull("key").build().hashCode
> {code}






[jira] [Resolved] (SPARK-24692) Improvement FilterPushdownBenchmark

2018-07-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24692.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21677
[https://github.com/apache/spark/pull/21677]

> Improvement FilterPushdownBenchmark
> ---
>
> Key: SPARK-24692
> URL: https://issues.apache.org/jira/browse/SPARK-24692
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>







[jira] [Assigned] (SPARK-24692) Improvement FilterPushdownBenchmark

2018-07-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24692:


Assignee: Yuming Wang

> Improvement FilterPushdownBenchmark
> ---
>
> Key: SPARK-24692
> URL: https://issues.apache.org/jira/browse/SPARK-24692
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>







[jira] [Resolved] (SPARK-24737) Type coercion between StructTypes.

2018-07-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24737.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21713
[https://github.com/apache/spark/pull/21713]

> Type coercion between StructTypes.
> --
>
> Key: SPARK-24737
> URL: https://issues.apache.org/jira/browse/SPARK-24737
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.0
>
>
> We can support type coercion between StructTypes where all the internal types 
> are compatible.
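As an illustration of the kind of coercion this enables (sketch with hypothetical data; the exact rules are defined by the pull request):

{code}
// Sketch: structs whose fields differ only by a widening (int vs bigint) should coerce
// to a common type, e.g. in a UNION, instead of failing analysis.
val widened = spark.sql(
  """SELECT named_struct('x', 1)  AS s
    |UNION ALL
    |SELECT named_struct('x', 2L) AS s""".stripMargin)
widened.printSchema()  // expected: s: struct<x: bigint>
{code}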






[jira] [Assigned] (SPARK-24737) Type coercion between StructTypes.

2018-07-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24737:


Assignee: Takuya Ueshin

> Type coercion between StructTypes.
> --
>
> Key: SPARK-24737
> URL: https://issues.apache.org/jira/browse/SPARK-24737
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.0
>
>
> We can support type coercion between StructTypes where all the internal types 
> are compatible.






[jira] [Commented] (SPARK-24748) Support for reporting custom metrics via Streaming Query Progress

2018-07-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534371#comment-16534371
 ] 

Apache Spark commented on SPARK-24748:
--

User 'arunmahadevan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21721

> Support for reporting custom metrics via Streaming Query Progress
> -
>
> Key: SPARK-24748
> URL: https://issues.apache.org/jira/browse/SPARK-24748
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Arun Mahadevan
>Priority: Major
>
> Currently the Structured Streaming sources and sinks do not have a way to 
> report custom metrics. Providing an option to report custom metrics and 
> making them available via Streaming Query progress would enable sources and 
> sinks to report custom progress information (e.g. the lag metrics for the Kafka source).






[jira] [Assigned] (SPARK-24748) Support for reporting custom metrics via Streaming Query Progress

2018-07-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24748:


Assignee: Apache Spark

> Support for reporting custom metrics via Streaming Query Progress
> -
>
> Key: SPARK-24748
> URL: https://issues.apache.org/jira/browse/SPARK-24748
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Arun Mahadevan
>Assignee: Apache Spark
>Priority: Major
>
> Currently the Structured Streaming sources and sinks do not have a way to 
> report custom metrics. Providing an option to report custom metrics and 
> making them available via Streaming Query progress would enable sources and 
> sinks to report custom progress information (e.g. the lag metrics for the Kafka source).






[jira] [Assigned] (SPARK-24748) Support for reporting custom metrics via Streaming Query Progress

2018-07-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24748:


Assignee: (was: Apache Spark)

> Support for reporting custom metrics via Streaming Query Progress
> -
>
> Key: SPARK-24748
> URL: https://issues.apache.org/jira/browse/SPARK-24748
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Arun Mahadevan
>Priority: Major
>
> Currently the Structured Streaming sources and sinks do not have a way to 
> report custom metrics. Providing an option to report custom metrics and 
> making them available via Streaming Query progress would enable sources and 
> sinks to report custom progress information (e.g. the lag metrics for the Kafka source).






[jira] [Created] (SPARK-24748) Support for reporting custom metrics via Streaming Query Progress

2018-07-05 Thread Arun Mahadevan (JIRA)
Arun Mahadevan created SPARK-24748:
--

 Summary: Support for reporting custom metrics via Streaming Query 
Progress
 Key: SPARK-24748
 URL: https://issues.apache.org/jira/browse/SPARK-24748
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.1
Reporter: Arun Mahadevan


Currently the Structured Streaming sources and sinks do not have a way to 
report custom metrics. Providing an option to report custom metrics and making 
them available via Streaming Query progress would enable sources and sinks to 
report custom progress information (e.g. the lag metrics for the Kafka source).
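For reference, a sketch of what the progress API exposes today (Spark 2.3.x; `query` is assumed to be an already-started StreamingQuery), which has no field for source-defined metrics such as Kafka consumer lag:

{code}
// Sketch: current StreamingQueryProgress contents -- batch stats, per-source offsets and
// rates, and a sink description, but no hook for custom source/sink metrics.
val progress = query.lastProgress
println(progress.prettyJson)
progress.sources.foreach(s => println(s.numInputRows))
println(progress.sink.description)
{code}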






[jira] [Commented] (SPARK-24164) Support column list as the pivot column in Pivot

2018-07-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534234#comment-16534234
 ] 

Apache Spark commented on SPARK-24164:
--

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/21720

> Support column list as the pivot column in Pivot
> 
>
> Key: SPARK-24164
> URL: https://issues.apache.org/jira/browse/SPARK-24164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Major
>
> This is part of a functionality extension to Pivot SQL support as SPARK-24035.
> Currently we only support a single column as the pivot column; support for a 
> column list as the pivot column would look like:
> {code:java}
> SELECT * FROM (
>   SELECT year, course, earnings FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR (course, year) IN (('dotNET', 2012), ('Java', 2013))
> );{code}






[jira] [Assigned] (SPARK-24164) Support column list as the pivot column in Pivot

2018-07-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24164:


Assignee: Apache Spark  (was: Maryann Xue)

> Support column list as the pivot column in Pivot
> 
>
> Key: SPARK-24164
> URL: https://issues.apache.org/jira/browse/SPARK-24164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Assignee: Apache Spark
>Priority: Major
>
> This is part of a functionality extension to Pivot SQL support as SPARK-24035.
> Currently we only support a single column as the pivot column; support for a 
> column list as the pivot column would look like:
> {code:java}
> SELECT * FROM (
>   SELECT year, course, earnings FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR (course, year) IN (('dotNET', 2012), ('Java', 2013))
> );{code}






[jira] [Assigned] (SPARK-24163) Support "ANY" or sub-query for Pivot "IN" clause

2018-07-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24163:


Assignee: (was: Apache Spark)

> Support "ANY" or sub-query for Pivot "IN" clause
> 
>
> Key: SPARK-24163
> URL: https://issues.apache.org/jira/browse/SPARK-24163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Priority: Major
>
> This is part of a functionality extension to Pivot SQL support as SPARK-24035.
> Currently, only literal values are allowed in the Pivot "IN" clause. To support 
> ANY or a sub-query in the "IN" clause (examples of which are provided below), 
> we need to enable evaluation of a sub-query before/during query analysis time.
> {code:java}
> SELECT * FROM (
>   SELECT year, course, earnings FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR course IN ANY
> );{code}
> {code:java}
> SELECT * FROM (
>   SELECT year, course, earnings FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR course IN (
> SELECT course FROM courses
> WHERE region = 'AZ'
>   )
> );
> {code}






[jira] [Assigned] (SPARK-24163) Support "ANY" or sub-query for Pivot "IN" clause

2018-07-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24163:


Assignee: Apache Spark

> Support "ANY" or sub-query for Pivot "IN" clause
> 
>
> Key: SPARK-24163
> URL: https://issues.apache.org/jira/browse/SPARK-24163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Assignee: Apache Spark
>Priority: Major
>
> This is part of a functionality extension to Pivot SQL support as SPARK-24035.
> Currently, only literal values are allowed in the Pivot "IN" clause. To support 
> ANY or a sub-query in the "IN" clause (examples of which are provided below), 
> we need to enable evaluation of a sub-query before/during query analysis time.
> {code:java}
> SELECT * FROM (
>   SELECT year, course, earnings FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR course IN ANY
> );{code}
> {code:java}
> SELECT * FROM (
>   SELECT year, course, earnings FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR course IN (
> SELECT course FROM courses
> WHERE region = 'AZ'
>   )
> );
> {code}






[jira] [Commented] (SPARK-24163) Support "ANY" or sub-query for Pivot "IN" clause

2018-07-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534233#comment-16534233
 ] 

Apache Spark commented on SPARK-24163:
--

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/21720

> Support "ANY" or sub-query for Pivot "IN" clause
> 
>
> Key: SPARK-24163
> URL: https://issues.apache.org/jira/browse/SPARK-24163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Priority: Major
>
> This is part of a functionality extension to Pivot SQL support as SPARK-24035.
> Currently, only literal values are allowed in the Pivot "IN" clause. To support 
> ANY or a sub-query in the "IN" clause (examples of which are provided below), 
> we need to enable evaluation of a sub-query before/during query analysis time.
> {code:java}
> SELECT * FROM (
>   SELECT year, course, earnings FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR course IN ANY
> );{code}
> {code:java}
> SELECT * FROM (
>   SELECT year, course, earnings FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR course IN (
> SELECT course FROM courses
> WHERE region = 'AZ'
>   )
> );
> {code}






[jira] [Assigned] (SPARK-24164) Support column list as the pivot column in Pivot

2018-07-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24164:


Assignee: Maryann Xue  (was: Apache Spark)

> Support column list as the pivot column in Pivot
> 
>
> Key: SPARK-24164
> URL: https://issues.apache.org/jira/browse/SPARK-24164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Major
>
> This is part of a functionality extension to Pivot SQL support as SPARK-24035.
> Currently we only support a single column as the pivot column; support for a 
> column list as the pivot column would look like:
> {code:java}
> SELECT * FROM (
>   SELECT year, course, earnings FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR (course, year) IN (('dotNET', 2012), ('Java', 2013))
> );{code}






[jira] [Assigned] (SPARK-24694) Integration tests pass only one app argument

2018-07-05 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-24694:
-

Assignee: Stavros Kontopoulos

> Integration tests pass only one app argument
> 
>
> Key: SPARK-24694
> URL: https://issues.apache.org/jira/browse/SPARK-24694
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Minor
> Fix For: 2.4.0
>
>
> I tried to add another test in the current suite which uses more than one 
> argument and it fails:
> + CMD=("$SPARK_HOME/bin/spark-submit" --conf 
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client 
> "$@")
> + exec /sbin/tini -s – /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=9.0.10.29 --deploy-mode client --properties-file 
> /opt/spark/conf/spark.properties --class 
> org.apache.spark.examples.DFSReadWriteTest spark-internal '/etc/resolv.conf 
> hdfs:///test-SGzsB'
>  2018-06-29 15:31:51 WARN Utils:66 - Kubernetes master URL uses HTTP instead 
> of HTTPS.
>  2018-06-29 15:31:52 WARN NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
>  args size: 1
>  Arg: /etc/resolv.conf hdfs:///test-SGzsB
>  DFS Read-Write Test
>  Usage: localFile dfsDir
>  localFile - (string) local file to use in test
> The reason is this line: 
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesTestComponents.scala#L109]
>  which adds all args as one element of the final array. But ProcessBuilder 
> will not split the args later on: 
> [https://github.com/apache/spark/blob/f6e6899a8b8af99cd06e84cae7c69e0fc35bc60a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/ProcessUtils.scala#L32]
>  
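To illustrate the joined-argument failure described above (plain Scala, hypothetical values):

{code}
// Sketch: each app argument must stay its own element in the command array; joining
// them with spaces produces a single argument, and ProcessBuilder never re-splits it.
val appArgs = Seq("/etc/resolv.conf", "hdfs:///test-dir")

val joined = Seq("spark-submit", "--class", "Example") :+ appArgs.mkString(" ") // one app arg
val split  = Seq("spark-submit", "--class", "Example") ++ appArgs               // two app args

println(joined.size) // 4
println(split.size)  // 5
{code}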






[jira] [Resolved] (SPARK-24694) Integration tests pass only one app argument

2018-07-05 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24694.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21672
[https://github.com/apache/spark/pull/21672]

> Integration tests pass only one app argument
> 
>
> Key: SPARK-24694
> URL: https://issues.apache.org/jira/browse/SPARK-24694
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Minor
> Fix For: 2.4.0
>
>
> I tried to add another test in the current suite which uses more than one 
> argument and it fails:
> + CMD=("$SPARK_HOME/bin/spark-submit" --conf 
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client 
> "$@")
> + exec /sbin/tini -s – /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=9.0.10.29 --deploy-mode client --properties-file 
> /opt/spark/conf/spark.properties --class 
> org.apache.spark.examples.DFSReadWriteTest spark-internal '/etc/resolv.conf 
> hdfs:///test-SGzsB'
>  2018-06-29 15:31:51 WARN Utils:66 - Kubernetes master URL uses HTTP instead 
> of HTTPS.
>  2018-06-29 15:31:52 WARN NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
>  args size: 1
>  Arg: /etc/resolv.conf hdfs:///test-SGzsB
>  DFS Read-Write Test
>  Usage: localFile dfsDir
>  localFile - (string) local file to use in test
> The reason is this line: 
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesTestComponents.scala#L109]
>  which adds all args as one element of the final array. But ProcessBuilder 
> will not split the args later on: 
> [https://github.com/apache/spark/blob/f6e6899a8b8af99cd06e84cae7c69e0fc35bc60a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/ProcessUtils.scala#L32]
>  






[jira] [Updated] (SPARK-24747) Make spark.ml.util.Instrumentation class more flexible

2018-07-05 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24747:
--
Shepherd: Xiangrui Meng

> Make spark.ml.util.Instrumentation class more flexible
> --
>
> Key: SPARK-24747
> URL: https://issues.apache.org/jira/browse/SPARK-24747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
>
> The Instrumentation class (which is an internal private class) is somewhat 
> limited by its current APIs. The class requires an estimator and dataset be 
> passed to the constructor which limits how it can be used. Furthermore, the 
> current APIs make it hard to intercept failures and record anything related 
> to those failures.
> The following changes could make the instrumentation class easier to work 
> with. All these changes are for private APIs and should not be visible to 
> users.
> {code}
> // New no-argument constructor:
> Instrumentation()
> // New api to log previous constructor arguments.
> logTrainingContext(estimator: Estimator[_], dataset: Dataset[_])
> logFailure(e: Throwable): Unit
> // Log success with no arguments
> logSuccess(): Unit
> // Log result model explicitly instead of passing to logSuccess
> logModel(model: Model[_]): Unit
> // On Companion object
> Instrumentation.instrumented[T](body: (Instrumentation => T)): T
> // The above API will allow us to write instrumented methods more clearly and 
> handle logging success and failure automatically:
> def someMethod(...): T = instrumented { instr =>
>   instr.logNamedValue(name, value)
>   // more code here
>   instr.logModel(model)
> }
> {code}






[jira] [Assigned] (SPARK-24747) Make spark.ml.util.Instrumentation class more flexible

2018-07-05 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24747:
-

Assignee: Bago Amirbekian

> Make spark.ml.util.Instrumentation class more flexible
> --
>
> Key: SPARK-24747
> URL: https://issues.apache.org/jira/browse/SPARK-24747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
>
> The Instrumentation class (which is an internal private class) is somewhat 
> limited by its current APIs. The class requires an estimator and dataset be 
> passed to the constructor which limits how it can be used. Furthermore, the 
> current APIs make it hard to intercept failures and record anything related 
> to those failures.
> The following changes could make the instrumentation class easier to work 
> with. All these changes are for private APIs and should not be visible to 
> users.
> {code}
> // New no-argument constructor:
> Instrumentation()
> // New api to log previous constructor arguments.
> logTrainingContext(estimator: Estimator[_], dataset: Dataset[_])
> logFailure(e: Throwable): Unit
> // Log success with no arguments
> logSuccess(): Unit
> // Log result model explicitly instead of passing to logSuccess
> logModel(model: Model[_]): Unit
> // On Companion object
> Instrumentation.instrumented[T](body: (Instrumentation => T)): T
> // The above API will allow us to write instrumented methods more clearly and 
> handle logging success and failure automatically:
> def someMethod(...): T = instrumented { instr =>
>   instr.logNamedValue(name, value)
>   // more code here
>   instr.logModel(model)
> }
> {code}






[jira] [Commented] (SPARK-24747) Make spark.ml.util.Instrumentation class more flexible

2018-07-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533998#comment-16533998
 ] 

Apache Spark commented on SPARK-24747:
--

User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/21719

> Make spark.ml.util.Instrumentation class more flexible
> --
>
> Key: SPARK-24747
> URL: https://issues.apache.org/jira/browse/SPARK-24747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Priority: Major
>
> The Instrumentation class (which is an internal private class) is somewhat 
> limited by its current APIs. The class requires an estimator and dataset be 
> passed to the constructor which limits how it can be used. Furthermore, the 
> current APIs make it hard to intercept failures and record anything related 
> to those failures.
> The following changes could make the instrumentation class easier to work 
> with. All these changes are for private APIs and should not be visible to 
> users.
> {code}
> // New no-argument constructor:
> Instrumentation()
> // New api to log previous constructor arguments.
> logTrainingContext(estimator: Estimator[_], dataset: Dataset[_])
> logFailure(e: Throwable): Unit
> // Log success with no arguments
> logSuccess(): Unit
> // Log result model explicitly instead of passing to logSuccess
> logModel(model: Model[_]): Unit
> // On Companion object
> Instrumentation.instrumented[T](body: (Instrumentation => T)): T
> // The above API will allow us to write instrumented methods more clearly and 
> handle logging success and failure automatically:
> def someMethod(...): T = instrumented { instr =>
>   instr.logNamedValue(name, value)
>   // more code here
>   instr.logModel(model)
> }
> {code}






[jira] [Assigned] (SPARK-24747) Make spark.ml.util.Instrumentation class more flexible

2018-07-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24747:


Assignee: (was: Apache Spark)

> Make spark.ml.util.Instrumentation class more flexible
> --
>
> Key: SPARK-24747
> URL: https://issues.apache.org/jira/browse/SPARK-24747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Priority: Major
>
> The Instrumentation class (which is an internal private class) is somewhat 
> limited by its current APIs. The class requires an estimator and dataset be 
> passed to the constructor which limits how it can be used. Furthermore, the 
> current APIs make it hard to intercept failures and record anything related 
> to those failures.
> The following changes could make the instrumentation class easier to work 
> with. All these changes are for private APIs and should not be visible to 
> users.
> {code}
> // New no-argument constructor:
> Instrumentation()
> // New api to log previous constructor arguments.
> logTrainingContext(estimator: Estimator[_], dataset: Dataset[_])
> logFailure(e: Throwable): Unit
> // Log success with no arguments
> logSuccess(): Unit
> // Log result model explicitly instead of passing to logSuccess
> logModel(model: Model[_]): Unit
> // On Companion object
> Instrumentation.instrumented[T](body: (Instrumentation => T)): T
> // The above API will allow us to write instrumented methods more clearly and 
> handle logging success and failure automatically:
> def someMethod(...): T = instrumented { instr =>
>   instr.logNamedValue(name, value)
>   // more code here
>   instr.logModel(model)
> }
> {code}






[jira] [Assigned] (SPARK-24747) Make spark.ml.util.Instrumentation class more flexible

2018-07-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24747:


Assignee: Apache Spark

> Make spark.ml.util.Instrumentation class more flexible
> --
>
> Key: SPARK-24747
> URL: https://issues.apache.org/jira/browse/SPARK-24747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Assignee: Apache Spark
>Priority: Major
>
> The Instrumentation class (which is an internal private class) is somewhat 
> limited by its current APIs. The class requires an estimator and dataset be 
> passed to the constructor which limits how it can be used. Furthermore, the 
> current APIs make it hard to intercept failures and record anything related 
> to those failures.
> The following changes could make the instrumentation class easier to work 
> with. All these changes are for private APIs and should not be visible to 
> users.
> {code}
> // New no-argument constructor:
> Instrumentation()
> // New api to log previous constructor arguments.
> logTrainingContext(estimator: Estimator[_], dataset: Dataset[_])
> logFailure(e: Throwable): Unit
> // Log success with no arguments
> logSuccess(): Unit
> // Log result model explicitly instead of passing to logSuccess
> logModel(model: Model[_]): Unit
> // On Companion object
> Instrumentation.instrumented[T](body: (Instrumentation => T)): T
> // The above API will allow us to write instrumented methods more clearly and 
> handle logging success and failure automatically:
> def someMethod(...): T = instrumented { instr =>
>   instr.logNamedValue(name, value)
>   // more code here
>   instr.logModel(model)
> }
> {code}






[jira] [Created] (SPARK-24747) Make spark.ml.util.Instrumentation class more flexible

2018-07-05 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-24747:
---

 Summary: Make spark.ml.util.Instrumentation class more flexible
 Key: SPARK-24747
 URL: https://issues.apache.org/jira/browse/SPARK-24747
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.1
Reporter: Bago Amirbekian


The Instrumentation class (which is an internal private class) is somewhat 
limited by its current APIs. The class requires an estimator and dataset be 
passed to the constructor which limits how it can be used. Furthermore, the 
current APIs make it hard to intercept failures and record anything related to 
those failures.

The following changes could make the instrumentation class easier to work with. 
All these changes are for private APIs and should not be visible to users.
{code}
// New no-argument constructor:
Instrumentation()

// New api to log previous constructor arguments.
logTrainingContext(estimator: Estimator[_], dataset: Dataset[_])

logFailure(e: Throwable): Unit

// Log success with no arguments
logSuccess(): Unit

// Log result model explicitly instead of passing to logSuccess
logModel(model: Model[_]): Unit

// On Companion object
Instrumentation.instrumented[T](body: (Instrumentation => T)): T

// The above API will allow us to write instrumented methods more clearly and 
handle logging success and failure automatically:
def someMethod(...): T = instrumented { instr =>
  instr.logNamedValue(name, value)
  // more code here
  instr.logModel(model)
}

{code}






[jira] [Commented] (SPARK-24438) Empty strings and null strings are written to the same partition

2018-07-05 Thread Mukul Murthy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533977#comment-16533977
 ] 

Mukul Murthy commented on SPARK-24438:
--

Are null and empty string both invalid partition values? I kind of dislike that 
it's causing the actual data to be changed, although it is minor, and as you 
guys said it's actually a Hive bug so I don't think it's straightforward to fix.

 

> Empty strings and null strings are written to the same partition
> 
>
> Key: SPARK-24438
> URL: https://issues.apache.org/jira/browse/SPARK-24438
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Mukul Murthy
>Priority: Major
>
> When you partition on a string column that has empty strings and nulls, they 
> are both written to the same default partition. When you read the data back, 
> all those values get read back as null.
> {code:java}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, 
> null))
> val schema = new StructType().add("a", IntegerType).add("b", StringType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> display(df) 
> => 
> a b
> 1 
> 2 
> 3 
> 4 hello
> 5 null
> df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
> val df2 = spark.read.load("/home/mukul/weird_test_data4")
> display(df2)
> => 
> a b
> 4 hello
> 3 null
> 2 null
> 1 null
> 5 null
> {code}
> Seems to affect multiple types of tables.






[jira] [Commented] (SPARK-24438) Empty strings and null strings are written to the same partition

2018-07-05 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533929#comment-16533929
 ] 

Dongjoon Hyun commented on SPARK-24438:
---

Yep. It works as designed for now. Shall we close this issue?

> Empty strings and null strings are written to the same partition
> 
>
> Key: SPARK-24438
> URL: https://issues.apache.org/jira/browse/SPARK-24438
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Mukul Murthy
>Priority: Major
>
> When you partition on a string column that has empty strings and nulls, they 
> are both written to the same default partition. When you read the data back, 
> all those values get read back as null.
> {code:java}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, 
> null))
> val schema = new StructType().add("a", IntegerType).add("b", StringType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> display(df) 
> => 
> a b
> 1 
> 2 
> 3 
> 4 hello
> 5 null
> df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
> val df2 = spark.read.load("/home/mukul/weird_test_data4")
> display(df2)
> => 
> a b
> 4 hello
> 3 null
> 2 null
> 1 null
> 5 null
> {code}
> Seems to affect multiple types of tables.






[jira] [Resolved] (SPARK-24675) Rename table: validate existence of new location

2018-07-05 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24675.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 2.4.0

> Rename table: validate existence of new location
> 
>
> Key: SPARK-24675
> URL: https://issues.apache.org/jira/browse/SPARK-24675
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> If a table is renamed to an existing location, the data won't show up.
>  
> scala>  Seq("hello").toDF("a").write.format("parquet").saveAsTable("t")
> scala> sql("select * from t").show()
> +-+
> |    a|
> +-+
> |hello|
> +-+ 
> scala> sql("alter table t rename to test")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from test").show()
> +---+
> |  a|
> +---+
> +---+
>  
> In Hive, if the new location exists, the renaming fails even if the location 
> is empty.
> We should have the same validation in catalog, in case of unexpected bugs.
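A sketch of the kind of pre-check being proposed (illustration only, not the merged patch; the path below is hypothetical):

{code}
// Sketch: before renaming a table, verify the destination location does not already exist.
import org.apache.hadoop.fs.Path

val newLocation = new Path("/user/hive/warehouse/test")  // hypothetical destination
val fs = newLocation.getFileSystem(spark.sparkContext.hadoopConfiguration)
require(!fs.exists(newLocation),
  s"Destination $newLocation already exists; refusing to rename")
{code}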






[jira] [Created] (SPARK-24746) AWS S3 301 Moved Permanently error message even after setting fs.s3a.endpoint for bucket in Mumbai region.

2018-07-05 Thread Kushagra Singh (JIRA)
Kushagra Singh created SPARK-24746:
--

 Summary: AWS S3 301 Moved Permanently error message even after 
setting fs.s3a.endpoint for bucket in Mumbai region.
 Key: SPARK-24746
 URL: https://issues.apache.org/jira/browse/SPARK-24746
 Project: Spark
  Issue Type: Question
  Components: Kubernetes, PySpark
Affects Versions: 2.3.1
Reporter: Kushagra Singh


I am trying to write Parquet data to an S3 bucket in the ap-south-1 (Mumbai) region 
but keep getting 301 errors even though I have specified the correct region.
{code}
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", 
"s3.ap-south-1.amazonaws.com")
log.write.mode("overwrite").parquet("s3a://bucket/logs")
{code}
s3a related config in spark-defaults:
{code:java}
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.validateOutputSpecs false
spark.executor.extraJavaOptions -Dcom.amazonaws.services.s3.enableV4=true
spark.driver.extraJavaOptions -Dcom.amazonaws.services.s3.enableV4=true
spark.hadoop.fs.s3a.connection.maximum 100
{code}
Using _spark 2.3.1_ and _hadoop 2.7_ with _aws-java-sdk-1.7.4_ and 
_hadoop-aws-2.7.3_

Stacktrace:
{code:java}
py4j.protocol.Py4JJavaError: An error occurred while calling o71.parquet.
: org.apache.spark.SparkException: Job aborted.
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
at 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:547)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 301, 
AWS Service: Amazon S3, AWS Request ID: 0A48F0A6FD8AC8B5, AWS Error Code: null, 
AWS Error Message: Moved Permanently, S3 Extended Request ID: 
lPmrY0rkTFpMASMjvFaDTbCPfTgX+PatF25gmvaSrNjCaJk/ljuA/TwyY2d4M/FNT1kiW6z6d5E=
at 
com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at 
com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at 
com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at 
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at 
com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
at 

[jira] [Resolved] (SPARK-24743) Update the JavaDirectKafkaWordCount example to support the new API of Kafka

2018-07-05 Thread Cody Koeninger (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger resolved SPARK-24743.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21717
[https://github.com/apache/spark/pull/21717]

> Update the JavaDirectKafkaWordCount example to support the new API of Kafka
> ---
>
> Key: SPARK-24743
> URL: https://issues.apache.org/jira/browse/SPARK-24743
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.3.1
>Reporter: luochuan
>Assignee: luochuan
>Priority: Minor
> Fix For: 2.4.0
>
>
> When I ran the example JavaDirectKafkaWordCount as follows:
> ./bin/spark-submit --class 
> org.apache.spark.examples.streaming.JavaDirectKafkaWordCount --master local 
> --jars /somepath/spark-streaming-kafka-0-10-assembly_2.11-2.3.1.jar 
> examples/jars/spark-examples_2.11-2.3.1.jar kafka-broker:port topic
> Then an error occurred: 
> 'org.apache.kafka.common.config.ConfigException: Missing required 
> configuration "bootstrap.servers" which has no default value'. So I looked 
> into the code and found that the JavaDirectKafkaWordCount example is missing 
> some required Kafka configs.
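For reference, a sketch of the Kafka parameters such an example needs (Scala; broker address and group id are hypothetical). bootstrap.servers has no default, hence the ConfigException above:

{code}
// Sketch: minimal consumer parameters for the spark-streaming-kafka-0-10 direct stream.
import org.apache.kafka.common.serialization.StringDeserializer

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafka-broker:9092",          // hypothetical broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",                       // hypothetical group id
  "auto.offset.reset" -> "latest"
)
{code}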






[jira] [Assigned] (SPARK-24743) Update the JavaDirectKafkaWordCount example to support the new API of Kafka

2018-07-05 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-24743:
-

Assignee: luochuan
Priority: Minor  (was: Major)

> Update the JavaDirectKafkaWordCount example to support the new API of Kafka
> ---
>
> Key: SPARK-24743
> URL: https://issues.apache.org/jira/browse/SPARK-24743
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.3.1
>Reporter: luochuan
>Assignee: luochuan
>Priority: Minor
>
> When I ran the example JavaDirectKafkaWordCount as follows:
> ./bin/spark-submit --class 
> org.apache.spark.examples.streaming.JavaDirectKafkaWordCount --master local 
> --jars /somepath/spark-streaming-kafka-0-10-assembly_2.11-2.3.1.jar 
> examples/jars/spark-examples_2.11-2.3.1.jar kafka-broker:port topic
> Then an error occurred: 
> 'org.apache.kafka.common.config.ConfigException: Missing required 
> configuration "bootstrap.servers" which has no default value'. So I looked 
> into the code and found that the JavaDirectKafkaWordCount example is missing 
> some required Kafka configs.






[jira] [Assigned] (SPARK-24711) Integration tests will not work with exclude/include tags

2018-07-05 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-24711:
-

Assignee: Stavros Kontopoulos

> Integration tests will not work with exclude/include tags
> -
>
> Key: SPARK-24711
> URL: https://issues.apache.org/jira/browse/SPARK-24711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Minor
> Fix For: 2.4.0
>
>
> I tried to exclude some tests when adding mine and I got errors of the form:
> [INFO] BUILD FAILURE
> [INFO] 
> 
>  [INFO] Total time: 6.798 s
>  [INFO] Finished at: 2018-07-01T18:34:13+03:00
>  [INFO] Final Memory: 36M/652M
>  [INFO] 
> 
>  [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.20.1:test (default-test) on 
> project spark-kubernetes-integration-tests_2.11: There are test failures.
>  [ERROR] 
>  [ERROR] Please refer to 
> /home/stavros/Desktop/workspace/OSS/spark/resource-managers/kubernetes/integration-tests/target/surefire-reports
>  for the individual test results.
>  [ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, 
> [date].dumpstream and [date]-jvmRun[N].dumpstream.
>  [ERROR] There was an error in the forked process
>  [ERROR] Unable to load category: noDcos
>  
> This will not happen if the Maven Surefire plugin is disabled, as stated here: 
> [http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin]
> I will create a PR shortly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24711) Integration tests will not work with exclude/include tags

2018-07-05 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24711.
---
Resolution: Fixed

Issue resolved by pull request 21697
[https://github.com/apache/spark/pull/21697]

> Integration tests will not work with exclude/include tags
> -
>
> Key: SPARK-24711
> URL: https://issues.apache.org/jira/browse/SPARK-24711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Minor
> Fix For: 2.4.0
>
>
> I tried to exclude some tests when adding mine and I got errors of the form:
> [INFO] BUILD FAILURE
> [INFO] 
> 
>  [INFO] Total time: 6.798 s
>  [INFO] Finished at: 2018-07-01T18:34:13+03:00
>  [INFO] Final Memory: 36M/652M
>  [INFO] 
> 
>  [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.20.1:test (default-test) on 
> project spark-kubernetes-integration-tests_2.11: There are test failures.
>  [ERROR] 
>  [ERROR] Please refer to 
> /home/stavros/Desktop/workspace/OSS/spark/resource-managers/kubernetes/integration-tests/target/surefire-reports
>  for the individual test results.
>  [ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, 
> [date].dumpstream and [date]-jvmRun[N].dumpstream.
>  [ERROR] There was an error in the forked process
>  [ERROR] Unable to load category: noDcos
>  
> This will not happen if the Maven Surefire plugin is disabled, as stated here: 
> [http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin]
> I will create a PR shortly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23820) Allow the long form of call sites to be recorded in the log

2018-07-05 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23820.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Resolved by 
https://github.com/apache/spark/pull/21433#pullrequestreview-134650236

> Allow the long form of call sites to be recorded in the log
> ---
>
> Key: SPARK-23820
> URL: https://issues.apache.org/jira/browse/SPARK-23820
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michael Mior
>Assignee: Michael Mior
>Priority: Trivial
> Fix For: 2.4.0
>
>
> It would be nice if the long form of the callsite information could be 
> included in the log. An example of what I'm proposing is here: 
> https://github.com/michaelmior/spark/commit/4b4076cfb1d51ceb20fd2b0a3b1b5be2aebb6416



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23820) Allow the long form of call sites to be recorded in the log

2018-07-05 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23820:
-

Assignee: Michael Mior
Priority: Trivial  (was: Major)

> Allow the long form of call sites to be recorded in the log
> ---
>
> Key: SPARK-23820
> URL: https://issues.apache.org/jira/browse/SPARK-23820
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michael Mior
>Assignee: Michael Mior
>Priority: Trivial
>
> It would be nice if the long form of the callsite information could be 
> included in the log. An example of what I'm proposing is here: 
> https://github.com/michaelmior/spark/commit/4b4076cfb1d51ceb20fd2b0a3b1b5be2aebb6416



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24361) Polish code block manipulation API

2018-07-05 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24361.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21405
[https://github.com/apache/spark/pull/21405]

> Polish code block manipulation API
> --
>
> Key: SPARK-24361
> URL: https://issues.apache.org/jira/browse/SPARK-24361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> The current code block manipulation API is immature and hacky. We should have a formal API for manipulating code blocks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24361) Polish code block manipulation API

2018-07-05 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24361:
---

Assignee: Liang-Chi Hsieh

> Polish code block manipulation API
> --
>
> Key: SPARK-24361
> URL: https://issues.apache.org/jira/browse/SPARK-24361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> The current code block manipulation API is immature and hacky. We should have a formal API for manipulating code blocks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24474) Cores are left idle when there are a lot of tasks to run

2018-07-05 Thread Hari Sekhon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533499#comment-16533499
 ] 

Hari Sekhon edited comment on SPARK-24474 at 7/5/18 10:43 AM:
--

My main concern with this workaround is pulling half the blocks over the network, which would degrade performance across jobs and queues on our clusters if everyone does it, since there is no network quota isolation.

I've raised a request for HDFS Anti-Affinity Block Placement improvement to 
solve dataset placement skew across a subset of datanodes. An improved spread 
of a dataset across datanodes would allow data local task scheduling to work as 
it is intended, which seems like a much better long term fix. Please vote up 
the issue here if this is affecting you:

https://issues.apache.org/jira/browse/HDFS-13720

 


was (Author: harisekhon):
My main concern with this workaround is pulling half the blocks over the network, which would degrade our clusters' performance if everyone does it.

I've raised a request for HDFS Anti-Affinity Block Placement improvement to 
solve dataset placement skew across a subset of datanodes. An improved spread 
of a dataset across datanodes would allow data local task scheduling to work as 
it is intended, which seems like a much better long term fix. Please vote up 
the issue here if this is affecting you:

https://issues.apache.org/jira/browse/HDFS-13720

 

> Cores are left idle when there are a lot of tasks to run
> 
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled
>  * It happens when dynamic allocation is enabled, and when it is disabled
>  * The stage that hangs (referred to as "Second stage" above) has a lower 
> 'Stage Id' than the first one that completes
>  * This happens with spark.shuffle.service.enabled set to true and false



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24474) Cores are left idle when there are a lot of tasks to run

2018-07-05 Thread Hari Sekhon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533499#comment-16533499
 ] 

Hari Sekhon commented on SPARK-24474:
-

My main concern with this workaround is pulling half the blocks over the network, which would degrade our clusters' performance if everyone does it.

I've raised a request for HDFS Anti-Affinity Block Placement improvement to 
solve dataset placement skew across a subset of datanodes. An improved spread 
of a dataset across datanodes would allow data local task scheduling to work as 
it is intended, which seems like a much better long term fix. Please vote up 
the issue here if this is affecting you:

https://issues.apache.org/jira/browse/HDFS-13720

 

> Cores are left idle when there are a lot of tasks to run
> 
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled
>  * It happens when dynamic allocation is enabled, and when it is disabled
>  * The stage that hangs (referred to as "Second stage" above) has a lower 
> 'Stage Id' than the first one that completes
>  * This happens with spark.shuffle.service.enabled set to true and false



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24464) Unit tests for MLlib's Instrumentation

2018-07-05 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533461#comment-16533461
 ] 

Liang-Chi Hsieh commented on SPARK-24464:
-

Because {{Instrumentation}} is used for logging, I wonder whether the test for it should verify the logged console output?
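One possible way (a rough sketch only, not an approach agreed in this thread, assuming the log4j 1.x logging that Spark 2.3 ships with) would be to attach a buffering appender to the root logger during the test and assert on the captured messages:

{code:scala}
import org.apache.log4j.{AppenderSkeleton, Logger}
import org.apache.log4j.spi.LoggingEvent
import scala.collection.mutable.ArrayBuffer

// Hypothetical test appender that buffers rendered log messages so a unit test
// can assert that Instrumentation logged the expected params/metrics.
class BufferAppender extends AppenderSkeleton {
  val messages = ArrayBuffer.empty[String]
  override def append(event: LoggingEvent): Unit = messages += event.getRenderedMessage
  override def close(): Unit = {}
  override def requiresLayout(): Boolean = false
}

val appender = new BufferAppender
Logger.getRootLogger.addAppender(appender)
// ... run an ML training step that goes through Instrumentation ...
assert(appender.messages.exists(_.contains("training")))  // hypothetical expected fragment
Logger.getRootLogger.removeAppender(appender)
{code}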

> Unit tests for MLlib's Instrumentation
> --
>
> Key: SPARK-24464
> URL: https://issues.apache.org/jira/browse/SPARK-24464
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> We added Instrumentation to MLlib to log params and metrics during machine 
> learning training and inference. However, the code has zero test coverage, 
> which usually means bugs and regressions in the future. I created this JIRA 
> to discuss how we should test Instrumentation.
> cc: [~thunterdb] [~josephkb] [~lu.DB]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24744) Structured Streaming set SparkSession configuration with the value in the metadata if there is not a option set by user.

2018-07-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24744:


Assignee: Apache Spark

> Structured Streaming set SparkSession configuration with the value in the 
> metadata if there is not a option set by user.   
> ---
>
> Key: SPARK-24744
> URL: https://issues.apache.org/jira/browse/SPARK-24744
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: bjkonglu
>Assignee: Apache Spark
>Priority: Minor
>
> h3. Background
> When I use Structured Streaming to build my application, something odd happens: the application always sets the option [spark.sql.shuffle.partitions] to its default value of [200]. Even though I set [spark.sql.shuffle.partitions] to another value via SparkConf or --conf spark.sql.shuffle.partitions=100, it doesn't work; the option keeps the default value as before.
> h3. Analyse
> I review the relevant code. The relevant code is in 
> [org.apache.spark.sql.execution.streaming.OffsetSeqMetadata].
> {code:scala}
> /** Set the SparkSession configuration with the values in the metadata */
>   def setSessionConf(metadata: OffsetSeqMetadata, sessionConf: 
> RuntimeConfig): Unit = {
> OffsetSeqMetadata.relevantSQLConfs.map(_.key).foreach { confKey =>
>   metadata.conf.get(confKey) match {
> case Some(valueInMetadata) =>
>   // Config value exists in the metadata, update the session config 
> with this value
>   val optionalValueInSession = sessionConf.getOption(confKey)
>   if (optionalValueInSession.isDefined && optionalValueInSession.get 
> != valueInMetadata) {
> logWarning(s"Updating the value of conf '$confKey' in current 
> session from " +
>   s"'${optionalValueInSession.get}' to '$valueInMetadata'.")
>   }
>   sessionConf.set(confKey, valueInMetadata)
> case None =>
>   // For backward compatibility, if a config was not recorded in the 
> offset log,
>   // then log it, and let the existing conf value in SparkSession 
> prevail.
>   logWarning (s"Conf '$confKey' was not found in the offset log, 
> using existing value")
>   }
> }
>   }
> {code}
> In this code, we can see that it always sets these options to the values from the metadata. But as users, we want those options to be settable by the user.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24744) Structured Streaming set SparkSession configuration with the value in the metadata if there is not a option set by user.

2018-07-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24744:


Assignee: (was: Apache Spark)

> Structured Streaming set SparkSession configuration with the value in the 
> metadata if there is not a option set by user.   
> ---
>
> Key: SPARK-24744
> URL: https://issues.apache.org/jira/browse/SPARK-24744
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: bjkonglu
>Priority: Minor
>
> h3. Background
> When I use Structured Streaming to build my application, something odd happens: the application always sets the option [spark.sql.shuffle.partitions] to its default value of [200]. Even though I set [spark.sql.shuffle.partitions] to another value via SparkConf or --conf spark.sql.shuffle.partitions=100, it doesn't work; the option keeps the default value as before.
> h3. Analyse
> I review the relevant code. The relevant code is in 
> [org.apache.spark.sql.execution.streaming.OffsetSeqMetadata].
> {code:scala}
> /** Set the SparkSession configuration with the values in the metadata */
>   def setSessionConf(metadata: OffsetSeqMetadata, sessionConf: 
> RuntimeConfig): Unit = {
> OffsetSeqMetadata.relevantSQLConfs.map(_.key).foreach { confKey =>
>   metadata.conf.get(confKey) match {
> case Some(valueInMetadata) =>
>   // Config value exists in the metadata, update the session config 
> with this value
>   val optionalValueInSession = sessionConf.getOption(confKey)
>   if (optionalValueInSession.isDefined && optionalValueInSession.get 
> != valueInMetadata) {
> logWarning(s"Updating the value of conf '$confKey' in current 
> session from " +
>   s"'${optionalValueInSession.get}' to '$valueInMetadata'.")
>   }
>   sessionConf.set(confKey, valueInMetadata)
> case None =>
>   // For backward compatibility, if a config was not recorded in the 
> offset log,
>   // then log it, and let the existing conf value in SparkSession 
> prevail.
>   logWarning (s"Conf '$confKey' was not found in the offset log, 
> using existing value")
>   }
> }
>   }
> {code}
> In this code, we can see that it always sets these options to the values from the metadata. But as users, we want those options to be settable by the user.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24744) Structured Streaming set SparkSession configuration with the value in the metadata if there is not a option set by user.

2018-07-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533429#comment-16533429
 ] 

Apache Spark commented on SPARK-24744:
--

User 'bjkonglu' has created a pull request for this issue:
https://github.com/apache/spark/pull/21718

> Structured Streaming set SparkSession configuration with the value in the 
> metadata if there is not a option set by user.   
> ---
>
> Key: SPARK-24744
> URL: https://issues.apache.org/jira/browse/SPARK-24744
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: bjkonglu
>Priority: Minor
>
> h3. Background
> When I use Structured Streaming to build my application, something odd happens: the application always sets the option [spark.sql.shuffle.partitions] to its default value of [200]. Even though I set [spark.sql.shuffle.partitions] to another value via SparkConf or --conf spark.sql.shuffle.partitions=100, it doesn't work; the option keeps the default value as before.
> h3. Analyse
> I review the relevant code. The relevant code is in 
> [org.apache.spark.sql.execution.streaming.OffsetSeqMetadata].
> {code:scala}
> /** Set the SparkSession configuration with the values in the metadata */
>   def setSessionConf(metadata: OffsetSeqMetadata, sessionConf: 
> RuntimeConfig): Unit = {
> OffsetSeqMetadata.relevantSQLConfs.map(_.key).foreach { confKey =>
>   metadata.conf.get(confKey) match {
> case Some(valueInMetadata) =>
>   // Config value exists in the metadata, update the session config 
> with this value
>   val optionalValueInSession = sessionConf.getOption(confKey)
>   if (optionalValueInSession.isDefined && optionalValueInSession.get 
> != valueInMetadata) {
> logWarning(s"Updating the value of conf '$confKey' in current 
> session from " +
>   s"'${optionalValueInSession.get}' to '$valueInMetadata'.")
>   }
>   sessionConf.set(confKey, valueInMetadata)
> case None =>
>   // For backward compatibility, if a config was not recorded in the 
> offset log,
>   // then log it, and let the existing conf value in SparkSession 
> prevail.
>   logWarning (s"Conf '$confKey' was not found in the offset log, 
> using existing value")
>   }
> }
>   }
> {code}
> In this code, we can see that it always sets these options to the values from the metadata. But as users, we want those options to be settable by the user.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24744) Structured Streaming set SparkSession configuration with the value in the metadata if there is not a option set by user.

2018-07-05 Thread bjkonglu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bjkonglu updated SPARK-24744:
-
Description: 
h3. Background
When I use Structured Streaming to build my application, something odd happens: the application always sets the option [spark.sql.shuffle.partitions] to its default value of [200]. Even though I set [spark.sql.shuffle.partitions] to another value via SparkConf or --conf spark.sql.shuffle.partitions=100, it doesn't work; the option keeps the default value as before.
h3. Analyse
I review the relevant code. The relevant code is in 
[org.apache.spark.sql.execution.streaming.OffsetSeqMetadata].

{code:scala}
/** Set the SparkSession configuration with the values in the metadata */
  def setSessionConf(metadata: OffsetSeqMetadata, sessionConf: RuntimeConfig): 
Unit = {
OffsetSeqMetadata.relevantSQLConfs.map(_.key).foreach { confKey =>

  metadata.conf.get(confKey) match {

case Some(valueInMetadata) =>
  // Config value exists in the metadata, update the session config 
with this value
  val optionalValueInSession = sessionConf.getOption(confKey)
  if (optionalValueInSession.isDefined && optionalValueInSession.get != 
valueInMetadata) {
logWarning(s"Updating the value of conf '$confKey' in current 
session from " +
  s"'${optionalValueInSession.get}' to '$valueInMetadata'.")
  }
  sessionConf.set(confKey, valueInMetadata)

case None =>
  // For backward compatibility, if a config was not recorded in the 
offset log,
  // then log it, and let the existing conf value in SparkSession 
prevail.
  logWarning (s"Conf '$confKey' was not found in the offset log, using 
existing value")
  }
}
  }
{code}

In this code, we can see that it always sets these options to the values from the metadata. But as users, we want those options to be settable by the user.


  was:
h3. Background
When I use structured streaming to construct my application, there is some odd! 
The application always set option [spark.sql.shuffle.partitions] to default 
value [200]. Even though, I set [spark.sql.shuffle.partitions] to other value 
by SparkConf or --conf spark.sql.shuffle.partitions=100,  but it doesn't work. 
The option value is default value as before.
h3. Analyse
I review the relevant code. The relevant code is in 
[org.apache.spark.sql.execution.streaming.OffsetSeqMetadata].

{code:scala}
/** Set the SparkSession configuration with the values in the metadata */
  def setSessionConf(metadata: OffsetSeqMetadata, sessionConf: RuntimeConfig): 
Unit = {
OffsetSeqMetadata.relevantSQLConfs.map(_.key).foreach { confKey =>

  metadata.conf.get(confKey) match {

case Some(valueInMetadata) =>
  // Config value exists in the metadata, update the session config 
with this value
  val optionalValueInSession = sessionConf.getOption(confKey)
  if (optionalValueInSession.isDefined && optionalValueInSession.get != 
valueInMetadata) {
logWarning(s"Updating the value of conf '$confKey' in current 
session from " +
  s"'${optionalValueInSession.get}' to '$valueInMetadata'.")
  }
  sessionConf.set(confKey, valueInMetadata)

case None =>
  // For backward compatibility, if a config was not recorded in the 
offset log,
  // then log it, and let the existing conf value in SparkSession 
prevail.
  logWarning (s"Conf '$confKey' was not found in the offset log, using 
existing value")
  }
}
  }
{code}

In this code, we can find it always set some option in metadata value. But as 
user, we want to those option can set by user.



> Structured Streaming set SparkSession configuration with the value in the 
> metadata if there is not a option set by user.   
> ---
>
> Key: SPARK-24744
> URL: https://issues.apache.org/jira/browse/SPARK-24744
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: bjkonglu
>Priority: Minor
>
> h3. Background
> When I use Structured Streaming to build my application, something odd happens: the application always sets the option [spark.sql.shuffle.partitions] to its default value of [200]. Even though I set [spark.sql.shuffle.partitions] to another value via SparkConf or --conf spark.sql.shuffle.partitions=100, it doesn't work; the option keeps the default value as before.
> h3. Analyse
> I review the relevant code. The relevant code is in 
> [org.apache.spark.sql.execution.streaming.OffsetSeqMetadata].
> {code:scala}
> /** Set the SparkSession configuration with the values in the metadata */
>   def setSessionConf(metadata: 

[jira] [Created] (SPARK-24745) Map function does not keep rdd name

2018-07-05 Thread Igor Pergenitsa (JIRA)
Igor Pergenitsa created SPARK-24745:
---

 Summary: Map function does not keep rdd name 
 Key: SPARK-24745
 URL: https://issues.apache.org/jira/browse/SPARK-24745
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Igor Pergenitsa


This snippet
{code:scala}
val namedRdd = sparkContext.makeRDD(List("abc", "123")).setName("named_rdd")
println(namedRdd.name)
val mappedRdd = namedRdd.map(_.length)
println(mappedRdd.name){code}
outputs:
named_rdd
null
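A hedged workaround sketch (not a proposed fix, just illustrating the behaviour): since setName returns the RDD itself, the name can simply be re-applied after each transformation:

{code:scala}
// map() produces a new RDD whose name is null, so the name has to be set again;
// setName returns this.type, which allows chaining.
val namedRdd = sparkContext.makeRDD(List("abc", "123")).setName("named_rdd")
val mappedRdd = namedRdd.map(_.length).setName(namedRdd.name + "_mapped")
println(namedRdd.name)   // named_rdd
println(mappedRdd.name)  // named_rdd_mapped
{code}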



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24743) Update the JavaDirectKafkaWordCount example to support the new API of Kafka

2018-07-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533403#comment-16533403
 ] 

Apache Spark commented on SPARK-24743:
--

User 'cluo512' has created a pull request for this issue:
https://github.com/apache/spark/pull/21717

> Update the JavaDirectKafkaWordCount example to support the new API of Kafka
> ---
>
> Key: SPARK-24743
> URL: https://issues.apache.org/jira/browse/SPARK-24743
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.3.1
>Reporter: luochuan
>Priority: Major
>
> when I ran the example JavaDirectKafkaWordCount as follows:
> ./bin/spark-submit --class 
> org.apache.spark.examples.streaming.JavaDirectKafkaWordCount --master local 
> --jars /somepath/spark-streaming-kafka-0-10-assembly_2.11-2.3.1.jar 
> examples/jars/spark-examples_2.11-2.3.1.jar kafka-broker:port topic
> Then an error happened: 'org.apache.kafka.common.config.ConfigException: Missing required configuration "bootstrap.servers" which has no default value'. So I looked into the code and found that the JavaDirectKafkaWordCount example is missing some required Kafka configs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24743) Update the JavaDirectKafkaWordCount example to support the new API of Kafka

2018-07-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24743:


Assignee: Apache Spark

> Update the JavaDirectKafkaWordCount example to support the new API of Kafka
> ---
>
> Key: SPARK-24743
> URL: https://issues.apache.org/jira/browse/SPARK-24743
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.3.1
>Reporter: luochuan
>Assignee: Apache Spark
>Priority: Major
>
> when I ran the example JavaDirectKafkaWordCount as follows:
> ./bin/spark-submit --class 
> org.apache.spark.examples.streaming.JavaDirectKafkaWordCount --master local 
> --jars /somepath/spark-streaming-kafka-0-10-assembly_2.11-2.3.1.jar 
> examples/jars/spark-examples_2.11-2.3.1.jar kafka-broker:port topic
> Then an error happened: 'org.apache.kafka.common.config.ConfigException: Missing required configuration "bootstrap.servers" which has no default value'. So I looked into the code and found that the JavaDirectKafkaWordCount example is missing some required Kafka configs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24743) Update the JavaDirectKafkaWordCount example to support the new API of Kafka

2018-07-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24743:


Assignee: (was: Apache Spark)

> Update the JavaDirectKafkaWordCount example to support the new API of Kafka
> ---
>
> Key: SPARK-24743
> URL: https://issues.apache.org/jira/browse/SPARK-24743
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.3.1
>Reporter: luochuan
>Priority: Major
>
> when I ran the example JavaDirectKafkaWordCount as follows:
> ./bin/spark-submit --class 
> org.apache.spark.examples.streaming.JavaDirectKafkaWordCount --master local 
> --jars /somepath/spark-streaming-kafka-0-10-assembly_2.11-2.3.1.jar 
> examples/jars/spark-examples_2.11-2.3.1.jar kafka-broker:port topic
> Then an error happened: 'org.apache.kafka.common.config.ConfigException: Missing required configuration "bootstrap.servers" which has no default value'. So I looked into the code and found that the JavaDirectKafkaWordCount example is missing some required Kafka configs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24673) scala sql function from_utc_timestamp second argument could be Column instead of String

2018-07-05 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533391#comment-16533391
 ] 

Hyukjin Kwon edited comment on SPARK-24673 at 7/5/18 8:13 AM:
--

Fixed in https://github.com/apache/spark/pull/21693


was (Author: hyukjin.kwon):
Fixed in https://github.com/apache/spark/pull/21693/files

> scala sql function from_utc_timestamp second argument could be Column instead 
> of String
> ---
>
> Key: SPARK-24673
> URL: https://issues.apache.org/jira/browse/SPARK-24673
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Antonio Murgia
>Assignee: Antonio Murgia
>Priority: Minor
> Fix For: 2.4.0
>
>
> As of 2.3.1 the Scala API for the built-in function from_utc_timestamp (org.apache.spark.sql.functions#from_utc_timestamp) is less powerful than its SQL counterpart. In particular, given a dataset/dataframe with the following schema:
> {code:java}
> CREATE TABLE MY_TABLE (
>   ts TIMESTAMP,
>   tz STRING
> ){code}
> from the SQL api I can do something like:
> {code:java}
> SELECT FROM_UTC_TIMESTAMP(TS, TZ){code}
> while from the programmatic api I simply cannot because
> {code:java}
> functions.from_utc_timestamp(ts: Column, tz: String){code}
> second argument is a String.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24673) scala sql function from_utc_timestamp second argument could be Column instead of String

2018-07-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24673.
--
   Resolution: Fixed
 Assignee: Antonio Murgia
Fix Version/s: 2.4.0

Fixed in https://github.com/apache/spark/pull/21693/files

> scala sql function from_utc_timestamp second argument could be Column instead 
> of String
> ---
>
> Key: SPARK-24673
> URL: https://issues.apache.org/jira/browse/SPARK-24673
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Antonio Murgia
>Assignee: Antonio Murgia
>Priority: Minor
> Fix For: 2.4.0
>
>
> As of 2.3.1 the Scala API for the built-in function from_utc_timestamp (org.apache.spark.sql.functions#from_utc_timestamp) is less powerful than its SQL counterpart. In particular, given a dataset/dataframe with the following schema:
> {code:java}
> CREATE TABLE MY_TABLE (
>   ts TIMESTAMP,
>   tz STRING
> ){code}
> from the SQL api I can do something like:
> {code:java}
> SELECT FROM_UTC_TIMESTAMP(TS, TZ){code}
> while from the programmatic api I simply cannot because
> {code:java}
> functions.from_utc_timestamp(ts: Column, tz: String){code}
> second argument is a String.
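For illustration, a hedged sketch of the two forms (assuming a DataFrame {{df}} with {{ts}} and {{tz}} columns as in the schema above; the Column-based overload is what this issue asks for and only exists once the fix is in):

{code:scala}
import org.apache.spark.sql.functions.{col, expr, from_utc_timestamp}

// Workaround on 2.3.x: fall back to the SQL expression form, which accepts a column.
val viaExpr = df.select(expr("from_utc_timestamp(ts, tz)"))

// With the overload requested here (ts: Column, tz: Column), the same can be
// written through the typed functions API.
val viaColumn = df.select(from_utc_timestamp(col("ts"), col("tz")))
{code}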



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24744) Structured Streaming set SparkSession configuration with the value in the metadata if there is not a option set by user.

2018-07-05 Thread bjkonglu (JIRA)
bjkonglu created SPARK-24744:


 Summary: Structured Streaming set SparkSession configuration with 
the value in the metadata if there is not a option set by user.   
 Key: SPARK-24744
 URL: https://issues.apache.org/jira/browse/SPARK-24744
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.1
Reporter: bjkonglu


h3. Background
When I use Structured Streaming to build my application, something odd happens: the application always sets the option [spark.sql.shuffle.partitions] to its default value of [200]. Even though I set [spark.sql.shuffle.partitions] to another value via SparkConf or --conf spark.sql.shuffle.partitions=100, it doesn't work; the option keeps the default value as before.
h3. Analyse
I review the relevant code. The relevant code is in 
[org.apache.spark.sql.execution.streaming.OffsetSeqMetadata].

{code:scala}
/** Set the SparkSession configuration with the values in the metadata */
  def setSessionConf(metadata: OffsetSeqMetadata, sessionConf: RuntimeConfig): 
Unit = {
OffsetSeqMetadata.relevantSQLConfs.map(_.key).foreach { confKey =>

  metadata.conf.get(confKey) match {

case Some(valueInMetadata) =>
  // Config value exists in the metadata, update the session config 
with this value
  val optionalValueInSession = sessionConf.getOption(confKey)
  if (optionalValueInSession.isDefined && optionalValueInSession.get != 
valueInMetadata) {
logWarning(s"Updating the value of conf '$confKey' in current 
session from " +
  s"'${optionalValueInSession.get}' to '$valueInMetadata'.")
  }
  sessionConf.set(confKey, valueInMetadata)

case None =>
  // For backward compatibility, if a config was not recorded in the 
offset log,
  // then log it, and let the existing conf value in SparkSession 
prevail.
  logWarning (s"Conf '$confKey' was not found in the offset log, using 
existing value")
  }
}
  }
{code}

In this code, we can see that it always sets these options to the values from the metadata. But as users, we want those options to be settable by the user.
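A minimal sketch of how this shows up (hypothetical checkpoint path and query; assumes {{df}} is an existing streaming DataFrame and that the query is being restarted from a checkpoint written by an earlier run):

{code:scala}
// The user asks for 100 shuffle partitions for this run...
spark.conf.set("spark.sql.shuffle.partitions", "100")

val query = df.groupBy("key").count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/tmp/ckpt")  // hypothetical checkpoint path
  .start()

// ...but setSessionConf restores the value recorded in the offset log metadata
// (e.g. 200 from the previous run), so the user-supplied value is ignored.
println(spark.conf.get("spark.sql.shuffle.partitions"))
{code}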




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table

2018-07-05 Thread Ohad Raviv (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1659#comment-1659
 ] 

Ohad Raviv commented on SPARK-24528:


After digging a little bit into the code and JIRA, I understand that this is just a special case of SPARK-2926, except that the performance boost is greater.

Over there they deal with moving the sort work from reducers to mappers, and show a ~10x reducer performance boost and an overall performance boost of ~2x (I'm not sure why it never got merged). In our case, because the data is already sorted within the buckets, we should expect the full 10x boost!

Because most of the needed code is already there, I guess it would be wise to migrate it (although it contains some fancier things, like the Tiered Merger, that I'm not sure we need).

> Missing optimization for Aggregations/Windowing on a bucketed table
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Ohad Raviv
>Priority: Major
>
> Closely related to  SPARK-24410, we're trying to optimize a very common use 
> case we have of getting the most updated row by id from a fact table.
> We're saving the table bucketed to skip the shuffle stage, but we still "waste" time on the Sort operator even though the data is already sorted.
> here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("key", "t1")
> .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
> key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, 
> t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
> Format: Parquet, Location: ...{code}
>  
> and here's a bad example, but more realistic:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, its possible to have multiple files belonging to 
> the
>   // same bucket in a given relation. Each of these files are locally sorted
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort 
> columns set
>   // Current solution is to check if all the buckets have a single file in it
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
> files.map(_.getPath.getName).groupBy(file => 
> BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 
> 1){code}
> So obviously the code avoids dealing with this situation for now.
> Could you think of a way to solve this or bypass it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24743) Update the JavaDirectKafkaWordCount example to support the new API of Kafka

2018-07-05 Thread luochuan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luochuan updated SPARK-24743:
-
Summary: Update the JavaDirectKafkaWordCount example to support the new API 
of Kafka  (was: Update the DirectKafkaWordCount example to support the new API 
of Kafka)

> Update the JavaDirectKafkaWordCount example to support the new API of Kafka
> ---
>
> Key: SPARK-24743
> URL: https://issues.apache.org/jira/browse/SPARK-24743
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.3.1
>Reporter: luochuan
>Priority: Major
>
> when I ran the example JavaDirectKafkaWordCount as follows:
> ./bin/spark-submit --class 
> org.apache.spark.examples.streaming.JavaDirectKafkaWordCount --master local 
> --jars /somepath/spark-streaming-kafka-0-10-assembly_2.11-2.3.1.jar 
> examples/jars/spark-examples_2.11-2.3.1.jar kafka-broker:port topic
> Then an error happened: 'org.apache.kafka.common.config.ConfigException: Missing required configuration "bootstrap.servers" which has no default value'. So I looked into the code and found that the JavaDirectKafkaWordCount example is missing some required Kafka configs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24743) Update the DirectKafkaWordCount example to support the new API of Kafka

2018-07-05 Thread luochuan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luochuan updated SPARK-24743:
-
Description: 
When I ran the example JavaDirectKafkaWordCount as follows:

./bin/spark-submit --class 
org.apache.spark.examples.streaming.JavaDirectKafkaWordCount --master local 
--jars /somepath/spark-streaming-kafka-0-10-assembly_2.11-2.3.1.jar 
examples/jars/spark-examples_2.11-2.3.1.jar kafka-broker:port topic

Then an error happened: 'org.apache.kafka.common.config.ConfigException: Missing required configuration "bootstrap.servers" which has no default value'. So I looked into the code and found that the JavaDirectKafkaWordCount example is missing some required Kafka configs.
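For reference, a hedged sketch (in Scala rather than the Java example, with placeholder broker/topic values and an assumed existing StreamingContext {{ssc}}) of the consumer configuration the kafka-0-10 direct stream needs; without {{bootstrap.servers}} the ConfigException above is thrown:

{code:scala}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// Minimal required consumer configs for the 0-10 integration.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "kafka-broker:port",        // placeholder from the report
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "direct-kafka-word-count"   // hypothetical group id
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("topic"), kafkaParams)
)
{code}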

> Update the DirectKafkaWordCount example to support the new API of Kafka
> ---
>
> Key: SPARK-24743
> URL: https://issues.apache.org/jira/browse/SPARK-24743
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.3.1
>Reporter: luochuan
>Priority: Major
>
> when I ran the example JavaDirectKafkaWordCount as follows:
> ./bin/spark-submit --class 
> org.apache.spark.examples.streaming.JavaDirectKafkaWordCount --master local 
> --jars /somepath/spark-streaming-kafka-0-10-assembly_2.11-2.3.1.jar 
> examples/jars/spark-examples_2.11-2.3.1.jar kafka-broker:port topic
> Then an error happened: 'org.apache.kafka.common.config.ConfigException: Missing required configuration "bootstrap.servers" which has no default value'. So I looked into the code and found that the JavaDirectKafkaWordCount example is missing some required Kafka configs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24743) Update the DirectKafkaWordCount example to support the new API of Kafka

2018-07-05 Thread luochuan (JIRA)
luochuan created SPARK-24743:


 Summary: Update the DirectKafkaWordCount example to support the 
new API of Kafka
 Key: SPARK-24743
 URL: https://issues.apache.org/jira/browse/SPARK-24743
 Project: Spark
  Issue Type: Bug
  Components: Examples
Affects Versions: 2.3.1
Reporter: luochuan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24742) Field Metadata raises NullPointerException in hashCode method

2018-07-05 Thread Kaya Kupferschmidt (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533286#comment-16533286
 ] 

Kaya Kupferschmidt commented on SPARK-24742:


I will provide a patch later.

> Field Metadata raises NullPointerException in hashCode method
> -
>
> Key: SPARK-24742
> URL: https://issues.apache.org/jira/browse/SPARK-24742
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Kaya Kupferschmidt
>Priority: Minor
>
> The class org.apache.spark.sql.types.Metadata has a method *putNull* for 
> storing null pointers as Metadata. Unfortunately the hashCode method throws a 
> NullPointerException when there are null values in a Metadata object, 
> rendering the *putNull* method useless.
> h2. How to Reproduce
> The following code will raise a NullPointerException
> {code}
> new MetadataBuilder().putNull("key").build().hashCode
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24742) Field Metadata raises NullPointerException in hashCode method

2018-07-05 Thread Kaya Kupferschmidt (JIRA)
Kaya Kupferschmidt created SPARK-24742:
--

 Summary: Field Metadata raises NullPointerException in hashCode 
method
 Key: SPARK-24742
 URL: https://issues.apache.org/jira/browse/SPARK-24742
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Kaya Kupferschmidt


The class org.apache.spark.sql.types.Metadata has a method *putNull* for 
storing null pointers as Metadata. Unfortunately the hashCode method throws a 
NullPointerException when there are null values in a Metadata object, rendering the *putNull* method useless.

h2. How to Reproduce
The following code will raise a NullPointerException
{code}
new MetadataBuilder().putNull("key").build().hashCode
{code}
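The NPE comes from hashing a null value directly; a null-safe hash would avoid it. A minimal sketch of the idea (hypothetical helper, not the actual Spark patch):

{code:scala}
import java.util.Objects

// Objects.hashCode returns 0 for null instead of throwing, so hashing a
// Metadata value through it tolerates entries created with putNull.
def safeHash(value: Any): Int = Objects.hashCode(value.asInstanceOf[AnyRef])

safeHash(null)   // 0
safeHash("key")  // same as "key".hashCode
{code}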



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org