[jira] [Commented] (SPARK-25313) Fix regression in FileFormatWriter output schema
[ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605335#comment-16605335 ] Apache Spark commented on SPARK-25313: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/22346 > Fix regression in FileFormatWriter output schema > > > Key: SPARK-25313 > URL: https://issues.apache.org/jira/browse/SPARK-25313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > In the following example: > val location = "/tmp/t" > val df = spark.range(10).toDF("id") > df.write.format("parquet").saveAsTable("tbl") > spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") > spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location > $location") > spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") > println(spark.read.parquet(location).schema) > spark.table("tbl2").show() > The output column name in the schema will be id instead of ID, thus the last > query shows nothing from tbl2. > By enabling the debug message we can see that the output naming is changed > from ID to id, and then the outputColumns in > InsertIntoHadoopFsRelationCommand are changed by RemoveRedundantAliases. > To guarantee correctness, we should change the output columns from > `Seq[Attribute]` to `Seq[String]` to avoid the names being replaced by the > optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
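For readability, here is the reproduction from the description above as a standalone snippet, assuming a Spark 2.3/2.4 session named `spark` and a writable `/tmp/t` (the quoting around the location was added here so the statement parses):

{code:scala}
val location = "/tmp/t"
val df = spark.range(10).toDF("id")
df.write.format("parquet").saveAsTable("tbl")
spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet LOCATION '$location'")
spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
// Per the report, before the fix the schema printed here carries the column
// name `id` instead of `ID`, and the final query on tbl2 returns no rows.
println(spark.read.parquet(location).schema)
spark.table("tbl2").show()
{code}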
[jira] [Commented] (SPARK-12321) JSON format for logical/physical execution plans
[ https://issues.apache.org/jira/browse/SPARK-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605317#comment-16605317 ] Apache Spark commented on SPARK-12321: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/22345 > JSON format for logical/physical execution plans > > > Key: SPARK-12321 > URL: https://issues.apache.org/jira/browse/SPARK-12321 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12321) JSON format for logical/physical execution plans
[ https://issues.apache.org/jira/browse/SPARK-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605316#comment-16605316 ] Apache Spark commented on SPARK-12321: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/22345 > JSON format for logical/physical execution plans > > > Key: SPARK-12321 > URL: https://issues.apache.org/jira/browse/SPARK-12321 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605308#comment-16605308 ] Gengliang Wang commented on SPARK-24771: [~vanzin] I am OK with either way. Shading Avro 1.8 in data source only seems reasonable. But I am not confident enough to do the change. Can you open a PR for it? > Upgrade AVRO version from 1.7.7 to 1.8 > -- > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: release-notes > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25352) Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold
[ https://issues.apache.org/jira/browse/SPARK-25352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605263#comment-16605263 ] Apache Spark commented on SPARK-25352: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/22344 > Perform ordered global limit when limit number is bigger than > topKSortFallbackThreshold > --- > > Key: SPARK-25352 > URL: https://issues.apache.org/jira/browse/SPARK-25352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We have optimization on global limit to evenly distribute limit rows across > all partitions. This optimization doesn't work for ordered results. > For a query ending with sort + limit, in most cases it is performed by > `TakeOrderedAndProjectExec`. > But if limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, > global limit will be used. At this moment, we need to do ordered global limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25352) Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold
[ https://issues.apache.org/jira/browse/SPARK-25352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25352: Assignee: Apache Spark > Perform ordered global limit when limit number is bigger than > topKSortFallbackThreshold > --- > > Key: SPARK-25352 > URL: https://issues.apache.org/jira/browse/SPARK-25352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Major > > We have optimization on global limit to evenly distribute limit rows across > all partitions. This optimization doesn't work for ordered results. > For a query ending with sort + limit, in most cases it is performed by > `TakeOrderedAndProjectExec`. > But if limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, > global limit will be used. At this moment, we need to do ordered global limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25352) Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold
[ https://issues.apache.org/jira/browse/SPARK-25352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25352: Assignee: (was: Apache Spark) > Perform ordered global limit when limit number is bigger than > topKSortFallbackThreshold > --- > > Key: SPARK-25352 > URL: https://issues.apache.org/jira/browse/SPARK-25352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We have optimization on global limit to evenly distribute limit rows across > all partitions. This optimization doesn't work for ordered results. > For a query ending with sort + limit, in most cases it is performed by > `TakeOrderedAndProjectExec`. > But if limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, > global limit will be used. At this moment, we need to do ordered global limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25352) Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold
Liang-Chi Hsieh created SPARK-25352: --- Summary: Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold Key: SPARK-25352 URL: https://issues.apache.org/jira/browse/SPARK-25352 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Liang-Chi Hsieh We have optimization on global limit to evenly distribute limit rows across all partitions. This optimization doesn't work for ordered results. For a query ending with sort + limit, in most cases it is performed by `TakeOrderedAndProjectExec`. But if limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, global limit will be used. At this moment, we need to do ordered global limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
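To make the trigger condition concrete, here is a hedged sketch; the SQL property behind `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD` is assumed to be `spark.sql.execution.topKSortFallbackThreshold` (Spark 2.4 naming):

{code:scala}
import org.apache.spark.sql.functions.col

// Force the fallback: make the threshold smaller than the LIMIT below.
spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", "10")

val q = spark.range(1000).orderBy(col("id").desc).limit(100)

// With limit (100) > threshold (10), the plan uses Sort + GlobalLimit instead
// of TakeOrderedAndProjectExec; this issue is about keeping that path ordered.
q.explain()
q.show(5)
{code}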
[jira] [Resolved] (SPARK-25252) Support arrays of any types in to_json
[ https://issues.apache.org/jira/browse/SPARK-25252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25252. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 6 [https://github.com/apache/spark/pull/6] > Support arrays of any types in to_json > -- > > Key: SPARK-25252 > URL: https://issues.apache.org/jira/browse/SPARK-25252 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > Need to improve the to_json function and make it more consistent with > from_json by supporting arrays of any types (as root types). For now, it > supports only arrays of structs and arrays of maps. After the changes the > following code should work: > {code:scala} > select to_json(array('1','2','3')) > > ["1","2","3"] > select to_json(array(array(1,2,3),array(4))) > > [[1,2,3],[4]] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
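The same behaviour through the DataFrame API, as a small sketch for Spark 2.4+ where arrays are accepted as the root type of `to_json`:

{code:scala}
import org.apache.spark.sql.functions.{array, lit, to_json}

// Array of atomic values as the root type, mirroring the SQL above.
spark.range(1)
  .select(to_json(array(lit("1"), lit("2"), lit("3"))).as("json"))
  .show(false)   // expected: ["1","2","3"]

// Nested arrays as the root type.
spark.range(1)
  .select(to_json(array(array(lit(1), lit(2), lit(3)), array(lit(4)))).as("json"))
  .show(false)   // expected: [[1,2,3],[4]]
{code}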
[jira] [Assigned] (SPARK-25252) Support arrays of any types in to_json
[ https://issues.apache.org/jira/browse/SPARK-25252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-25252: Assignee: Maxim Gekk > Support arrays of any types in to_json > -- > > Key: SPARK-25252 > URL: https://issues.apache.org/jira/browse/SPARK-25252 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > Need to improve the to_json function and make it more consistent with > from_json by supporting arrays of any types (as root types). For now, it > supports only arrays of structs and arrays of maps. After the changes the > following code should work: > {code:scala} > select to_json(array('1','2','3')) > > ["1","2","3"] > select to_json(array(array(1,2,3),array(4))) > > [[1,2,3],[4]] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25344) Break large tests.py files into smaller files
[ https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605250#comment-16605250 ] Imran Rashid commented on SPARK-25344: -- kinda related, maybe this should get its own jira -- when you run the "pyspark-sql" tests, it also somehow runs {{SparkSubmitTests}}, which really should only be in the "pyspark-core" module. For me they take 80s; it would be nice to eliminate that. I don't really understand why they get run in that module, but it does seem that if I comment out the import in sql/tests.py, then they don't get run that extra time. We can't really do that, as the import is needed for the {{HiveSparkSubmitTests}}. But we should figure out why just importing it makes them run, and whether we can avoid that. > Break large tests.py files into smaller files > - > > Key: SPARK-25344 > URL: https://issues.apache.org/jira/browse/SPARK-25344 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: newbie > > We've got a ton of tests in one humongous tests.py file, rather than breaking > it out into smaller files. > Having one huge file doesn't seem great for code organization, and it also > makes the test parallelization in run-tests.py not work as well. On my > laptop, tests.py takes 150s, and the next longest test file takes only 20s. > There are similarly large files in other pyspark modules, eg. sql/tests.py, > ml/tests.py, mllib/tests.py, streaming/tests.py. > It seems that at least for some of these files, it's already broken into > independent test classes, so it shouldn't be too hard to just move them into > their own files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasour
[ https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25337: - Assignee: Dongjoon Hyun > HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: > org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;) > > > Key: SPARK-25337 > URL: https://issues.apache.org/jira/browse/SPARK-25337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.0 > > > Observed in the Scala 2.12 pull request builder consistently now. I don't see > this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of > hard to see how. > CC [~sadhen] > {code:java} > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** > Exception encountered when invoking run on a nested suite - spark-submit > returned with exit code 1. > Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > > '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' > ... > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/session.py", > line 545, in sql > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > 2018-09-04 20:00:04.95 - stdout> py4j.protocol.Py4JJavaError: An error > occurred while calling o27.sql. > 2018-09-04 20:00:04.95 - stdout> : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasour
[ https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25337. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22340 [https://github.com/apache/spark/pull/22340] > HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: > org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;) > > > Key: SPARK-25337 > URL: https://issues.apache.org/jira/browse/SPARK-25337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.0 > > > Observed in the Scala 2.12 pull request builder consistently now. I don't see > this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of > hard to see how. > CC [~sadhen] > {code:java} > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** > Exception encountered when invoking run on a nested suite - spark-submit > returned with exit code 1. > Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > > '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' > ... > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/session.py", > line 545, in sql > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > 2018-09-04 20:00:04.95 - stdout> py4j.protocol.Py4JJavaError: An error > occurred while calling o27.sql. > 2018-09-04 20:00:04.95 - stdout> : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20918) Use FunctionIdentifier as function identifiers in FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-20918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-20918: Labels: release-notes (was: ) > Use FunctionIdentifier as function identifiers in FunctionRegistry > -- > > Key: SPARK-20918 > URL: https://issues.apache.org/jira/browse/SPARK-20918 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Labels: release-notes > Fix For: 2.3.0 > > > Currently, the unquoted string of a function identifier is being used as the > function identifier in the function registry. This could cause the incorrect > the behavior when users use `.` in the function names. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25313) Fix regression in FileFormatWriter output schema
[ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25313. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22320 [https://github.com/apache/spark/pull/22320] > Fix regression in FileFormatWriter output schema > > > Key: SPARK-25313 > URL: https://issues.apache.org/jira/browse/SPARK-25313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > In the follow example: > val location = "/tmp/t" > val df = spark.range(10).toDF("id") > df.write.format("parquet").saveAsTable("tbl") > spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") > spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location > $location") > spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") > println(spark.read.parquet(location).schema) > spark.table("tbl2").show() > The output column name in schema will be id instead of ID, thus the last > query shows nothing from tbl2. > By enabling the debug message we can see that the output naming is changed > from ID to id, and then the outputColumns in > InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. > To guarantee correctness, we should change the output columns from > `Seq[Attribute]` to `Seq[String]` to avoid its names being replaced by > optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema
[ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25313: --- Assignee: Gengliang Wang > Fix regression in FileFormatWriter output schema > > > Key: SPARK-25313 > URL: https://issues.apache.org/jira/browse/SPARK-25313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > In the follow example: > val location = "/tmp/t" > val df = spark.range(10).toDF("id") > df.write.format("parquet").saveAsTable("tbl") > spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") > spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location > $location") > spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") > println(spark.read.parquet(location).schema) > spark.table("tbl2").show() > The output column name in schema will be id instead of ID, thus the last > query shows nothing from tbl2. > By enabling the debug message we can see that the output naming is changed > from ID to id, and then the outputColumns in > InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. > To guarantee correctness, we should change the output columns from > `Seq[Attribute]` to `Seq[String]` to avoid its names being replaced by > optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605186#comment-16605186 ] Hyukjin Kwon commented on SPARK-18112: -- We need the metastore jar if I understood correctly. FWIW, I am seeing a few tests internally running with different metastore support. I doubt there's an issue with the supportability itself. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have > also been out for a long time, but until now Spark only supports reading Hive > metastore data from Hive 1.2.1 and older versions. Since Hive 2.x has many > bug fixes and performance improvements, it's better and urgent to upgrade to > support Hive 2.x. > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
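For context on "we need the metastore jar": since the fix (Spark 2.2.0+), talking to a newer metastore is a configuration matter. A minimal, hedged sketch follows; the jar path is a placeholder you would point at a local Hive 2.x client installation:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-2.x-metastore")
  // Version of the Hive metastore client Spark should instantiate.
  .config("spark.sql.hive.metastore.version", "2.1.1")
  // Classpath holding the matching Hive 2.x client jars (placeholder path).
  .config("spark.sql.hive.metastore.jars", "/opt/hive-2.1.1/lib/*")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
{code}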
[jira] [Commented] (SPARK-25346) Document Spark builtin data sources
[ https://issues.apache.org/jira/browse/SPARK-25346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605141#comment-16605141 ] Hyukjin Kwon commented on SPARK-25346: -- Avro - documentation was added in SPARK-25133. I agree there isn't explicit documentation that lists the builtin data sources; however, I wonder if it actually blocks SPARK-25347, since it can be added in other forms like the examples above. > Document Spark builtin data sources > --- > > Key: SPARK-25346 > URL: https://issues.apache.org/jira/browse/SPARK-25346 > Project: Spark > Issue Type: Story > Components: Documentation >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Major > > It would be nice to list built-in data sources in the doc site, so users know > what is available by default. However, I didn't find any in the 2.3.1 docs. > > cc: [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25346) Document Spark builtin data sources
[ https://issues.apache.org/jira/browse/SPARK-25346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605140#comment-16605140 ] Hyukjin Kwon commented on SPARK-25346: -- [~mengxr], actually there is documentation for several data sources. For example, Parquet - https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files ORC - https://spark.apache.org/docs/latest/sql-programming-guide.html#orc-files JSON - https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets CSV - https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options (there were a few attempts at CSV documentation, but they failed because of duplication with the API documentation in DataFrameReader) JDBC - https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases > Document Spark builtin data sources > --- > > Key: SPARK-25346 > URL: https://issues.apache.org/jira/browse/SPARK-25346 > Project: Spark > Issue Type: Story > Components: Documentation >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Major > > It would be nice to list built-in data sources in the doc site, so users know > what is available by default. However, I didn't find any in the 2.3.1 docs. > > cc: [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
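For reference, this is roughly how the sources linked above surface through the DataFrame reader API; the paths, JDBC URL, and credentials below are placeholders:

{code:scala}
// Each format below ships with Spark; paths and the JDBC URL are placeholders.
val parquetDF = spark.read.format("parquet").load("/data/events.parquet")
val orcDF     = spark.read.format("orc").load("/data/events.orc")
val jsonDF    = spark.read.format("json").load("/data/events.json")
val csvDF     = spark.read.format("csv").option("header", "true").load("/data/events.csv")
val jdbcDF    = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/events")
  .option("dbtable", "public.events")
  .option("user", "spark")
  .option("password", "secret")
  .load()
{code}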
[jira] [Comment Edited] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605131#comment-16605131 ] Yuming Wang edited comment on SPARK-25330 at 9/6/18 1:09 AM: - I try to build Hadoop 2.7.7 with [{{Configuration.getRestrictParserDefault(Object resource)}}|https://github.com/apache/hadoop/blob/release-2.7.7-RC0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java#L236] = true and false. It succeeded when {{Configuration.getRestrictParserDefault(Object resource)=false}}, but failed when {{Configuration.getRestrictParserDefault(Object resource)=true}}. was (Author: q79969786): I try to build Hadoop 2.7.7 with[{{Configuration.getRestrictParserDefault(Object resource)}}|https://github.com/apache/hadoop/blob/release-2.7.7-RC0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java#L236] = true and false. It succeeded when {{Configuration.getRestrictParserDefault(Object resource)=false}}, but failed when {{Configuration.getRestrictParserDefault(Object resource)=true}}. > Permission issue after upgrade hadoop version to 2.7.7 > -- > > Key: SPARK-25330 > URL: https://issues.apache.org/jira/browse/SPARK-25330 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.2, 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:java} > # build spark > ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive > -Phive-thriftserver -Pyarn > tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd > spark-2.4.0-SNAPSHOT-bin-SPARK-25330 > export HADOOP_PROXY_USER=user_a > bin/spark-sql > export HADOOP_PROXY_USER=user_b > bin/spark-sql{code} > > {noformat} > Exception in thread "main" java.lang.RuntimeException: > org.apache.hadoop.security.AccessControlException: Permission denied: > user=user_b, access=EXECUTE, > inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605131#comment-16605131 ] Yuming Wang commented on SPARK-25330: - I try to build Hadoop 2.7.7 with[{{Configuration.getRestrictParserDefault(Object resource)}}|https://github.com/apache/hadoop/blob/release-2.7.7-RC0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java#L236] = true and false. It succeeded when {{Configuration.getRestrictParserDefault(Object resource)=false}}, but failed when {{Configuration.getRestrictParserDefault(Object resource)=true}}. > Permission issue after upgrade hadoop version to 2.7.7 > -- > > Key: SPARK-25330 > URL: https://issues.apache.org/jira/browse/SPARK-25330 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.2, 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:java} > # build spark > ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive > -Phive-thriftserver -Pyarn > tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd > spark-2.4.0-SNAPSHOT-bin-SPARK-25330 > export HADOOP_PROXY_USER=user_a > bin/spark-sql > export HADOOP_PROXY_USER=user_b > bin/spark-sql{code} > > {noformat} > Exception in thread "main" java.lang.RuntimeException: > org.apache.hadoop.security.AccessControlException: Permission denied: > user=user_b, access=EXECUTE, > inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605129#comment-16605129 ] Yuming Wang commented on SPARK-25330: - No. The issue occurred in this commit: [apache/hadoop@{{feb886f}}|https://github.com/apache/hadoop/commit/feb886f2093ea5da0cd09c69bd1360a335335c86]. > Permission issue after upgrade hadoop version to 2.7.7 > -- > > Key: SPARK-25330 > URL: https://issues.apache.org/jira/browse/SPARK-25330 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.2, 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:java} > # build spark > ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive > -Phive-thriftserver -Pyarn > tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd > spark-2.4.0-SNAPSHOT-bin-SPARK-25330 > export HADOOP_PROXY_USER=user_a > bin/spark-sql > export HADOOP_PROXY_USER=user_b > bin/spark-sql{code} > > {noformat} > Exception in thread "main" java.lang.RuntimeException: > org.apache.hadoop.security.AccessControlException: Permission denied: > user=user_b, access=EXECUTE, > inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605114#comment-16605114 ] Eric Yang commented on SPARK-25330: --- [~yumwang] Does Hadoop 2.7.5 work? It might help us isolate the release that started the regression and narrow down the number of JIRAs that the Hadoop team needs to go through. Thanks > Permission issue after upgrade hadoop version to 2.7.7 > -- > > Key: SPARK-25330 > URL: https://issues.apache.org/jira/browse/SPARK-25330 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.2, 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:java} > # build spark > ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive > -Phive-thriftserver -Pyarn > tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd > spark-2.4.0-SNAPSHOT-bin-SPARK-25330 > export HADOOP_PROXY_USER=user_a > bin/spark-sql > export HADOOP_PROXY_USER=user_b > bin/spark-sql{code} > > {noformat} > Exception in thread "main" java.lang.RuntimeException: > org.apache.hadoop.security.AccessControlException: Permission denied: > user=user_b, access=EXECUTE, > inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25268) runParallelPersonalizedPageRank throws serialization Exception
[ https://issues.apache.org/jira/browse/SPARK-25268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-25268: - Assignee: shahid > runParallelPersonalizedPageRank throws serialization Exception > -- > > Key: SPARK-25268 > URL: https://issues.apache.org/jira/browse/SPARK-25268 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.4.0 >Reporter: Bago Amirbekian >Assignee: shahid >Priority: Critical > > A recent change to PageRank introduced a bug in the > ParallelPersonalizedPageRank implementation. The change prevents > serialization of a Map which needs to be broadcast to all workers. The issue > is in this line here: > [https://github.com/apache/spark/blob/6c5cb85856235efd464b109558896f81ae2c4c75/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala#L201] > Because graphx units tests are run in local mode, the Serialization issue is > not caught. > > {code:java} > [info] - Star example parallel personalized PageRank *** FAILED *** (2 > seconds, 160 milliseconds) > [info] java.io.NotSerializableException: > scala.collection.immutable.MapLike$$anon$2 > [info] Serialization stack: > [info] - object not serializable (class: > scala.collection.immutable.MapLike$$anon$2, value: Map(1 -> > SparseVector(3)((0,1.0)), 2 -> SparseVector(3)((1,1.0)), 3 -> > SparseVector(3)((2,1.0 > [info] at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) > [info] at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > [info] at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291) > [info] at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291) > [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348) > [info] at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:292) > [info] at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127) > [info] at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:88) > [info] at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > [info] at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) > [info] at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489) > [info] at > org.apache.spark.graphx.lib.PageRank$.runParallelPersonalizedPageRank(PageRank.scala:205) > [info] at > org.apache.spark.graphx.lib.GraphXHelpers$.runParallelPersonalizedPageRank(GraphXHelpers.scala:31) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRank$.run(ParallelPersonalizedPageRank.scala:115) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRank.run(ParallelPersonalizedPageRank.scala:84) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply$mcV$sp(ParallelPersonalizedPageRankSuite.scala:62) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at 
org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.graphframes.SparkFunSuite.withFixture(SparkFunSuite.scala:40) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > [info] at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > [info] at scala.collection.immutable.List.foreach(List.scala:383)
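The `scala.collection.immutable.MapLike$$anon$2` in the trace above is the lazy view returned by `Map.mapValues` in Scala 2.11/2.12, which is not `Serializable` and therefore cannot be broadcast. A small standalone sketch of that failure mode and the usual workaround follows; it illustrates the Scala behaviour, not necessarily the exact patch applied to PageRank:

{code:scala}
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

def javaSerializable(o: AnyRef): Boolean =
  try { new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(o); true }
  catch { case _: NotSerializableException => false }

val base = Map(1L -> 1.0, 2L -> 2.0, 3L -> 3.0)

// mapValues returns a lazy, non-serializable view (MapLike$$anon$2 in the trace).
val view = base.mapValues(_ * 2)
// Materializing the view yields a plain immutable Map that serializes fine.
val materialized = base.mapValues(_ * 2).map(identity)

println(javaSerializable(view))         // false
println(javaSerializable(materialized)) // true
{code}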
[jira] [Updated] (SPARK-20901) Feature parity for ORC with Parquet
[ https://issues.apache.org/jira/browse/SPARK-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-20901: -- Affects Version/s: 2.4.0 > Feature parity for ORC with Parquet > --- > > Key: SPARK-20901 > URL: https://issues.apache.org/jira/browse/SPARK-20901 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1, 2.2.0, 2.2.1, 2.3.0, 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to track the feature parity for ORC with Parquet. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23774) `Cast` to CHAR/VARCHAR should truncate the values
[ https://issues.apache.org/jira/browse/SPARK-23774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-23774. --- Resolution: Won't Do Per review comments, we will revisit this when we can support CHAR/VARCHAR natively. > `Cast` to CHAR/VARCHAR should truncate the values > - > > Key: SPARK-23774 > URL: https://issues.apache.org/jira/browse/SPARK-23774 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.2, 2.2.1, 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to fix the following `CAST` behavior on `CHAR/VARCHAR` types. > Since HiveStringType is used only in parsing, this PR is also about parsing. > *Spark* > {code} > scala> sql("SELECT CAST('123' AS CHAR(1)), CAST('123' AS VARCHAR(1))").show > +---+---+ > |CAST(123 AS STRING)|CAST(123 AS STRING)| > +---+---+ > |123|123| > +---+---+ > scala> sql("SELECT CAST('123' AS CHAR(0)), CAST('123' AS VARCHAR(0))").show > +---+---+ > |CAST(123 AS STRING)|CAST(123 AS STRING)| > +---+---+ > |123|123| > +---+---+ > {code} > *Hive* > {code} > hive> SELECT CAST('123' AS CHAR(1)), CAST('123' AS VARCHAR(1)); > OK > 1 1 > hive> SELECT CAST('123' AS CHAR(0)), CAST('123' AS VARCHAR(0)); > FAILED: RuntimeException Char length 0 out of allowed range [1, 255] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23131) Kryo raises StackOverflow during serializing GLR model
[ https://issues.apache.org/jira/browse/SPARK-23131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23131: - Assignee: Yuming Wang > Kryo raises StackOverflow during serializing GLR model > -- > > Key: SPARK-23131 > URL: https://issues.apache.org/jira/browse/SPARK-23131 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.2.0 >Reporter: Peigen >Assignee: Yuming Wang >Priority: Minor > Fix For: 2.4.0 > > > When trying to use GeneralizedLinearRegression model and set SparkConf to use > KryoSerializer(JavaSerializer is fine) > It causes StackOverflowException > {quote}Exception in thread "dispatcher-event-loop-34" > java.lang.StackOverflowError > at java.util.HashMap.hash(HashMap.java:338) > at java.util.HashMap.get(HashMap.java:556) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:61) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > {quote} > This is very likely to be > [https://github.com/EsotericSoftware/kryo/issues/341] > Upgrade Kryo to 4.0+ probably could fix this > > Wish for upgrade Kryo version for spark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy
[ https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25176: - Assignee: Yuming Wang > Kryo fails to serialize a parametrised type hierarchy > - > > Key: SPARK-25176 > URL: https://issues.apache.org/jira/browse/SPARK-25176 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.1 >Reporter: Mikhail Pryakhin >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > > I'm using the latest spark version spark-core_2.11:2.3.1 which > transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the > com.twitter:chill_2.11:0.8.0 dependency. This exact version of kryo > serializer contains an issue [1,2] which results in throwing > ClassCastExceptions when serialising parameterised type hierarchy. > This issue has been fixed in kryo version 4.0.0 [3]. It would be great to > have this update in Spark as well. Could you please upgrade the version of > com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2? > You can find a simple test to reproduce the issue [4]. > [1] https://github.com/EsotericSoftware/kryo/issues/384 > [2] https://github.com/EsotericSoftware/kryo/issues/377 > [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0 > [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25258) Upgrade kryo package to version 4.0.2
[ https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25258: - Assignee: Yuming Wang > Upgrade kryo package to version 4.0.2 > - > > Key: SPARK-25258 > URL: https://issues.apache.org/jira/browse/SPARK-25258 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.1.0, 2.3.1 >Reporter: liupengcheng >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > > Recently, we encountered a Kryo performance issue in Spark 2.1.0, and the > issue affects all Kryo versions below 4.0.2, so it seems that all Spark versions might > encounter this issue. > Issue description: > In the shuffle write phase or some spilling operations, Spark will use the Kryo > serializer to serialize data if `spark.serializer` is set to > `KryoSerializer`. However, when the data contains some extremely large records, > KryoSerializer's MapReferenceResolver would expand, and its `reset` > method will take a long time to reset all items in the writtenObjects table to > null. > com.esotericsoftware.kryo.util.MapReferenceResolver > {code:java} > public void reset () { > readObjects.clear(); > writtenObjects.clear(); > } > public void clear () { > K[] keyTable = this.keyTable; > for (int i = capacity + stashSize; i-- > 0;) > keyTable[i] = null; > size = 0; > stashSize = 0; > } > {code} > I checked the Kryo project on GitHub, and this issue seems to be fixed in 4.0.2+ > [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28] > > I was wondering if we can upgrade Spark's Kryo dependency to 4.0.2+ to fix > this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy
[ https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25176. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22179 [https://github.com/apache/spark/pull/22179] > Kryo fails to serialize a parametrised type hierarchy > - > > Key: SPARK-25176 > URL: https://issues.apache.org/jira/browse/SPARK-25176 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.1 >Reporter: Mikhail Pryakhin >Priority: Major > Fix For: 2.4.0 > > > I'm using the latest spark version spark-core_2.11:2.3.1 which > transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the > com.twitter:chill_2.11:0.8.0 dependency. This exact version of kryo > serializer contains an issue [1,2] which results in throwing > ClassCastExceptions when serialising parameterised type hierarchy. > This issue has been fixed in kryo version 4.0.0 [3]. It would be great to > have this update in Spark as well. Could you please upgrade the version of > com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2? > You can find a simple test to reproduce the issue [4]. > [1] https://github.com/EsotericSoftware/kryo/issues/384 > [2] https://github.com/EsotericSoftware/kryo/issues/377 > [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0 > [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23131) Kryo raises StackOverflow during serializing GLR model
[ https://issues.apache.org/jira/browse/SPARK-23131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23131. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22179 [https://github.com/apache/spark/pull/22179] > Kryo raises StackOverflow during serializing GLR model > -- > > Key: SPARK-23131 > URL: https://issues.apache.org/jira/browse/SPARK-23131 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.2.0 >Reporter: Peigen >Priority: Minor > Fix For: 2.4.0 > > > When trying to use GeneralizedLinearRegression model and set SparkConf to use > KryoSerializer(JavaSerializer is fine) > It causes StackOverflowException > {quote}Exception in thread "dispatcher-event-loop-34" > java.lang.StackOverflowError > at java.util.HashMap.hash(HashMap.java:338) > at java.util.HashMap.get(HashMap.java:556) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:61) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > {quote} > This is very likely to be > [https://github.com/EsotericSoftware/kryo/issues/341] > Upgrade Kryo to 4.0+ probably could fix this > > Wish for upgrade Kryo version for spark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25258) Upgrade kryo package to version 4.0.2
[ https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25258. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22179 [https://github.com/apache/spark/pull/22179] > Upgrade kryo package to version 4.0.2 > - > > Key: SPARK-25258 > URL: https://issues.apache.org/jira/browse/SPARK-25258 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.1.0, 2.3.1 >Reporter: liupengcheng >Priority: Major > Fix For: 2.4.0 > > > Recently, we encountered a Kryo performance issue in Spark 2.1.0, and the > issue affects all Kryo versions below 4.0.2, so it seems that all Spark versions might > encounter this issue. > Issue description: > In the shuffle write phase or some spilling operations, Spark will use the Kryo > serializer to serialize data if `spark.serializer` is set to > `KryoSerializer`. However, when the data contains some extremely large records, > KryoSerializer's MapReferenceResolver would expand, and its `reset` > method will take a long time to reset all items in the writtenObjects table to > null. > com.esotericsoftware.kryo.util.MapReferenceResolver > {code:java} > public void reset () { > readObjects.clear(); > writtenObjects.clear(); > } > public void clear () { > K[] keyTable = this.keyTable; > for (int i = capacity + stashSize; i-- > 0;) > keyTable[i] = null; > size = 0; > stashSize = 0; > } > {code} > I checked the Kryo project on GitHub, and this issue seems to be fixed in 4.0.2+ > [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28] > > I was wondering if we can upgrade Spark's Kryo dependency to 4.0.2+ to fix > this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
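For context, the code path described in the issue above is only taken when Kryo is the configured serializer. A minimal, hedged sketch of that setup; the app name, registered classes, and job are just examples:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("kryo-shuffle")
  // Kryo only sits on the shuffle/spill serialization path when selected here.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional: registering classes keeps serialized records smaller.
  .registerKryoClasses(Array(classOf[Array[Byte]], classOf[Array[Long]]))

val spark = SparkSession.builder().config(conf).getOrCreate()

// Any wide operation on this session now serializes records with Kryo, which
// is where the MapReferenceResolver reset cost shows up for very large records.
spark.sparkContext
  .parallelize(1 to 1000000)
  .map(i => (i % 10, i.toLong))
  .reduceByKey(_ + _)
  .count()
{code}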
[jira] [Resolved] (SPARK-25335) Skip Zinc downloading if it's installed in the system
[ https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25335. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22333 [https://github.com/apache/spark/pull/22333] > Skip Zinc downloading if it's installed in the system > - > > Key: SPARK-25335 > URL: https://issues.apache.org/jira/browse/SPARK-25335 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > Zinc is 23.5MB. > {code} > $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > % Total% Received % Xferd Average Speed TimeTime Time > Current > Dload Upload Total SpentLeft Speed > 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M > {code} > Currently, Spark downloads Zinc once. However, it occurs too many times in > build systems. This issue aims to skip Zinc downloading when the system > already has it. > {code} > $ build/mvn clean > exec: curl --progress-bar -L > https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > > 100.0% > {code} > This will reduce many resources(CPU/Networks/DISK) at least in Mac and > Docker-based build system. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25335) Skip Zinc downloading if it's installed in the system
[ https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25335: - Assignee: Dongjoon Hyun > Skip Zinc downloading if it's installed in the system > - > > Key: SPARK-25335 > URL: https://issues.apache.org/jira/browse/SPARK-25335 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > > Zinc is 23.5MB. > {code} > $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > % Total% Received % Xferd Average Speed TimeTime Time > Current > Dload Upload Total SpentLeft Speed > 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M > {code} > Currently, Spark downloads Zinc once. However, it occurs too many times in > build systems. This issue aims to skip Zinc downloading when the system > already has it. > {code} > $ build/mvn clean > exec: curl --progress-bar -L > https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > > 100.0% > {code} > This will reduce many resources(CPU/Networks/DISK) at least in Mac and > Docker-based build system. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23243) Shuffle+Repartition on an RDD could lead to incorrect answers
[ https://issues.apache.org/jira/browse/SPARK-23243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23243. - Resolution: Fixed Assignee: Wenchen Fan Fix Version/s: 2.4.0 > Shuffle+Repartition on an RDD could lead to incorrect answers > - > > Key: SPARK-23243 > URL: https://issues.apache.org/jira/browse/SPARK-23243 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Jiang Xingbo >Assignee: Wenchen Fan >Priority: Blocker > Labels: correctness > Fix For: 2.4.0 > > > The RDD repartition also uses the round-robin way to distribute data, this > can also cause incorrect answers on RDD workload the similar way as in > https://issues.apache.org/jira/browse/SPARK-23207 > The approach that fixes DataFrame.repartition() doesn't apply on the RDD > repartition issue, as discussed in > https://github.com/apache/spark/pull/20393#issuecomment-360912451 > We track for alternative solutions for this issue in this task. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
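For illustration only, a sketch of the pattern this issue is about: a shuffle followed by RDD.repartition, where the round-robin redistribution can place records differently across task attempts after a fetch failure. The numbers are placeholders.
{code:scala}
// Assumes an existing SparkContext `sc`
val counts = sc.parallelize(1 to 1000000)
  .map(x => (x % 100, 1))
  .reduceByKey(_ + _)   // shuffle map stage
  .repartition(50)      // round-robin redistribution; the step at risk when tasks are retried
counts.count()
{code}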
[jira] [Updated] (SPARK-25268) runParallelPersonalizedPageRank throws serialization Exception
[ https://issues.apache.org/jira/browse/SPARK-25268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-25268: -- Shepherd: Joseph K. Bradley > runParallelPersonalizedPageRank throws serialization Exception > -- > > Key: SPARK-25268 > URL: https://issues.apache.org/jira/browse/SPARK-25268 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.4.0 >Reporter: Bago Amirbekian >Priority: Critical > > A recent change to PageRank introduced a bug in the > ParallelPersonalizedPageRank implementation. The change prevents > serialization of a Map which needs to be broadcast to all workers. The issue > is in this line here: > [https://github.com/apache/spark/blob/6c5cb85856235efd464b109558896f81ae2c4c75/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala#L201] > Because graphx units tests are run in local mode, the Serialization issue is > not caught. > > {code:java} > [info] - Star example parallel personalized PageRank *** FAILED *** (2 > seconds, 160 milliseconds) > [info] java.io.NotSerializableException: > scala.collection.immutable.MapLike$$anon$2 > [info] Serialization stack: > [info] - object not serializable (class: > scala.collection.immutable.MapLike$$anon$2, value: Map(1 -> > SparseVector(3)((0,1.0)), 2 -> SparseVector(3)((1,1.0)), 3 -> > SparseVector(3)((2,1.0 > [info] at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) > [info] at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > [info] at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291) > [info] at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291) > [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348) > [info] at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:292) > [info] at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127) > [info] at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:88) > [info] at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > [info] at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) > [info] at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489) > [info] at > org.apache.spark.graphx.lib.PageRank$.runParallelPersonalizedPageRank(PageRank.scala:205) > [info] at > org.apache.spark.graphx.lib.GraphXHelpers$.runParallelPersonalizedPageRank(GraphXHelpers.scala:31) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRank$.run(ParallelPersonalizedPageRank.scala:115) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRank.run(ParallelPersonalizedPageRank.scala:84) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply$mcV$sp(ParallelPersonalizedPageRankSuite.scala:62) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at 
org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.graphframes.SparkFunSuite.withFixture(SparkFunSuite.scala:40) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > [info] at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > [info] at scala.collection.immutable.List.foreach(List.scala:383) > [info] at org.scalatest
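The non-serializable class in the stack trace, scala.collection.immutable.MapLike$$anon$2, is what Map.mapValues returns: a lazy, non-serializable view. A small illustration of that Scala behaviour (this is not the Spark source itself, and the vector construction is an assumption):
{code:scala}
import org.apache.spark.ml.linalg.Vectors

val sources = Seq(1L, 2L, 3L)

// mapValues produces a view of type MapLike$$anon$2, which is not Serializable,
// so broadcasting it fails exactly as in the test output above.
val lazyView = sources.zipWithIndex.toMap
  .mapValues { i => Vectors.sparse(sources.size, Seq((i, 1.0))) }

// Forcing a strict Map is the usual workaround before broadcasting.
val strict = lazyView.map(identity)
{code}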
[jira] [Resolved] (SPARK-25231) Running a Large Job with Speculation On Causes Executor Heartbeats to Time Out on Driver
[ https://issues.apache.org/jira/browse/SPARK-25231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-25231. --- Resolution: Fixed Assignee: Parth Gandhi Fix Version/s: 2.4.0 2.3.2 > Running a Large Job with Speculation On Causes Executor Heartbeats to Time > Out on Driver > > > Key: SPARK-25231 > URL: https://issues.apache.org/jira/browse/SPARK-25231 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Assignee: Parth Gandhi >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > Running a large Spark job with speculation turned on was causing executor > heartbeats to time out on the driver end after sometime and eventually, after > hitting the max number of executor failures, the job would fail. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
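For reference, a sketch of the configuration knobs involved in the report above; the values shown are the usual defaults, not settings taken from the reporter's job.
{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")                 // the feature implicated in the report
  .set("spark.executor.heartbeatInterval", "10s")   // how often executors heartbeat to the driver
  .set("spark.network.timeout", "120s")             // timeout after which the driver treats heartbeats as lost
{code}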
[jira] [Created] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow
Bryan Cutler created SPARK-25351: Summary: Handle Pandas category type when converting from Python with Arrow Key: SPARK-25351 URL: https://issues.apache.org/jira/browse/SPARK-25351 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 2.3.1 Reporter: Bryan Cutler There needs to be some handling of category types done when calling {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. Without Arrow, Spark casts each element to the category. For example {noformat} In [1]: import pandas as pd In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]}) In [3]: pdf["B"] = pdf["A"].astype('category') In [4]: pdf Out[4]: A B 0 a a 1 b b 2 c c 3 a a In [5]: pdf.dtypes Out[5]: A object Bcategory dtype: object In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False) In [8]: df = spark.createDataFrame(pdf) In [9]: df.show() +---+---+ | A| B| +---+---+ | a| a| | b| b| | c| c| | a| a| +---+---+ In [10]: df.printSchema() root |-- A: string (nullable = true) |-- B: string (nullable = true) In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True) In [19]: df = spark.createDataFrame(pdf) 1667 spark_type = ArrayType(from_arrow_type(at.value_type)) 1668 else: -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " + str(at)) 1670 return spark_type 1671 TypeError: Unsupported type in conversion from Arrow: dictionary {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters
[ https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-21187: - Description: This is to track adding the remaining type support in Arrow Converters. Currently, only primitive data types are supported. ' Remaining types: * -*Date*- * -*Timestamp*- * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map * -*Decimal*- * -*Binary*- * Categorical when converting from Pandas Some things to do before closing this out: * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write values as BigDecimal)- * -Need to add some user docs- * -Make sure Python tests are thorough- * Check into complex type support mentioned in comments by [~leif], should we support mulit-indexing? was: This is to track adding the remaining type support in Arrow Converters. Currently, only primitive data types are supported. ' Remaining types: * -*Date*- * -*Timestamp*- * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map * -*Decimal*- * -*Binary*- Some things to do before closing this out: * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write values as BigDecimal)- * -Need to add some user docs- * -Make sure Python tests are thorough- * Check into complex type support mentioned in comments by [~leif], should we support mulit-indexing? > Complete support for remaining Spark data types in Arrow Converters > --- > > Key: SPARK-21187 > URL: https://issues.apache.org/jira/browse/SPARK-21187 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > > This is to track adding the remaining type support in Arrow Converters. > Currently, only primitive data types are supported. ' > Remaining types: > * -*Date*- > * -*Timestamp*- > * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map > * -*Decimal*- > * -*Binary*- > * Categorical when converting from Pandas > Some things to do before closing this out: > * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write > values as BigDecimal)- > * -Need to add some user docs- > * -Make sure Python tests are thorough- > * Check into complex type support mentioned in comments by [~leif], should > we support mulit-indexing? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604926#comment-16604926 ] Shirish Tatikonda commented on SPARK-19809: --- Thank you [~dongjoon] > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > Attachments: image-2018-02-26-20-29-49-410.png, > spark.sql.hive.convertMetastoreOrc.txt > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > 
org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) >
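As a hedged aside, the config referenced by the attachment name, spark.sql.hive.convertMetastoreOrc, controls whether Spark reads metastore ORC tables through its own data source path instead of the Hive OrcInputFormat seen in the stack trace; whether toggling it avoids the NPE on a given version is an assumption, not something stated in the issue.
{code:scala}
// Assumes an existing SparkSession `spark`; "some_orc_table" is a placeholder name.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.table("some_orc_table").show()
{code}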
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604877#comment-16604877 ] Marcelo Vanzin commented on SPARK-24771: I ran a couple of our tests that exercise avro and they worked fine with 2.4. They're not comprehensive, though: - one uses the data source to read / write data, and that shouldn't really be affected by the change - the other uses {{GenericRecord}}, so it doesn't really use generated Avro types. So I don't really have a test that can say for sure what will break when you use generated types, which is the part that is explicitly called as being changed in 1.8. I still think it would be good to try to shade Avro 1.8 in the data source, and not expose it to other parts of Spark, but otherwise a strongly worded release note might be ok, although not optimal. > Upgrade AVRO version from 1.7.7 to 1.8 > -- > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: release-notes > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25350) Spark Serving
[ https://issues.apache.org/jira/browse/SPARK-25350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604842#comment-16604842 ] Mark Hamilton commented on SPARK-25350: --- Hey, [~rxin], we had talked about this contribution at this past Spark + AI summit and I was wondering if you could at mention someone from your team who would like to check it out and give comments. Thanks so much for the help! > Spark Serving > - > > Key: SPARK-25350 > URL: https://issues.apache.org/jira/browse/SPARK-25350 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mark Hamilton >Priority: Major > Labels: features > > Microsoft has created a new system to turn Structured Streaming jobs into > RESTful web services. We would like to commit this work back to the > community. > More information can be found at the [ MMLSpark > website|[http://www.aka.ms/spark]] > And the [ Spark Serving Documentation > page|[https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md]] > > The code can be found in the MMLSpark Repo and a PR will be made soon: > [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala] > > Thanks for your help and feedback! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25350) Spark Serving
[ https://issues.apache.org/jira/browse/SPARK-25350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamilton updated SPARK-25350: -- Description: Microsoft has created a new system to turn Structured Streaming jobs into RESTful web services. We would like to commit this work back to the community. More information can be found at the [MMLSpark website|[http://www.aka.ms/spark]] And the [Spark Serving Documentation page|[https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md] ] The code can be found in the MMLSpark Repo and a PR will be made soon: [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala] Thanks for your help and feedback! was: Microsoft has created a new system to turn Structured Streaming jobs into RESTful web services. We would like to commit this work back to the community. More information can be found at the [MMLSpark website | [http://www.aka.ms/spark]] And the [Spark Serving Documentation page | [https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md] ] The code can be found in the MMLSpark Repo and a PR will be made soon: [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala] Thanks for your help and feedback! > Spark Serving > - > > Key: SPARK-25350 > URL: https://issues.apache.org/jira/browse/SPARK-25350 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mark Hamilton >Priority: Major > Labels: features > > Microsoft has created a new system to turn Structured Streaming jobs into > RESTful web services. We would like to commit this work back to the > community. > More information can be found at the [MMLSpark > website|[http://www.aka.ms/spark]] > And the [Spark Serving Documentation > page|[https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md] > ] > > The code can be found in the MMLSpark Repo and a PR will be made soon: > [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala] > > Thanks for your help and feedback! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25350) Spark Serving
[ https://issues.apache.org/jira/browse/SPARK-25350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamilton updated SPARK-25350: -- Description: Microsoft has created a new system to turn Structured Streaming jobs into RESTful web services. We would like to commit this work back to the community. More information can be found at the [ MMLSpark website|[http://www.aka.ms/spark]] And the [ Spark Serving Documentation page|[https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md]] The code can be found in the MMLSpark Repo and a PR will be made soon: [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala] Thanks for your help and feedback! was: Microsoft has created a new system to turn Structured Streaming jobs into RESTful web services. We would like to commit this work back to the community. More information can be found at the [MMLSpark website|[http://www.aka.ms/spark]] And the [Spark Serving Documentation page|[https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md] ] The code can be found in the MMLSpark Repo and a PR will be made soon: [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala] Thanks for your help and feedback! > Spark Serving > - > > Key: SPARK-25350 > URL: https://issues.apache.org/jira/browse/SPARK-25350 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mark Hamilton >Priority: Major > Labels: features > > Microsoft has created a new system to turn Structured Streaming jobs into > RESTful web services. We would like to commit this work back to the > community. > More information can be found at the [ MMLSpark > website|[http://www.aka.ms/spark]] > And the [ Spark Serving Documentation > page|[https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md]] > > The code can be found in the MMLSpark Repo and a PR will be made soon: > [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala] > > Thanks for your help and feedback! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25350) Spark Serving
Mark Hamilton created SPARK-25350: - Summary: Spark Serving Key: SPARK-25350 URL: https://issues.apache.org/jira/browse/SPARK-25350 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 2.3.1 Reporter: Mark Hamilton Microsoft has created a new system to turn Structured Streaming jobs into RESTful web services. We would like to commit this work back to the community. More information can be found at the [MMLSpark website | [http://www.aka.ms/spark]] And the [Spark Serving Documentation page | [https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md] ] The code can be found in the MMLSpark Repo and a PR will be made soon: [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala] Thanks for your help and feedback! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25347) Document image data source in doc site
[ https://issues.apache.org/jira/browse/SPARK-25347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-25347: -- Summary: Document image data source in doc site (was: Document image data sources in doc site) > Document image data source in doc site > -- > > Key: SPARK-25347 > URL: https://issues.apache.org/jira/browse/SPARK-25347 > Project: Spark > Issue Type: Story > Components: Documentation >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Major > > Currently, we only have Scala/Java API docs for image data source. It would > be nice to have some documentation in the doc site. So Python/R users can > also discover this feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25345) Deprecate public APIs from ImageSchema
[ https://issues.apache.org/jira/browse/SPARK-25345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-25345: -- Description: After SPARK-22328, we can deprecate the public APIs in ImageSchema (Scala/Python) and remove them in Spark 3.0 (TODO: create JIRA). So users get a unified approach to load images w/ Spark. (was: After SPARK-22328, we can deprecate the public APIs in ImageSchema and remove them in Spark 3.0 (TODO: create JIRA). So users get a unified approach to load images w/ Spark.) > Deprecate public APIs from ImageSchema > -- > > Key: SPARK-25345 > URL: https://issues.apache.org/jira/browse/SPARK-25345 > Project: Spark > Issue Type: Story > Components: ML >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Major > > After SPARK-22328, we can deprecate the public APIs in ImageSchema > (Scala/Python) and remove them in Spark 3.0 (TODO: create JIRA). So users get > a unified approach to load images w/ Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25349) Support sample pushdown in Data Source V2
Xiangrui Meng created SPARK-25349: - Summary: Support sample pushdown in Data Source V2 Key: SPARK-25349 URL: https://issues.apache.org/jira/browse/SPARK-25349 Project: Spark Issue Type: Story Components: SQL Affects Versions: 3.0.0 Reporter: Xiangrui Meng Supporting sample pushdown would help file-based data source implementations save I/O cost significantly if they can decide whether to read a file or not. cc: [~cloud_fan] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
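A quick sketch of why pushdown matters here: today a sample applied after the scan still reads every file, whereas a pushed-down sample could let the source skip whole files. The path and fraction below are placeholders.
{code:scala}
// Assumes an existing SparkSession `spark`.
// Without pushdown, sample() filters rows only after they have been read from every file.
val sampled = spark.read.parquet("/path/to/table")
  .sample(withReplacement = false, fraction = 0.01)
sampled.count()
{code}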
[jira] [Created] (SPARK-25348) Data source for binary files
Xiangrui Meng created SPARK-25348: - Summary: Data source for binary files Key: SPARK-25348 URL: https://issues.apache.org/jira/browse/SPARK-25348 Project: Spark Issue Type: Story Components: ML, SQL Affects Versions: 3.0.0 Reporter: Xiangrui Meng It would be useful to have a data source implementation for binary files, which can be used to build features to load images, audio, and videos. Microsoft has an implementation at [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be great if we can merge it into Spark main repo. cc: [~mhamilton] and [~imatiach] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25347) Document image data sources in doc site
Xiangrui Meng created SPARK-25347: - Summary: Document image data sources in doc site Key: SPARK-25347 URL: https://issues.apache.org/jira/browse/SPARK-25347 Project: Spark Issue Type: Story Components: Documentation Affects Versions: 2.4.0 Reporter: Xiangrui Meng Currently, we only have Scala/Java API docs for image data source. It would be nice to have some documentation in the doc site. So Python/R users can also discover this feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25346) Document Spark builtin data sources
[ https://issues.apache.org/jira/browse/SPARK-25346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-25346: -- Summary: Document Spark builtin data sources (was: Document Spark built-in data sources) > Document Spark builtin data sources > --- > > Key: SPARK-25346 > URL: https://issues.apache.org/jira/browse/SPARK-25346 > Project: Spark > Issue Type: Story > Components: Documentation >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Major > > It would be nice to list built-in data sources in the doc site. So users know > what are available by default. However, I didn't find any from 2.3.1 docs. > > cc: [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25346) Document Spark built-in data sources
[ https://issues.apache.org/jira/browse/SPARK-25346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-25346: -- Summary: Document Spark built-in data sources (was: Document Spark buit-in data sources) > Document Spark built-in data sources > > > Key: SPARK-25346 > URL: https://issues.apache.org/jira/browse/SPARK-25346 > Project: Spark > Issue Type: Story > Components: Documentation >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Major > > It would be nice to list built-in data sources in the doc site. So users know > what are available by default. However, I didn't find any from 2.3.1 docs. > > cc: [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25346) Document Spark buit-in data sources
Xiangrui Meng created SPARK-25346: - Summary: Document Spark buit-in data sources Key: SPARK-25346 URL: https://issues.apache.org/jira/browse/SPARK-25346 Project: Spark Issue Type: Story Components: Documentation Affects Versions: 2.4.0 Reporter: Xiangrui Meng It would be nice to list built-in data sources in the doc site. So users know what are available by default. However, I didn't find any from 2.3.1 docs. cc: [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25345) Deprecate public APIs from ImageSchema
Xiangrui Meng created SPARK-25345: - Summary: Deprecate public APIs from ImageSchema Key: SPARK-25345 URL: https://issues.apache.org/jira/browse/SPARK-25345 Project: Spark Issue Type: Story Components: ML Affects Versions: 2.4.0 Reporter: Xiangrui Meng After SPARK-22328, we can deprecate the public APIs in ImageSchema and remove them in Spark 3.0 (TODO: create JIRA). So users get a unified approach to load images w/ Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22666) Spark datasource for image format
[ https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-22666. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22328 [https://github.com/apache/spark/pull/22328] > Spark datasource for image format > - > > Key: SPARK-22666 > URL: https://issues.apache.org/jira/browse/SPARK-22666 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Timothy Hunter >Assignee: Weichen Xu >Priority: Major > Fix For: 2.4.0 > > > The current API for the new image format is implemented as a standalone > feature, in order to make it reside within the mllib package. As discussed in > SPARK-21866, users should be able to load images through the more common > spark source reader interface. > This ticket is concerned with adding image reading support in the spark > source API, through either of the following interfaces: > - {{spark.read.format("image")...}} > - {{spark.read.image}} > The output is a dataframe that contains images (and the file names for > example), following the semantics discussed already in SPARK-21866. > A few technical notes: > * since the functionality is implemented in {{mllib}}, calling this function > may fail at runtime if users have not imported the {{spark-mllib}} dependency > * How to deal with very flat directories? It is common to have millions of > files in a single "directory" (like in S3), which seems to have caused some > issues to some users. If this issue is too complex to handle in this ticket, > it can be dealt with separately. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
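A minimal usage sketch of the reader interface this issue adds, alongside the older ImageSchema API it supersedes (see SPARK-25345 above for its deprecation); the path is a placeholder.
{code:scala}
import org.apache.spark.ml.image.ImageSchema

// Older standalone API
val viaImageSchema = ImageSchema.readImages("/path/to/images")

// Unified data source API added by this issue
val viaSource = spark.read.format("image").load("/path/to/images")
viaSource.printSchema()   // image struct: origin, height, width, nChannels, mode, data
{code}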
[jira] [Created] (SPARK-25344) Break large tests.py files into smaller files
Imran Rashid created SPARK-25344: Summary: Break large tests.py files into smaller files Key: SPARK-25344 URL: https://issues.apache.org/jira/browse/SPARK-25344 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.0 Reporter: Imran Rashid We've got a ton of tests in one humongous tests.py file, rather than breaking it out into smaller files. Having one huge file doesn't seem great for code organization, and it also makes the test parallelization in run-tests.py not work as well. On my laptop, tests.py takes 150s, and the next longest test file takes only 20s. There are similarly large files in other pyspark modules, e.g. sql/tests.py, ml/tests.py, mllib/tests.py, streaming/tests.py. It seems that at least for some of these files, it's already broken into independent test classes, so it shouldn't be too hard to just move them into their own files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24360) Support Hive 3.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-24360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604748#comment-16604748 ] Dongjoon Hyun commented on SPARK-24360: --- [~toopt4]. Yep. We should support Hive 3.1 in this JIRA. > Support Hive 3.0 metastore > -- > > Key: SPARK-24360 > URL: https://issues.apache.org/jira/browse/SPARK-24360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > Hive 3.0.0 is released. This issue aims to support Hive Metastore 3.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24360) Support Hive 3.1 metastore
[ https://issues.apache.org/jira/browse/SPARK-24360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24360: -- Summary: Support Hive 3.1 metastore (was: Support Hive 3.0 metastore) > Support Hive 3.1 metastore > -- > > Key: SPARK-24360 > URL: https://issues.apache.org/jira/browse/SPARK-24360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > Hive 3.0.0 is released. This issue aims to support Hive Metastore 3.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
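For context, the metastore version is selected through configuration; a sketch of what using the requested version might look like once supported ("3.1" is the value this JIRA asks for, not one guaranteed to work today):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.hive.metastore.version", "3.1")  // version this JIRA asks to support
  .config("spark.sql.hive.metastore.jars", "maven")   // fetch matching metastore client jars
  .enableHiveSupport()
  .getOrCreate()
{code}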
[jira] [Updated] (SPARK-25343) Extend CSV parsing to Dataset[List[String]]
[ https://issues.apache.org/jira/browse/SPARK-25343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Kemmer updated SPARK-25343: - Description: With the cvs() method it is currenty possible to create a Dataframe from Dataset[String], where the given string contains comma separated values. This is really great. But very often we have to parse files where we have to split the values of a line by very individual value separators and regular expressions. The result is a Dataset[List[String]]. This list corresponds to what you would get, after splitting the values of a CSV string at the separators. It would be great, if the csv() method would also accept such a Dataset as input especially given a target schema. The csv parser usually casts the separated values against the schema and can sort out lines where the values of the columns do not fit with the schema. This is especially interesting with PERMISSIVE mode and a column for corrupt records which then should contain the input list of strings as a dumped JSON string. This is the functionality I am looking for and I think the CSV parser is very close to it. was: With the cvs() method it is currenty possible to create a Dataframe from Dataset[String], where the given string contains comma separated values. This is really great. But very often we have to parse files where we have to split the values of a line by very individual value separators and regular expressions. The result is a Dataset[List[String]]. This list corresponds to what you would get, after splitting the values of a CSV string at the separators. It would be great, if the csv() method would also accept such a Dataset as input especially given a target schema. The csv parser usually casts the separated values against the schema and can sort out lines where the values of the columns do not fit with the schema. This is especially interesting with PERMISSIVE mode and a column for corrupt records which then should contain the input list of strings as a dumped JSON string. This is the functionality I am looking for and I think it is already implemented in the CSV parser. > Extend CSV parsing to Dataset[List[String]] > --- > > Key: SPARK-25343 > URL: https://issues.apache.org/jira/browse/SPARK-25343 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Frank Kemmer >Priority: Minor > > With the cvs() method it is currenty possible to create a Dataframe from > Dataset[String], where the given string contains comma separated values. This > is really great. > But very often we have to parse files where we have to split the values of a > line by very individual value separators and regular expressions. The result > is a Dataset[List[String]]. This list corresponds to what you would get, > after splitting the values of a CSV string at the separators. > It would be great, if the csv() method would also accept such a Dataset as > input especially given a target schema. The csv parser usually casts the > separated values against the schema and can sort out lines where the values > of the columns do not fit with the schema. > This is especially interesting with PERMISSIVE mode and a column for corrupt > records which then should contain the input list of strings as a dumped JSON > string. > This is the functionality I am looking for and I think the CSV parser is very > close to it. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
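A hedged workaround sketch for the request above: re-join the already-split tokens with a delimiter that cannot appear in the data, then hand the result to the existing csv() reader together with the target schema, so PERMISSIVE mode and the corrupt-record column still apply. The schema, delimiter, and sample rows below are assumptions for illustration.
{code:scala}
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.types._

// Assumes an existing SparkSession `spark`
import spark.implicits._

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("score", DoubleType)))

// Tokens as they would come out of a custom splitter
val tokens: Dataset[List[String]] = Seq(
  List("1", "alice", "0.5"),
  List("2", "bob", "oops")   // should land in the corrupt-record column
).toDS()

val sep = "\u0001"
val parsed = spark.read
  .schema(schema.add("_corrupt_record", StringType))
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .option("sep", sep)
  .csv(tokens.map(_.mkString(sep)))
{code}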
[jira] [Updated] (SPARK-25343) Extend CSV parsing to Dataset[List[String]]
[ https://issues.apache.org/jira/browse/SPARK-25343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Kemmer updated SPARK-25343: - Description: With the cvs() method it is currenty possible to create a Dataframe from Dataset[String], where the given string contains comma separated values. This is really great. But very often we have to parse files where we have to split the values of a line by very individual value separators and regular expressions. The result is a Dataset[List[String]]. This list corresponds to what you would get, after splitting the values of a CSV string at the separators. It would be great, if the csv() method would also accept such a Dataset as input especially given a target schema. The csv parser usually casts the separated values against the schema and can sort out lines where the values of the columns do not fit with the schema. This is especially interesting with PERMISSIVE mode and a column for corrupt records which then should contain the input list of strings as a dumped JSON string. This is the functionality I am looking for and I think it is already implemented in the CSV parser. was: With the cvs() method it is currenty possible to create a Dataframe from Dataset[String], where the given string contains comma separated values. This is really great. But very often we have to parse files where we have to split the values of a line by very individual value separators and regular expressions. The result is a Dataset[List[String]]. This list corresponds to what you would get, after splitting the values of a CSV string. It would be great, if the csv() method would also accept such a Dataset as input especially given a target schema. The csv parser usually casts the separated values against the schema and can sort out lines where the values of the columns do not fit with the schema. This is especially interesting with PERMISSIVE mode and a column for corrupt records which then should contain the input list of strings as a dumped JSON string. This is the functionality I am looking for and I think it is already implemented in the CSV parser. > Extend CSV parsing to Dataset[List[String]] > --- > > Key: SPARK-25343 > URL: https://issues.apache.org/jira/browse/SPARK-25343 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Frank Kemmer >Priority: Minor > > With the cvs() method it is currenty possible to create a Dataframe from > Dataset[String], where the given string contains comma separated values. This > is really great. > But very often we have to parse files where we have to split the values of a > line by very individual value separators and regular expressions. The result > is a Dataset[List[String]]. This list corresponds to what you would get, > after splitting the values of a CSV string at the separators. > It would be great, if the csv() method would also accept such a Dataset as > input especially given a target schema. The csv parser usually casts the > separated values against the schema and can sort out lines where the values > of the columns do not fit with the schema. > This is especially interesting with PERMISSIVE mode and a column for corrupt > records which then should contain the input list of strings as a dumped JSON > string. > This is the functionality I am looking for and I think it is already > implemented in the CSV parser. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25343) Extend CSV parsing to Dataset[List[String]]
[ https://issues.apache.org/jira/browse/SPARK-25343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Kemmer updated SPARK-25343: - Description: With the cvs() method it is currenty possible to create a Dataframe from Dataset[String], where the given string contains comma separated values. This is really great. But very often we have to parse files where we have to split the values of a line by very individual value separators and regular expressions. The result is a Dataset[List[String]]. This list corresponds to what you would get, after splitting the values of a CSV string. It would be great, if the csv() method would also accept such a Dataset as input especially given a target schema. The csv parser usually casts the separated values against the schema and can sort out lines where the values of the columns do not fit with the schema. This is especially interesting with PERMISSIVE mode and a column for corrupt records which then should contain the input list of strings as a dumped JSON string. This is the functionality I am looking for and I think it is already implemented in the CSV parser. was: With the cvs() method it is currenty possible to create a Dataframe from Dataset[String], where the given string contains comma separated values. This is really great. But very often we have to parse files where we have to split the values of a line by very individual value separators and regular expressions. The result is a Dataset[List[String]]. This list corresponds to what you would get, after splitting the values of a CSV string. It would be great, if the csv() method would also accept such a Dataset as input especially given a target schema. The csv parser usually casts the separated values against the schema and can sort out lines where the values of the columns do not fit with the schema. This is the functionality I am looking for and I think it is already implemented in the CSV parser. > Extend CSV parsing to Dataset[List[String]] > --- > > Key: SPARK-25343 > URL: https://issues.apache.org/jira/browse/SPARK-25343 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Frank Kemmer >Priority: Minor > > With the cvs() method it is currenty possible to create a Dataframe from > Dataset[String], where the given string contains comma separated values. This > is really great. > But very often we have to parse files where we have to split the values of a > line by very individual value separators and regular expressions. The result > is a Dataset[List[String]]. This list corresponds to what you would get, > after splitting the values of a CSV string. > It would be great, if the csv() method would also accept such a Dataset as > input especially given a target schema. The csv parser usually casts the > separated values against the schema and can sort out lines where the values > of the columns do not fit with the schema. > This is especially interesting with PERMISSIVE mode and a column for corrupt > records which then should contain the input list of strings as a dumped JSON > string. > This is the functionality I am looking for and I think it is already > implemented in the CSV parser. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25339) Refactor FilterPushdownBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604738#comment-16604738 ] Dongjoon Hyun commented on SPARK-25339: --- Thank you for filing this so that it doesn't get forgotten. I'm okay with it. If you want, you can work on this, [~yumwang]. > Refactor FilterPushdownBenchmark to use main method > --- > > Key: SPARK-25339 > URL: https://issues.apache.org/jira/browse/SPARK-25339 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > Wenchen commented on the PR: > https://github.com/apache/spark/pull/22336#issuecomment-418604019 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25343) Extend CSV parsing to Dataset[List[String]]
Frank Kemmer created SPARK-25343: Summary: Extend CSV parsing to Dataset[List[String]] Key: SPARK-25343 URL: https://issues.apache.org/jira/browse/SPARK-25343 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.1 Reporter: Frank Kemmer With the cvs() method it is currenty possible to create a Dataframe from Dataset[String], where the given string contains comma separated values. This is really great. But very often we have to parse files where we have to split the values of a line by very individual value separators and regular expressions. The result is a Dataset[List[String]]. This list corresponds to what you would get, after splitting the values of a CSV string. It would be great, if the csv() method would also accept such a Dataset as input especially given a target schema. The csv parser usually casts the separated values against the schema and can sort out lines where the values of the columns do not fit with the schema. This is the functionality I am looking for and I think it is already implemented in the CSV parser. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25342) Support rolling back a result stage
Wenchen Fan created SPARK-25342: --- Summary: Support rolling back a result stage Key: SPARK-25342 URL: https://issues.apache.org/jira/browse/SPARK-25342 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: Wenchen Fan This is a follow-up of https://issues.apache.org/jira/browse/SPARK-23243 To completely fix that problem, Spark needs to be able to roll back a result stage and rerun all the result tasks. However, the result stage may do file committing, which currently does not support re-committing a task. We should either support rolling back a committed task, or abort the entire commit and do it again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25341) Support rolling back a shuffle map stage and re-generate the shuffle files
Wenchen Fan created SPARK-25341: --- Summary: Support rolling back a shuffle map stage and re-generate the shuffle files Key: SPARK-25341 URL: https://issues.apache.org/jira/browse/SPARK-25341 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: Wenchen Fan This is a follow-up of https://issues.apache.org/jira/browse/SPARK-23243 To completely fix that problem, Spark needs to be able to roll back a shuffle map stage and rerun all the map tasks. According to https://github.com/apache/spark/pull/9214 , Spark doesn't support it currently, as in shuffle writing "first write wins". Since overwriting shuffle files is hard, we can extend the shuffle id to include a "shuffle generation number". Then the reduce task can specify which generation of shuffle it wants to read. https://github.com/apache/spark/pull/6648 seems to be in the right direction. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24748) Support for reporting custom metrics via Streaming Query Progress
[ https://issues.apache.org/jira/browse/SPARK-24748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24748: Assignee: (was: Apache Spark) > Support for reporting custom metrics via Streaming Query Progress > - > > Key: SPARK-24748 > URL: https://issues.apache.org/jira/browse/SPARK-24748 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Arun Mahadevan >Priority: Major > > Currently the Structured Streaming sources and sinks do not have a way to > report custom metrics. Providing an option to report custom metrics and > making it available via Streaming Query progress can enable sources and sinks > to report custom progress information (e.g. the lag metrics for the Kafka source). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24748) Support for reporting custom metrics via Streaming Query Progress
[ https://issues.apache.org/jira/browse/SPARK-24748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24748: Assignee: Apache Spark > Support for reporting custom metrics via Streaming Query Progress > - > > Key: SPARK-24748 > URL: https://issues.apache.org/jira/browse/SPARK-24748 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Arun Mahadevan >Assignee: Apache Spark >Priority: Major > > Currently the Structured Streaming sources and sinks do not have a way to > report custom metrics. Providing an option to report custom metrics and > making it available via Streaming Query progress can enable sources and sinks > to report custom progress information (e.g. the lag metrics for the Kafka source). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24539) HistoryServer does not display metrics from tasks that complete after stage failure
[ https://issues.apache.org/jira/browse/SPARK-24539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Gupta resolved SPARK-24539. - Resolution: Duplicate Resolving this as it has been fixed by SPARK-24415. > HistoryServer does not display metrics from tasks that complete after stage > failure > --- > > Key: SPARK-24539 > URL: https://issues.apache.org/jira/browse/SPARK-24539 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: Imran Rashid >Priority: Major > > I noticed that task metrics for completed tasks with a stage failure do not > show up in the new history server. I have a feeling this is because all of > the tasks succeeded *after* the stage had been failed (so they were > completions from a "zombie" taskset). The task metrics (eg. the shuffle read > size & shuffle write size) do not show up at all, either in the task table, > the executor table, or the overall stage summary metrics. (they might not > show up in the job summary page either, but in the event logs I have, there > is another successful stage attempt after this one, and that is the only > thing which shows up in the jobs page.) If you get task details from the api > endpoint (eg. > http://[host]:[port]/api/v1/applications/[app-id]/stages/[stage-id]/[stage-attempt]) > then you can see the successful tasks and all the metrics > Unfortunately the event logs I have are huge and I don't have a small repro > handy, but I hope that description is enough to go on. > I loaded the event logs I have in the SHS from spark 2.2 and they appear fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
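The REST endpoint mentioned in the description can be queried directly, which is handy for confirming that the metrics exist even when the stage page omits them. A small sketch; host, port, application id, stage id and attempt are placeholders:

{code:scala}
import scala.io.Source

// 18080 is the default history server port; substitute your own application/stage ids.
val url = "http://localhost:18080/api/v1/applications/app-20180905120000-0000/stages/3/0"
val json = Source.fromURL(url).mkString
println(json) // successful task attempts and their metrics appear here
{code}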
[jira] [Assigned] (SPARK-24415) Stage page aggregated executor metrics wrong when failures
[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24415: -- Assignee: Ankur Gupta > Stage page aggregated executor metrics wrong when failures > --- > > Key: SPARK-24415 > URL: https://issues.apache.org/jira/browse/SPARK-24415 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Ankur Gupta >Priority: Critical > Fix For: 2.4.0 > > Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png > > > Running with spark 2.3 on yarn and having task failures and blacklisting, the > aggregated metrics by executor are not correct. In my example it should have > 2 failed tasks but it only shows one. Note I tested with master branch to > verify its not fixed. > I will attach screen shot. > To reproduce: > $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client > --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" > --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf > "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf > "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf > "spark.blacklist.killBlacklistedExecutors=true" > import org.apache.spark.SparkEnv > sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt > >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad > executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect() -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24415) Stage page aggregated executor metrics wrong when failures
[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-24415. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22209 [https://github.com/apache/spark/pull/22209] > Stage page aggregated executor metrics wrong when failures > --- > > Key: SPARK-24415 > URL: https://issues.apache.org/jira/browse/SPARK-24415 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Priority: Critical > Fix For: 2.4.0 > > Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png > > > Running with spark 2.3 on yarn and having task failures and blacklisting, the > aggregated metrics by executor are not correct. In my example it should have > 2 failed tasks but it only shows one. Note I tested with master branch to > verify its not fixed. > I will attach screen shot. > To reproduce: > $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client > --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" > --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf > "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf > "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf > "spark.blacklist.killBlacklistedExecutors=true" > import org.apache.spark.SparkEnv > sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt > >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad > executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect() -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec
[ https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604618#comment-16604618 ] Apache Spark commented on SPARK-14922: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/20999 > Alter Table Drop Partition Using Predicate-based Partition Spec > --- > > Key: SPARK-14922 > URL: https://issues.apache.org/jira/browse/SPARK-14922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.2, 2.2.1 >Reporter: Xiao Li >Priority: Major > > Below is allowed in Hive, but not allowed in Spark. > {noformat} > alter table ptestfilter drop partition (c='US', d<'2') > {noformat} > This example is copied from drop_partitions_filter.q -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
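For contrast, a quick sketch of the current behaviour (table and values taken from the Hive test above): equality-based partition specs already work in Spark SQL, while the comparison-based spec is the missing piece this issue asks for:

{code:scala}
// Supported today: dropping a partition by exact values.
spark.sql("ALTER TABLE ptestfilter DROP PARTITION (c='US', d='2')")

// Works in Hive but not in Spark at the time of this issue:
// spark.sql("ALTER TABLE ptestfilter DROP PARTITION (c='US', d<'2')")
{code}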
[jira] [Resolved] (SPARK-25279) Throw exception: zzcclp java.io.NotSerializableException: org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc
[ https://issues.apache.org/jira/browse/SPARK-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhichao Zhang resolved SPARK-25279. Resolution: Won't Fix > Throw exception: zzcclp java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc > --- > > Key: SPARK-25279 > URL: https://issues.apache.org/jira/browse/SPARK-25279 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1 >Reporter: Zhichao Zhang >Priority: Minor > > Hi dev: > I am using Spark-Shell to run the example which is in section > '[http://spark.apache.org/docs/2.2.2/sql-programming-guide.html#type-safe-user-defined-aggregate-functions'], > > and there is an error: > {code:java} > Caused by: java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn > Serialization stack: > - object not serializable (class: org.apache.spark.sql.TypedColumn, > value: > myaverage() AS `average_salary`) > - field (class: $iw, name: averageSalary, type: class > org.apache.spark.sql.TypedColumn) > - object (class $iw, $iw@4b2f8ae9) > - field (class: MyAverage$, name: $outer, type: class $iw) > - object (class MyAverage$, MyAverage$@2be41d90) > - field (class: > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator) > - object (class > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > MyAverage(Employee)) > - field (class: > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > name: aggregateFunction, type: class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction) > - object (class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class Employee)), > Some(class Employee), Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)) > - writeObject data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.List$SerializationProxy, > scala.collection.immutable.List$SerializationProxy@5e92c46f) > - writeReplace data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.$colon$colon, > List(partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0))) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: > aggregateExpressions, type: interface scala.collection.Seq) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, > ObjectHashAggregate(keys=[], > functions=[partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > 
Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)], output=[buf#37]) > +- *FileScan json [name#8,salary#9L] Batched: false, Format: JSON, Location: > InMemoryFileIndex[file:/opt/spark2/examples/src/main/resources/employees.json], > > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > ) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > name: $outer, type: class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > ) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2, > > name: $outer, type: class > org.apache.spark.sql.execution
[jira] [Closed] (SPARK-25279) Throw exception: zzcclp java.io.NotSerializableException: org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc
[ https://issues.apache.org/jira/browse/SPARK-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhichao Zhang closed SPARK-25279. -- > Throw exception: zzcclp java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc > --- > > Key: SPARK-25279 > URL: https://issues.apache.org/jira/browse/SPARK-25279 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1 >Reporter: Zhichao Zhang >Priority: Minor > > Hi dev: > I am using Spark-Shell to run the example which is in section > '[http://spark.apache.org/docs/2.2.2/sql-programming-guide.html#type-safe-user-defined-aggregate-functions'], > > and there is an error: > {code:java} > Caused by: java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn > Serialization stack: > - object not serializable (class: org.apache.spark.sql.TypedColumn, > value: > myaverage() AS `average_salary`) > - field (class: $iw, name: averageSalary, type: class > org.apache.spark.sql.TypedColumn) > - object (class $iw, $iw@4b2f8ae9) > - field (class: MyAverage$, name: $outer, type: class $iw) > - object (class MyAverage$, MyAverage$@2be41d90) > - field (class: > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator) > - object (class > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > MyAverage(Employee)) > - field (class: > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > name: aggregateFunction, type: class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction) > - object (class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class Employee)), > Some(class Employee), Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)) > - writeObject data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.List$SerializationProxy, > scala.collection.immutable.List$SerializationProxy@5e92c46f) > - writeReplace data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.$colon$colon, > List(partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0))) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: > aggregateExpressions, type: interface scala.collection.Seq) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, > ObjectHashAggregate(keys=[], > functions=[partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS 
sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)], output=[buf#37]) > +- *FileScan json [name#8,salary#9L] Batched: false, Format: JSON, Location: > InMemoryFileIndex[file:/opt/spark2/examples/src/main/resources/employees.json], > > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > ) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > name: $outer, type: class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > ) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2, > > name: $outer, type: class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregate
[jira] [Commented] (SPARK-25279) Throw exception: zzcclp java.io.NotSerializableException: org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc
[ https://issues.apache.org/jira/browse/SPARK-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604604#comment-16604604 ] Zhichao Zhang commented on SPARK-25279: [~viirya], Thanks. I closed this issue. > Throw exception: zzcclp java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc > --- > > Key: SPARK-25279 > URL: https://issues.apache.org/jira/browse/SPARK-25279 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1 >Reporter: Zhichao Zhang >Priority: Minor > > Hi dev: > I am using Spark-Shell to run the example which is in section > '[http://spark.apache.org/docs/2.2.2/sql-programming-guide.html#type-safe-user-defined-aggregate-functions'], > > and there is an error: > {code:java} > Caused by: java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn > Serialization stack: > - object not serializable (class: org.apache.spark.sql.TypedColumn, > value: > myaverage() AS `average_salary`) > - field (class: $iw, name: averageSalary, type: class > org.apache.spark.sql.TypedColumn) > - object (class $iw, $iw@4b2f8ae9) > - field (class: MyAverage$, name: $outer, type: class $iw) > - object (class MyAverage$, MyAverage$@2be41d90) > - field (class: > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator) > - object (class > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > MyAverage(Employee)) > - field (class: > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > name: aggregateFunction, type: class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction) > - object (class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class Employee)), > Some(class Employee), Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)) > - writeObject data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.List$SerializationProxy, > scala.collection.immutable.List$SerializationProxy@5e92c46f) > - writeReplace data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.$colon$colon, > List(partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0))) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: > aggregateExpressions, type: interface scala.collection.Seq) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, > ObjectHashAggregate(keys=[], > functions=[partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > 
StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)], output=[buf#37]) > +- *FileScan json [name#8,salary#9L] Batched: false, Format: JSON, Location: > InMemoryFileIndex[file:/opt/spark2/examples/src/main/resources/employees.json], > > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > ) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > name: $outer, type: class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > ) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfu
[jira] [Commented] (SPARK-25132) Case-insensitive field resolution when reading from Parquet
[ https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604532#comment-16604532 ] Apache Spark commented on SPARK-25132: -- User 'seancxmao' has created a pull request for this issue: https://github.com/apache/spark/pull/22343 > Case-insensitive field resolution when reading from Parquet > --- > > Key: SPARK-25132 > URL: https://issues.apache.org/jira/browse/SPARK-25132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.1 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Major > Labels: Parquet > Fix For: 2.4.0 > > > Spark SQL returns NULL for a column whose Hive metastore schema and Parquet > schema are in different letter cases, regardless of spark.sql.caseSensitive > set to true or false. > Here is a simple example to reproduce this issue: > scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1") > spark-sql> show create table t1; > CREATE TABLE `t1` (`id` BIGINT) > USING parquet > OPTIONS ( > `serialization.format` '1' > ) > spark-sql> CREATE TABLE `t2` (`ID` BIGINT) > > USING parquet > > LOCATION 'hdfs://localhost/user/hive/warehouse/t1'; > spark-sql> select * from t1; > 0 > 1 > 2 > 3 > 4 > spark-sql> select * from t2; > NULL > NULL > NULL > NULL > NULL > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25132) Case-insensitive field resolution when reading from Parquet
[ https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604529#comment-16604529 ] Apache Spark commented on SPARK-25132: -- User 'seancxmao' has created a pull request for this issue: https://github.com/apache/spark/pull/22343 > Case-insensitive field resolution when reading from Parquet > --- > > Key: SPARK-25132 > URL: https://issues.apache.org/jira/browse/SPARK-25132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.1 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Major > Labels: Parquet > Fix For: 2.4.0 > > > Spark SQL returns NULL for a column whose Hive metastore schema and Parquet > schema are in different letter cases, regardless of spark.sql.caseSensitive > set to true or false. > Here is a simple example to reproduce this issue: > scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1") > spark-sql> show create table t1; > CREATE TABLE `t1` (`id` BIGINT) > USING parquet > OPTIONS ( > `serialization.format` '1' > ) > spark-sql> CREATE TABLE `t2` (`ID` BIGINT) > > USING parquet > > LOCATION 'hdfs://localhost/user/hive/warehouse/t1'; > spark-sql> select * from t1; > 0 > 1 > 2 > 3 > 4 > spark-sql> select * from t2; > NULL > NULL > NULL > NULL > NULL > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23443) Spark with Glue as external catalog
[ https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604441#comment-16604441 ] Ameen Tayyebi commented on SPARK-23443: --- I've been sidetracked with lots of other projects, so at this time, I don't have bandwidth to work on this unfortunately :( :( > Spark with Glue as external catalog > --- > > Key: SPARK-23443 > URL: https://issues.apache.org/jira/browse/SPARK-23443 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Ameen Tayyebi >Priority: Major > > AWS Glue Catalog is an external Hive metastore backed by a web service. It > allows permanent storage of catalog data for BigData use cases. > To find out more information about AWS Glue, please consult: > * AWS Glue - [https://aws.amazon.com/glue/] > * Using Glue as a Metastore catalog for Spark - > [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html] > Today, the integration of Glue and Spark is through the Hive layer. Glue > implements the IMetaStore interface of Hive and for installations of Spark > that contain Hive, Glue can be used as the metastore. > The feature set that Glue supports does not align 1-1 with the set of > features that the latest version of Spark supports. For example, Glue > interface supports more advanced partition pruning than the latest version of > Hive embedded in Spark. > To enable a more natural integration with Spark and to allow leveraging > latest features of Glue, without being coupled to Hive, a direct integration > through Spark's own Catalog API is proposed. This Jira tracks this work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25228) Add executor CPU Time metric
[ https://issues.apache.org/jira/browse/SPARK-25228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25228. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22218 [https://github.com/apache/spark/pull/22218] > Add executor CPU Time metric > - > > Key: SPARK-25228 > URL: https://issues.apache.org/jira/browse/SPARK-25228 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 2.4.0 > > Attachments: Spark_Metric_executorCPUTIme_Grafana_dashboard.PNG > > > I propose to add a new metric to measure the executor's process CPU time. > This allows implementing monitoring of CPU resources used by Spark for > example using a Grafana dashboard, as in the attached example screenshot. > Note: this is similar and builds on top of the work in SPARK-22190. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25228) Add executor CPU Time metric
[ https://issues.apache.org/jira/browse/SPARK-25228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25228: - Assignee: Luca Canali > Add executor CPU Time metric > - > > Key: SPARK-25228 > URL: https://issues.apache.org/jira/browse/SPARK-25228 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 2.4.0 > > Attachments: Spark_Metric_executorCPUTIme_Grafana_dashboard.PNG > > > I propose to add a new metric to measure the executor's process CPU time. > This allows implementing monitoring of CPU resources used by Spark for > example using a Grafana dashboard, as in the attached example screenshot. > Note: this is similar and builds on top of the work in SPARK-22190. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
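To chart a metric like this in Grafana, the metrics system needs a sink that the dashboard can read from; Graphite is a common choice. A sketch of wiring a Graphite sink purely through Spark conf, assuming the metrics system picks up spark.metrics.conf.* entries and that a Carbon endpoint is reachable at the placeholder host/port:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("executor-metrics-demo")
  // Everything below is equivalent to entries in metrics.properties.
  .config("spark.metrics.conf.*.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink")
  .config("spark.metrics.conf.*.sink.graphite.host", "graphite.example.com") // placeholder
  .config("spark.metrics.conf.*.sink.graphite.port", "2003")                 // placeholder
  .config("spark.metrics.conf.*.sink.graphite.period", "10")
  .config("spark.metrics.conf.*.sink.graphite.unit", "seconds")
  .getOrCreate()
{code}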
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604391#comment-16604391 ] Sean Owen commented on SPARK-18112: --- I don't know much about this part, but do we need Hive 2.x on the Spark (client) side in order to read from Hive 2.x metastore? Are you including Hive 2.x in your app? I don't know if that works. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25340) Pushes down Sample beneath deterministic Project
[ https://issues.apache.org/jira/browse/SPARK-25340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-25340: - Description: If computations in Project are heavy (e.g., UDFs), it is useful to push down sample nodes into deterministic projects; {code} scala> spark.range(10).selectExpr("id + 3").sample(0.5).explain(true) // without this proposal == Analyzed Logical Plan == (id + 3): bigint Sample 0.0, 0.5, false, 3370873312340343855 +- Project [(id#0L + cast(3 as bigint)) AS (id + 3)#2L] +- Range (0, 10, step=1, splits=Some(4)) == Optimized Logical Plan == Sample 0.0, 0.5, false, 3370873312340343855 +- Project [(id#0L + 3) AS (id + 3)#2L] +- Range (0, 10, step=1, splits=Some(4)) // with this proposal == Optimized Logical Plan == Project [(id#0L + 3) AS (id + 3)#2L] +- Sample 0.0, 0.5, false, -6519017078291024113 +- Range (0, 10, step=1, splits=Some(4)) {code} POC: https://github.com/apache/spark/compare/master...maropu:SamplePushdown was: If computations in Project are heavy (e.g., UDFs), it is useful to push down sample nodes into deterministic projects; {code} scala> spark.range(10).selectExpr("id + 3").sample(0.5).explain(true) // without this proposal == Analyzed Logical Plan == (id + 3): bigint Sample 0.0, 0.5, false, 3370873312340343855 +- Project [(id#0L + cast(3 as bigint)) AS (id + 3)#2L] +- Range (0, 10, step=1, splits=Some(4)) == Optimized Logical Plan == Sample 0.0, 0.5, false, 3370873312340343855 +- Project [(id#0L + 3) AS (id + 3)#2L] +- Range (0, 10, step=1, splits=Some(4)) // with this proposal == Optimized Logical Plan == Project [(id#0L + 3) AS (id + 3)#2L] +- Sample 0.0, 0.5, false, -6519017078291024113 +- Range (0, 10, step=1, splits=Some(4)) {code} > Pushes down Sample beneath deterministic Project > > > Key: SPARK-25340 > URL: https://issues.apache.org/jira/browse/SPARK-25340 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > If computations in Project are heavy (e.g., UDFs), it is useful to push down > sample nodes into deterministic projects; > {code} > scala> spark.range(10).selectExpr("id + 3").sample(0.5).explain(true) > // without this proposal > == Analyzed Logical Plan == > (id + 3): bigint > Sample 0.0, 0.5, false, 3370873312340343855 > +- Project [(id#0L + cast(3 as bigint)) AS (id + 3)#2L] >+- Range (0, 10, step=1, splits=Some(4)) > == Optimized Logical Plan == > Sample 0.0, 0.5, false, 3370873312340343855 > +- Project [(id#0L + 3) AS (id + 3)#2L] >+- Range (0, 10, step=1, splits=Some(4)) > // with this proposal > == Optimized Logical Plan == > Project [(id#0L + 3) AS (id + 3)#2L] > +- Sample 0.0, 0.5, false, -6519017078291024113 >+- Range (0, 10, step=1, splits=Some(4)) > {code} > POC: https://github.com/apache/spark/compare/master...maropu:SamplePushdown -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25340) Pushes down Sample beneath deterministic Project
[ https://issues.apache.org/jira/browse/SPARK-25340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604390#comment-16604390 ] Takeshi Yamamuro commented on SPARK-25340: -- Is this feasible? [~smilegator] > Pushes down Sample beneath deterministic Project > > > Key: SPARK-25340 > URL: https://issues.apache.org/jira/browse/SPARK-25340 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > If computations in Project are heavy (e.g., UDFs), it is useful to push down > sample nodes into deterministic projects; > {code} > scala> spark.range(10).selectExpr("id + 3").sample(0.5).explain(true) > // without this proposal > == Analyzed Logical Plan == > (id + 3): bigint > Sample 0.0, 0.5, false, 3370873312340343855 > +- Project [(id#0L + cast(3 as bigint)) AS (id + 3)#2L] >+- Range (0, 10, step=1, splits=Some(4)) > == Optimized Logical Plan == > Sample 0.0, 0.5, false, 3370873312340343855 > +- Project [(id#0L + 3) AS (id + 3)#2L] >+- Range (0, 10, step=1, splits=Some(4)) > // with this proposal > == Optimized Logical Plan == > Project [(id#0L + 3) AS (id + 3)#2L] > +- Sample 0.0, 0.5, false, -6519017078291024113 >+- Range (0, 10, step=1, splits=Some(4)) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25340) Pushes down Sample beneath deterministic Project
Takeshi Yamamuro created SPARK-25340: Summary: Pushes down Sample beneath deterministic Project Key: SPARK-25340 URL: https://issues.apache.org/jira/browse/SPARK-25340 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.3.1 Reporter: Takeshi Yamamuro If computations in Project are heavy (e.g., UDFs), it is useful to push down sample nodes into deterministic projects; {code} scala> spark.range(10).selectExpr("id + 3").sample(0.5).explain(true) // without this proposal == Analyzed Logical Plan == (id + 3): bigint Sample 0.0, 0.5, false, 3370873312340343855 +- Project [(id#0L + cast(3 as bigint)) AS (id + 3)#2L] +- Range (0, 10, step=1, splits=Some(4)) == Optimized Logical Plan == Sample 0.0, 0.5, false, 3370873312340343855 +- Project [(id#0L + 3) AS (id + 3)#2L] +- Range (0, 10, step=1, splits=Some(4)) // with this proposal == Optimized Logical Plan == Project [(id#0L + 3) AS (id + 3)#2L] +- Sample 0.0, 0.5, false, -6519017078291024113 +- Range (0, 10, step=1, splits=Some(4)) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
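A rough sketch of what the rewrite could look like as a Catalyst rule; the linked POC branch is the authoritative version, and the pattern below deliberately avoids depending on Sample's exact constructor:

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Sample}
import org.apache.spark.sql.catalyst.rules.Rule

// Push Sample below a Project whose expressions are all deterministic, so heavy
// projections (e.g. UDFs) are evaluated only on the sampled rows.
object PushSampleThroughProject extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case s: Sample => s.child match {
      case p @ Project(projectList, grandChild) if projectList.forall(_.deterministic) =>
        p.copy(child = s.withNewChildren(Seq(grandChild)))
      case _ => s
    }
  }
}
{code}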
[jira] [Commented] (SPARK-23443) Spark with Glue as external catalog
[ https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604331#comment-16604331 ] t oo commented on SPARK-23443: -- [~ameen.tayy...@gmail.com] any luck with the first PR? > Spark with Glue as external catalog > --- > > Key: SPARK-23443 > URL: https://issues.apache.org/jira/browse/SPARK-23443 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Ameen Tayyebi >Priority: Major > > AWS Glue Catalog is an external Hive metastore backed by a web service. It > allows permanent storage of catalog data for BigData use cases. > To find out more information about AWS Glue, please consult: > * AWS Glue - [https://aws.amazon.com/glue/] > * Using Glue as a Metastore catalog for Spark - > [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html] > Today, the integration of Glue and Spark is through the Hive layer. Glue > implements the IMetaStore interface of Hive and for installations of Spark > that contain Hive, Glue can be used as the metastore. > The feature set that Glue supports does not align 1-1 with the set of > features that the latest version of Spark supports. For example, Glue > interface supports more advanced partition pruning than the latest version of > Hive embedded in Spark. > To enable a more natural integration with Spark and to allow leveraging > latest features of Glue, without being coupled to Hive, a direct integration > through Spark's own Catalog API is proposed. This Jira tracks this work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence
[ https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604298#comment-16604298 ] Mathew commented on SPARK-24632: [~bryanc] that line is only there because we use the Java object name to get the name of the Python object to read; it is the bane of my life when developing external transformer packages and enabling them to support pipeline persistence. > Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers > for persistence > -- > > Key: SPARK-24632 > URL: https://issues.apache.org/jira/browse/SPARK-24632 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > This is a follow-up for [SPARK-17025], which allowed users to implement > Python PipelineStages in 3rd-party libraries, include them in Pipelines, and > use Pipeline persistence. This task is to make it easier for 3rd-party > libraries to have PipelineStages written in Java and then to use pyspark.ml > abstractions to create wrappers around those Java classes. This is currently > possible, except that users hit bugs around persistence. > I spent a bit thinking about this and wrote up thoughts and a proposal in the > doc linked below. Summary of proposal: > Require that 3rd-party libraries with Java classes with Python wrappers > implement a trait which provides the corresponding Python classpath in some > field: > {code} > trait PythonWrappable { > def pythonClassPath: String = … > } > MyJavaType extends PythonWrappable > {code} > This will not be required for MLlib wrappers, which we can handle specially. > One issue for this task will be that we may have trouble writing unit tests. > They would ideally test a Java class + Python wrapper class pair sitting > outside of pyspark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24360) Support Hive 3.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-24360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604292#comment-16604292 ] t oo commented on SPARK-24360: -- [~dongjoon] Can this be merged to master? Also, can hive3.1 support be added easily? > Support Hive 3.0 metastore > -- > > Key: SPARK-24360 > URL: https://issues.apache.org/jira/browse/SPARK-24360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > Hive 3.0.0 is released. This issue aims to support Hive Metastore 3.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25279) Throw exception: zzcclp java.io.NotSerializableException: org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc
[ https://issues.apache.org/jira/browse/SPARK-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604254#comment-16604254 ] Liang-Chi Hsieh edited comment on SPARK-25279 at 9/5/18 10:34 AM: -- The paste mode in REPL wraps pasted code as a single object and so the `TypedColumn` object is wrapped together. `TypedColumn` is not serializable. Seems to me this shouldn't be as a bug in Spark. was (Author: viirya): The paste mode in REPL wraps pasted code as a single object and so the `TypedColumn` object is wrapped together. `TypedColumn` is not serializable. > Throw exception: zzcclp java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc > --- > > Key: SPARK-25279 > URL: https://issues.apache.org/jira/browse/SPARK-25279 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1 >Reporter: Zhichao Zhang >Priority: Minor > > Hi dev: > I am using Spark-Shell to run the example which is in section > '[http://spark.apache.org/docs/2.2.2/sql-programming-guide.html#type-safe-user-defined-aggregate-functions'], > > and there is an error: > {code:java} > Caused by: java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn > Serialization stack: > - object not serializable (class: org.apache.spark.sql.TypedColumn, > value: > myaverage() AS `average_salary`) > - field (class: $iw, name: averageSalary, type: class > org.apache.spark.sql.TypedColumn) > - object (class $iw, $iw@4b2f8ae9) > - field (class: MyAverage$, name: $outer, type: class $iw) > - object (class MyAverage$, MyAverage$@2be41d90) > - field (class: > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator) > - object (class > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > MyAverage(Employee)) > - field (class: > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > name: aggregateFunction, type: class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction) > - object (class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class Employee)), > Some(class Employee), Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)) > - writeObject data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.List$SerializationProxy, > scala.collection.immutable.List$SerializationProxy@5e92c46f) > - writeReplace data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.$colon$colon, > List(partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0))) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: > 
aggregateExpressions, type: interface scala.collection.Seq) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, > ObjectHashAggregate(keys=[], > functions=[partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)], output=[buf#37]) > +- *FileScan json [name#8,salary#9L] Batched: false, Format: JSON, Location: > InMemoryFileIndex[file:/opt/spark2/examples/src/main/resources/employees.json], > > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > ) > - field (class: > org.apache.spark.sql.execution.aggregate.Ob
[jira] [Commented] (SPARK-25279) Throw exception: zzcclp java.io.NotSerializableException: org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc
[ https://issues.apache.org/jira/browse/SPARK-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604254#comment-16604254 ] Liang-Chi Hsieh commented on SPARK-25279: - The paste mode in REPL wraps pasted code as a single object and so the `TypedColumn` object is wrapped together. `TypedColumn` is not serializable. > Throw exception: zzcclp java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc > --- > > Key: SPARK-25279 > URL: https://issues.apache.org/jira/browse/SPARK-25279 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1 >Reporter: Zhichao Zhang >Priority: Minor > > Hi dev: > I am using Spark-Shell to run the example which is in section > '[http://spark.apache.org/docs/2.2.2/sql-programming-guide.html#type-safe-user-defined-aggregate-functions'], > > and there is an error: > {code:java} > Caused by: java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn > Serialization stack: > - object not serializable (class: org.apache.spark.sql.TypedColumn, > value: > myaverage() AS `average_salary`) > - field (class: $iw, name: averageSalary, type: class > org.apache.spark.sql.TypedColumn) > - object (class $iw, $iw@4b2f8ae9) > - field (class: MyAverage$, name: $outer, type: class $iw) > - object (class MyAverage$, MyAverage$@2be41d90) > - field (class: > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator) > - object (class > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > MyAverage(Employee)) > - field (class: > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > name: aggregateFunction, type: class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction) > - object (class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class Employee)), > Some(class Employee), Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)) > - writeObject data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.List$SerializationProxy, > scala.collection.immutable.List$SerializationProxy@5e92c46f) > - writeReplace data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.$colon$colon, > List(partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0))) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: > aggregateExpressions, type: interface scala.collection.Seq) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, > ObjectHashAggregate(keys=[], > functions=[partial_myaverage(MyAverage$@2be41d90, 
Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)], output=[buf#37]) > +- *FileScan json [name#8,salary#9L] Batched: false, Format: JSON, Location: > InMemoryFileIndex[file:/opt/spark2/examples/src/main/resources/employees.json], > > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > ) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > name: $outer, type: class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > ) >
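For readers hitting this, a self-contained version of the documentation example (the type-safe Aggregator) that avoids the trap described above: build the TypedColumn at the call site instead of parking it in a REPL val that the task closure then drags in. The file path matches the docs example; adjust as needed:

{code:scala}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
  def zero: Average = Average(0L, 0L)
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary; buffer.count += 1; buffer
  }
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum; b1.count += b2.count; b1
  }
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  def bufferEncoder: Encoder[Average] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val spark = SparkSession.builder().appName("typed-udaf").getOrCreate()
import spark.implicits._

val ds = spark.read.json("examples/src/main/resources/employees.json").as[Employee]
// Building the TypedColumn inline keeps it out of the REPL's wrapper object:
ds.select(MyAverage.toColumn.name("average_salary")).show()
{code}

Packaging the aggregator in a jar on the classpath (instead of pasting it into the shell) also sidesteps the wrapping entirely.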
[jira] [Assigned] (SPARK-24889) dataset.unpersist() doesn't update storage memory stats
[ https://issues.apache.org/jira/browse/SPARK-24889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24889: Assignee: Apache Spark > dataset.unpersist() doesn't update storage memory stats > --- > > Key: SPARK-24889 > URL: https://issues.apache.org/jira/browse/SPARK-24889 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Yuri Bogomolov >Assignee: Apache Spark >Priority: Major > Attachments: image-2018-07-23-10-53-58-474.png > > > Steps to reproduce: > 1) Start a Spark cluster, and check the storage memory value from the Spark > Web UI "Executors" tab (it should be equal to zero if you just started) > 2) Run: > {code:java} > val df = spark.sqlContext.range(1, 10) > df.cache() > df.count() > df.unpersist(true){code} > 3) Check the storage memory value again, now it's equal to 1GB > > Looks like the memory is actually released, but stats aren't updated. This > issue makes cluster management more complicated. > !image-2018-07-23-10-53-58-474.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24889) dataset.unpersist() doesn't update storage memory stats
[ https://issues.apache.org/jira/browse/SPARK-24889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24889: Assignee: (was: Apache Spark) > dataset.unpersist() doesn't update storage memory stats > --- > > Key: SPARK-24889 > URL: https://issues.apache.org/jira/browse/SPARK-24889 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Yuri Bogomolov >Priority: Major > Attachments: image-2018-07-23-10-53-58-474.png > > > Steps to reproduce: > 1) Start a Spark cluster, and check the storage memory value from the Spark > Web UI "Executors" tab (it should be equal to zero if you just started) > 2) Run: > {code:java} > val df = spark.sqlContext.range(1, 10) > df.cache() > df.count() > df.unpersist(true){code} > 3) Check the storage memory value again, now it's equal to 1GB > > Looks like the memory is actually released, but stats aren't updated. This > issue makes cluster management more complicated. > !image-2018-07-23-10-53-58-474.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24889) dataset.unpersist() doesn't update storage memory stats
[ https://issues.apache.org/jira/browse/SPARK-24889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604225#comment-16604225 ] Apache Spark commented on SPARK-24889: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/22341 > dataset.unpersist() doesn't update storage memory stats > --- > > Key: SPARK-24889 > URL: https://issues.apache.org/jira/browse/SPARK-24889 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Yuri Bogomolov >Priority: Major > Attachments: image-2018-07-23-10-53-58-474.png > > > Steps to reproduce: > 1) Start a Spark cluster, and check the storage memory value from the Spark > Web UI "Executors" tab (it should be equal to zero if you just started) > 2) Run: > {code:java} > val df = spark.sqlContext.range(1, 10) > df.cache() > df.count() > df.unpersist(true){code} > 3) Check the storage memory value again, now it's equal to 1GB > > Looks like the memory is actually released, but stats aren't updated. This > issue makes cluster management more complicated. > !image-2018-07-23-10-53-58-474.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
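The storage memory tracked by the block manager master can also be inspected programmatically, which makes the stale-UI symptom easy to demonstrate: these live numbers may disagree with the Executors tab once its stats go stale. A small sketch:

{code:scala}
// Per block manager (host:port): (maximum storage memory, remaining storage memory).
spark.sparkContext.getExecutorMemoryStatus.foreach { case (hostPort, (maxMem, remainingMem)) =>
  println(s"$hostPort used=${maxMem - remainingMem} max=$maxMem")
}
{code}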
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604176#comment-16604176 ] Hyukjin Kwon commented on SPARK-18112: -- Can you post reproducer step by step? did you set {{spark.sql.hive.metastore.version}} and jar properly? > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
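For reference, the two settings mentioned above are normally supplied before the session first touches Hive. A sketch; the version string and jar path are placeholders for whatever Hive 2.x build is actually installed:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-2x-metastore")
  .config("spark.sql.hive.metastore.version", "2.1.1")            // placeholder version
  .config("spark.sql.hive.metastore.jars", "/path/to/hive/lib/*") // placeholder classpath
  .enableHiveSupport()
  .getOrCreate()
{code}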
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604167#comment-16604167 ] t oo commented on SPARK-18112: -- [~hyukjin.kwon] [~srowen] Can this ticket be re-opened? This code is still in master, as mentioned in the comments above. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive 2.0 was released in February 2016, and Hive 2.0.1 and > Hive 2.1.0 have also been available for a long time, but so far Spark only > supports reading Hive metastore data from Hive 1.2.1 and older. Since > Hive 2.x has many bug fixes and performance improvements, it is better and > urgent to upgrade to support Hive 2.x. > Failure when loading data from a Hive 2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604162#comment-16604162 ] t oo commented on SPARK-13446: -- [~cloud_fan] I am hitting the same issue as [~elgalu] :( > Spark need to support reading data from Hive 2.0.0 metastore > > > Key: SPARK-13446 > URL: https://issues.apache.org/jira/browse/SPARK-13446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Lifeng Wang >Assignee: Xiao Li >Priority: Major > Fix For: 2.2.0 > > > Spark provides the HiveContext class to read data from the Hive metastore directly, > but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been > released, it would be better to upgrade to support Hive 2.0.0. > {noformat} > 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI > thrift://hsw-node13:9083 > 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current > connections: 1 > 16/02/23 02:35:02 INFO metastore: Connected to metastore. > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at > org.apache.spark.sql.hive.HiveContext$$anon$1.<init>(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17159) Improve FileInputDStream.findNewFiles list performance
[ https://issues.apache.org/jira/browse/SPARK-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604155#comment-16604155 ] Apache Spark commented on SPARK-17159: -- User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/22339 > Improve FileInputDStream.findNewFiles list performance > -- > > Key: SPARK-17159 > URL: https://issues.apache.org/jira/browse/SPARK-17159 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.0.0 > Environment: spark against object stores >Reporter: Steve Loughran >Priority: Minor > > {{FileInputDStream.findNewFiles()}} is doing a globStatus with a filter that > calls getFileStatus() on every file, then takes the output and does listStatus() > on it. > This is going to suffer on object stores, as directory listing and getFileStatus calls > are so expensive. It's clear this is a problem, as the method has code to > detect timeouts in the window and warn of problems. > It should be possible to make this faster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
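To make the cost concrete, here is a sketch (Hadoop {{FileSystem}} API; the path and time window are hypothetical) of the pattern the issue describes, a glob whose filter issues one {{getFileStatus()}} call per path, next to a single listing that filters on the metadata it already returns. The extra round trip per file is what hurts on object stores.
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path, PathFilter}

val dir = new Path("s3a://some-bucket/incoming")         // placeholder location
val fs: FileSystem = dir.getFileSystem(new Configuration())
val windowStartMs = System.currentTimeMillis() - 60000L  // hypothetical batch window

// Pattern described in the issue: the filter triggers an extra
// getFileStatus() call for every path the glob matches.
val viaPerFileStatus: Array[FileStatus] = fs.globStatus(
  new Path(dir, "*"),
  new PathFilter {
    override def accept(path: Path): Boolean =
      fs.getFileStatus(path).getModificationTime >= windowStartMs
  })

// Cheaper shape: one listStatus() call, then filter on the FileStatus
// objects that listing already carries, with no per-file round trips.
val viaSingleListing: Array[FileStatus] = fs
  .listStatus(dir)
  .filter(st => !st.isDirectory && st.getModificationTime >= windowStartMs)
{code}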
[jira] [Commented] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasou
[ https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604154#comment-16604154 ] Apache Spark commented on SPARK-25337: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/22340 > HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: > org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;) > > > Key: SPARK-25337 > URL: https://issues.apache.org/jira/browse/SPARK-25337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Priority: Major > > Observed in the Scala 2.12 pull request builder consistently now. I don't see > this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of > hard to see how. > CC [~sadhen] > {code:java} > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** > Exception encountered when invoking run on a nested suite - spark-submit > returned with exit code 1. > Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > > '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' > ... > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/session.py", > line 545, in sql > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > 2018-09-04 20:00:04.95 - stdout> py4j.protocol.Py4JJavaError: An error > occurred while calling o27.sql. > 2018-09-04 20:00:04.95 - stdout> : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasour
[ https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25337: Assignee: Apache Spark > HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: > org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;) > > > Key: SPARK-25337 > URL: https://issues.apache.org/jira/browse/SPARK-25337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Major > > Observed in the Scala 2.12 pull request builder consistently now. I don't see > this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of > hard to see how. > CC [~sadhen] > {code:java} > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** > Exception encountered when invoking run on a nested suite - spark-submit > returned with exit code 1. > Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > > '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' > ... > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/session.py", > line 545, in sql > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > 2018-09-04 20:00:04.95 - stdout> py4j.protocol.Py4JJavaError: An error > occurred while calling o27.sql. > 2018-09-04 20:00:04.95 - stdout> : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasour
[ https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25337: Assignee: (was: Apache Spark) > HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: > org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;) > > > Key: SPARK-25337 > URL: https://issues.apache.org/jira/browse/SPARK-25337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Priority: Major > > Observed in the Scala 2.12 pull request builder consistently now. I don't see > this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of > hard to see how. > CC [~sadhen] > {code:java} > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** > Exception encountered when invoking run on a nested suite - spark-submit > returned with exit code 1. > Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > > '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' > ... > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/session.py", > line 545, in sql > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > 2018-09-04 20:00:04.95 - stdout> py4j.protocol.Py4JJavaError: An error > occurred while calling o27.sql. > 2018-09-04 20:00:04.95 - stdout> : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604146#comment-16604146 ] Apache Spark commented on SPARK-25317: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/22338 > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > eThere is a performance regression when calculating hash code for UTF8String: > {code:java} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code:java} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > In my laptop, the result for master vs 2.3 is: 120 us vs 40 us -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
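If anyone wants to rerun the comparison without the copy-pasted calls, the same micro-benchmark can be written with an inner loop. This is only a restatement of the snippet above (set the inner count to the number of unrolled calls there), and the extra loop counter adds a little overhead compared to the fully unrolled version.
{code:scala}
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String

val str = UTF8String.fromString("b" * 10001)
val numIter = 10
val callsPerIter = 30  // set to the number of unrolled calls in the snippet above

val start = System.nanoTime()
var i = 0
while (i < numIter) {
  var j = 0
  while (j < callsPerIter) {
    Murmur3_x86_32.hashUTF8String(str, 0)  // needs the 2.3 shim shown in the description
    j += 1
  }
  i += 1
}
println(s"duration ${(System.nanoTime() - start) / 1000 / numIter} us")
{code}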
[jira] [Assigned] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25317: Assignee: Apache Spark > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Blocker > > eThere is a performance regression when calculating hash code for UTF8String: > {code:java} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code:java} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > In my laptop, the result for master vs 2.3 is: 120 us vs 40 us -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org