[jira] [Commented] (SPARK-22860) Spark workers log ssl passwords passed to the executors

2019-02-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770842#comment-16770842
 ] 

Apache Spark commented on SPARK-22860:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/23820

> Spark workers log ssl passwords passed to the executors
> ---
>
> Key: SPARK-22860
> URL: https://issues.apache.org/jira/browse/SPARK-22860
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Felix K.
>Priority: Major
>
> The workers log the spark.ssl.keyStorePassword and 
> spark.ssl.trustStorePassword values passed via the CLI to the executor 
> processes. The ExecutorRunner should redact these passwords so that they do 
> not appear in the worker's log files at INFO level. In this example, you can 
> see my 'SuperSecretPassword' in a worker log:
> {code}
> 17/12/08 08:04:12 INFO ExecutorRunner: Launch command: 
> "/global/myapp/oem/jdk/bin/java" "-cp" 
> "/global/myapp/application/myapp_software/thing_loader_lib/core-repository-model-zzz-1.2.3-SNAPSHOT.jar
> [...]
> :/global/myapp/application/spark-2.1.1-bin-hadoop2.7/jars/*" "-Xmx16384M" 
> "-Dspark.authenticate.enableSaslEncryption=true" 
> "-Dspark.ssl.keyStorePassword=SuperSecretPassword" 
> "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks" 
> "-Dspark.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" 
> "-Dspark.ssl.enabled=true" "-Dspark.driver.port=39927" 
> "-Dspark.ssl.protocol=TLS" 
> "-Dspark.ssl.trustStorePassword=SuperSecretPassword" 
> "-Dspark.authenticate=true" "-Dmyapp_IMPORT_DATE=2017-10-30" 
> "-Dmyapp.config.directory=/global/myapp/application/config" 
> "-Dsolr.httpclient.builder.factory=com.company.myapp.loader.auth.LoaderConfigSparkSolrBasicAuthConfigurer"
>  
> "-Djavax.net.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks"
>  "-XX:+UseG1GC" "-XX:+UseStringDeduplication" 
> "-Dthings.loader.export.zzz_files=false" 
> "-Dlog4j.configuration=file:/global/myapp/application/config/spark-executor-log4j.properties"
>  "-XX:+HeapDumpOnOutOfMemoryError" "-XX:+UseStringDeduplication" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@192.168.0.1:39927" "--executor-id" "2" 
> "--hostname" "192.168.0.1" "--cores" "4" "--app-id" "app-20171208080412-" 
> "--worker-url" "spark://Worker@192.168.0.1:59530"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26909) use unsafeRow.hashCode() as hash value in HashAggregate

2019-02-18 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-26909:
--
Description: 
This is a follow-up to PR #21149.

The new approach uses unsafeRow.hashCode() as the hash value in HashAggregate.
The unsafe row includes the [null bit set] etc., so the result should be 
different and we don't need the weird `48`.

  was:
This is a follow-up to PR #21149.

The new approach uses unsafeRow.hashCode() as the hash value in HashAggregate.
The unsafe row includes the [null bit set] etc., so the result should be 
different, and we don't need the weird `48` either.


> use unsafeRow.hashCode() as hash value in HashAggregate
> ---
>
> Key: SPARK-26909
> URL: https://issues.apache.org/jira/browse/SPARK-26909
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: yucai
>Priority: Major
>
> This is a follow-up to PR #21149.
> The new approach uses unsafeRow.hashCode() as the hash value in HashAggregate.
> The unsafe row includes the [null bit set] etc., so the result should be 
> different and we don't need the weird `48`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22860) Spark workers log ssl passwords passed to the executors

2019-02-18 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770844#comment-16770844
 ] 

Jungtaek Lim commented on SPARK-22860:
--

I'm proposing to just redact these values in the log message first, because the 
complexity of the two approaches is quite different. Given that we need to deal 
with CLI arguments, I had to create my own logic instead of taking up the 
existing PR. (The existing PR simply removed them from the arguments, and I 
wonder whether that works well.) Sorry about that.
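For illustration only, a minimal sketch of the kind of masking being proposed here (my own sketch, not the patch in the pull request referenced in this thread); the key-matching regex is an assumption covering keys that contain "password" or "secret":

{code:scala}
// Hypothetical helper, not Spark's implementation: mask sensitive values in the
// executor launch command before it is logged at INFO level.
def redactCommand(command: Seq[String]): Seq[String] = {
  val sensitive = "(?i)(-D[\\w.]*(?:password|secret)[\\w.]*=)(.*)".r
  command.map {
    case sensitive(key, _) => key + "*********(redacted)"
    case other             => other
  }
}

// redactCommand(Seq("-Dspark.ssl.keyStorePassword=SuperSecretPassword"))
//   => Seq("-Dspark.ssl.keyStorePassword=*********(redacted)")
{code}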

> Spark workers log ssl passwords passed to the executors
> ---
>
> Key: SPARK-22860
> URL: https://issues.apache.org/jira/browse/SPARK-22860
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Felix K.
>Priority: Major
>
> The workers log the spark.ssl.keyStorePassword and 
> spark.ssl.trustStorePassword values passed via the CLI to the executor 
> processes. The ExecutorRunner should redact these passwords so that they do 
> not appear in the worker's log files at INFO level. In this example, you can 
> see my 'SuperSecretPassword' in a worker log:
> {code}
> 17/12/08 08:04:12 INFO ExecutorRunner: Launch command: 
> "/global/myapp/oem/jdk/bin/java" "-cp" 
> "/global/myapp/application/myapp_software/thing_loader_lib/core-repository-model-zzz-1.2.3-SNAPSHOT.jar
> [...]
> :/global/myapp/application/spark-2.1.1-bin-hadoop2.7/jars/*" "-Xmx16384M" 
> "-Dspark.authenticate.enableSaslEncryption=true" 
> "-Dspark.ssl.keyStorePassword=SuperSecretPassword" 
> "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks" 
> "-Dspark.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" 
> "-Dspark.ssl.enabled=true" "-Dspark.driver.port=39927" 
> "-Dspark.ssl.protocol=TLS" 
> "-Dspark.ssl.trustStorePassword=SuperSecretPassword" 
> "-Dspark.authenticate=true" "-Dmyapp_IMPORT_DATE=2017-10-30" 
> "-Dmyapp.config.directory=/global/myapp/application/config" 
> "-Dsolr.httpclient.builder.factory=com.company.myapp.loader.auth.LoaderConfigSparkSolrBasicAuthConfigurer"
>  
> "-Djavax.net.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks"
>  "-XX:+UseG1GC" "-XX:+UseStringDeduplication" 
> "-Dthings.loader.export.zzz_files=false" 
> "-Dlog4j.configuration=file:/global/myapp/application/config/spark-executor-log4j.properties"
>  "-XX:+HeapDumpOnOutOfMemoryError" "-XX:+UseStringDeduplication" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@192.168.0.1:39927" "--executor-id" "2" 
> "--hostname" "192.168.0.1" "--cores" "4" "--app-id" "app-20171208080412-" 
> "--worker-url" "spark://Worker@192.168.0.1:59530"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26909) use unsafeRow.hashCode() as hash value in HashAggregate

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26909:


Assignee: (was: Apache Spark)

> use unsafeRow.hashCode() as hash value in HashAggregate
> ---
>
> Key: SPARK-26909
> URL: https://issues.apache.org/jira/browse/SPARK-26909
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: yucai
>Priority: Major
>
> This is a follow-up to PR #21149.
> The new approach uses unsafeRow.hashCode() as the hash value in HashAggregate.
> The unsafe row includes the [null bit set] etc., so the result should be 
> different and we don't need the weird `48`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26909) use unsafeRow.hashCode() as hash value in HashAggregate

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26909:


Assignee: Apache Spark

> use unsafeRow.hashCode() as hash value in HashAggregate
> ---
>
> Key: SPARK-26909
> URL: https://issues.apache.org/jira/browse/SPARK-26909
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: yucai
>Assignee: Apache Spark
>Priority: Major
>
> This is a follow-up to PR #21149.
> The new approach uses unsafeRow.hashCode() as the hash value in HashAggregate.
> The unsafe row includes the [null bit set] etc., so the result should be 
> different and we don't need the weird `48`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26909) use unsafeRow.hashCode() as hash value in HashAggregate

2019-02-18 Thread yucai (JIRA)
yucai created SPARK-26909:
-

 Summary: use unsafeRow.hashCode() as hash value in HashAggregate
 Key: SPARK-26909
 URL: https://issues.apache.org/jira/browse/SPARK-26909
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: yucai


This is a follow-up to PR #21149.

The new approach uses unsafeRow.hashCode() as the hash value in HashAggregate.
The unsafe row includes the [null bit set] etc., so the result should be 
different, and we don't need the weird `48` either.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22860) Spark workers log ssl passwords passed to the executors

2019-02-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770843#comment-16770843
 ] 

Apache Spark commented on SPARK-22860:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/23820

> Spark workers log ssl passwords passed to the executors
> ---
>
> Key: SPARK-22860
> URL: https://issues.apache.org/jira/browse/SPARK-22860
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Felix K.
>Priority: Major
>
> The workers log the spark.ssl.keyStorePassword and 
> spark.ssl.trustStorePassword values passed via the CLI to the executor 
> processes. The ExecutorRunner should redact these passwords so that they do 
> not appear in the worker's log files at INFO level. In this example, you can 
> see my 'SuperSecretPassword' in a worker log:
> {code}
> 17/12/08 08:04:12 INFO ExecutorRunner: Launch command: 
> "/global/myapp/oem/jdk/bin/java" "-cp" 
> "/global/myapp/application/myapp_software/thing_loader_lib/core-repository-model-zzz-1.2.3-SNAPSHOT.jar
> [...]
> :/global/myapp/application/spark-2.1.1-bin-hadoop2.7/jars/*" "-Xmx16384M" 
> "-Dspark.authenticate.enableSaslEncryption=true" 
> "-Dspark.ssl.keyStorePassword=SuperSecretPassword" 
> "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks" 
> "-Dspark.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" 
> "-Dspark.ssl.enabled=true" "-Dspark.driver.port=39927" 
> "-Dspark.ssl.protocol=TLS" 
> "-Dspark.ssl.trustStorePassword=SuperSecretPassword" 
> "-Dspark.authenticate=true" "-Dmyapp_IMPORT_DATE=2017-10-30" 
> "-Dmyapp.config.directory=/global/myapp/application/config" 
> "-Dsolr.httpclient.builder.factory=com.company.myapp.loader.auth.LoaderConfigSparkSolrBasicAuthConfigurer"
>  
> "-Djavax.net.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks"
>  "-XX:+UseG1GC" "-XX:+UseStringDeduplication" 
> "-Dthings.loader.export.zzz_files=false" 
> "-Dlog4j.configuration=file:/global/myapp/application/config/spark-executor-log4j.properties"
>  "-XX:+HeapDumpOnOutOfMemoryError" "-XX:+UseStringDeduplication" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@192.168.0.1:39927" "--executor-id" "2" 
> "--hostname" "192.168.0.1" "--cores" "4" "--app-id" "app-20171208080412-" 
> "--worker-url" "spark://Worker@192.168.0.1:59530"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26880) dataDF.queryExecution.toRdd corrupt rows

2019-02-18 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770866#comment-16770866
 ] 

Jungtaek Lim edited comment on SPARK-26880 at 2/18/19 8:40 AM:
---

Just submitted a PR to add a note on the proper usage of "QueryExecution.toRdd":
https://github.com/apache/spark/pull/23822

Given that this issue is filed as a bug, I submitted the PR as MINOR and did not 
link it to this issue. Marking this issue as 'Resolved' would give a false alarm; 
would marking it as "Invalid" or "Information Provided" be OK instead?


was (Author: kabhwan):
Just submitted a PR to add a note on the proper usage of "QueryExecution.toRdd":
https://github.com/apache/spark/pull/23822

Given that this issue is filed as a bug, I submitted the PR as MINOR and did not 
link it to this issue. Marking this issue as 'Resolved' would give a false alarm.

> dataDF.queryExecution.toRdd corrupt rows
> 
>
> Key: SPARK-26880
> URL: https://issues.apache.org/jira/browse/SPARK-26880
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Grant Henke
>Priority: Major
>
> I have seen a simple case where the InternalRows returned by 
> `queryExecution.toRdd` are corrupt. Some rows are duplicated while others are 
> missing. 
> This simple test illustrates the issue:
> {code}
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> import org.apache.spark.sql.types.IntegerType
> import org.apache.spark.sql.types.StringType
> import org.apache.spark.sql.types.StructField
> import org.apache.spark.sql.types.StructType
> import org.junit.Assert._
> import org.junit.Test
> import org.scalatest.Matchers
> import org.scalatest.junit.JUnitSuite
> import org.slf4j.Logger
> import org.slf4j.LoggerFactory
> class SparkTest extends JUnitSuite with Matchers {
>   val Log: Logger = LoggerFactory.getLogger(getClass)
>   @Test
>   def testSparkRowCorruption(): Unit = {
> val conf = new SparkConf()
>   .setMaster("local[*]")
>   .setAppName("test")
>   .set("spark.ui.enabled", "false")
> val ss = SparkSession.builder().config(conf).getOrCreate()
> // Setup a DataFrame for testing.
> val data = Seq(
>   Row.fromSeq(Seq(0, "0")),
>   Row.fromSeq(Seq(25, "25")),
>   Row.fromSeq(Seq(50, "50")),
>   Row.fromSeq(Seq(75, "75")),
>   Row.fromSeq(Seq(99, "99")),
>   Row.fromSeq(Seq(100, "100")),
>   Row.fromSeq(Seq(101, "101")),
>   Row.fromSeq(Seq(125, "125")),
>   Row.fromSeq(Seq(150, "150")),
>   Row.fromSeq(Seq(175, "175")),
>   Row.fromSeq(Seq(199, "199"))
> )
> val dataRDD = ss.sparkContext.parallelize(data)
> val schema = StructType(
>   Seq(
> StructField("key", IntegerType),
> StructField("value", StringType)
>   ))
> val dataDF = ss.sqlContext.createDataFrame(dataRDD, schema)
> // Convert to an RDD.
> val rdd = dataDF.queryExecution.toRdd
> 
> // Collect the data to compare.
> val resultData = rdd.collect
> resultData.foreach { row =>
>   // Log for visualizing the corruption.
>   Log.error(s"${row.getInt(0)}")
> }
> // Ensure the keys in the original data and resulting data match.
> val dataKeys = data.map(_.getInt(0)).toSet
> val resultKeys = resultData.map(_.getInt(0)).toSet
> assertEquals(dataKeys, resultKeys)
>   }
> }
> {code}
> That test fails with the following:
> {noformat}
> 10:38:26.967 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 0
> 10:38:26.967 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 25
> 10:38:26.967 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 75
> 10:38:26.968 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 75
> 10:38:26.968 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 99
> 10:38:26.968 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 100
> 10:38:26.969 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 125
> 10:38:26.969 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 125
> 10:38:26.969 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 150
> 10:38:26.970 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 199
> 10:38:26.970 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 199
> expected:<Set(101, 0, 25, 125, 150, 50, 199, 175, 99, 75, 100)> but 
> was:<Set(0, 25, 125, 150, 199, 99, 75, 100)>
> Expected :Set(101, 0, 25, 125, 150, 50, 199, 175, 99, 75, 100)
> Actual   :Set(0, 25, 125, 150, 199, 99, 75, 100)
> {noformat}
> If I map from an InternalRow to a Row the issue goes away:
> {code}
> val rdd = dataDF.queryExecution.toRdd.mapPartitions { internalRows =>
>    val encoder = RowEncoder.apply(schema).resolveAndBind()
>    internalRows.map(encoder.fromRow)
> }
> {code}

[jira] [Commented] (SPARK-26880) dataDF.queryExecution.toRdd corrupt rows

2019-02-18 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770866#comment-16770866
 ] 

Jungtaek Lim commented on SPARK-26880:
--

Just submitted a PR to add a note on the proper usage of "QueryExecution.toRdd":
https://github.com/apache/spark/pull/23822

Given that this issue is filed as a bug, I submitted the PR as MINOR and did not 
link it to this issue. Marking this issue as 'Resolved' would give a false alarm.

> dataDF.queryExecution.toRdd corrupt rows
> 
>
> Key: SPARK-26880
> URL: https://issues.apache.org/jira/browse/SPARK-26880
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Grant Henke
>Priority: Major
>
> I have seen a simple case where the InternalRows returned by 
> `queryExecution.toRdd` are corrupt. Some rows are duplicated while others are 
> missing. 
> This simple test illustrates the issue:
> {code}
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> import org.apache.spark.sql.types.IntegerType
> import org.apache.spark.sql.types.StringType
> import org.apache.spark.sql.types.StructField
> import org.apache.spark.sql.types.StructType
> import org.junit.Assert._
> import org.junit.Test
> import org.scalatest.Matchers
> import org.scalatest.junit.JUnitSuite
> import org.slf4j.Logger
> import org.slf4j.LoggerFactory
> class SparkTest extends JUnitSuite with Matchers {
>   val Log: Logger = LoggerFactory.getLogger(getClass)
>   @Test
>   def testSparkRowCorruption(): Unit = {
> val conf = new SparkConf()
>   .setMaster("local[*]")
>   .setAppName("test")
>   .set("spark.ui.enabled", "false")
> val ss = SparkSession.builder().config(conf).getOrCreate()
> // Setup a DataFrame for testing.
> val data = Seq(
>   Row.fromSeq(Seq(0, "0")),
>   Row.fromSeq(Seq(25, "25")),
>   Row.fromSeq(Seq(50, "50")),
>   Row.fromSeq(Seq(75, "75")),
>   Row.fromSeq(Seq(99, "99")),
>   Row.fromSeq(Seq(100, "100")),
>   Row.fromSeq(Seq(101, "101")),
>   Row.fromSeq(Seq(125, "125")),
>   Row.fromSeq(Seq(150, "150")),
>   Row.fromSeq(Seq(175, "175")),
>   Row.fromSeq(Seq(199, "199"))
> )
> val dataRDD = ss.sparkContext.parallelize(data)
> val schema = StructType(
>   Seq(
> StructField("key", IntegerType),
> StructField("value", StringType)
>   ))
> val dataDF = ss.sqlContext.createDataFrame(dataRDD, schema)
> // Convert to an RDD.
> val rdd = dataDF.queryExecution.toRdd
> 
> // Collect the data to compare.
> val resultData = rdd.collect
> resultData.foreach { row =>
>   // Log for visualizing the corruption.
>   Log.error(s"${row.getInt(0)}")
> }
> // Ensure the keys in the original data and resulting data match.
> val dataKeys = data.map(_.getInt(0)).toSet
> val resultKeys = resultData.map(_.getInt(0)).toSet
> assertEquals(dataKeys, resultKeys)
>   }
> }
> {code}
> That test fails with the following:
> {noformat}
> 10:38:26.967 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 0
> 10:38:26.967 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 25
> 10:38:26.967 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 75
> 10:38:26.968 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 75
> 10:38:26.968 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 99
> 10:38:26.968 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 100
> 10:38:26.969 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 125
> 10:38:26.969 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 125
> 10:38:26.969 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 150
> 10:38:26.970 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 199
> 10:38:26.970 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 199
> expected:<Set(101, 0, 25, 125, 150, 50, 199, 175, 99, 75, 100)> but 
> was:<Set(0, 25, 125, 150, 199, 99, 75, 100)>
> Expected :Set(101, 0, 25, 125, 150, 50, 199, 175, 99, 75, 100)
> Actual   :Set(0, 25, 125, 150, 199, 99, 75, 100)
> {noformat}
> If I map from an InternalRow to a Row the issue goes away:
> {code}
> val rdd = dataDF.queryExecution.toRdd.mapPartitions { internalRows =>
>val encoder = RowEncoder.apply(schema).resolveAndBind()
>internalRows.map(encoder.fromRow)
> }
> {code}
> Converting with CatalystTypeConverters also appears to resolve the issue:
> {code}
> val rdd = dataDF.queryExecution.toRdd.mapPartitions { internalRows =>
>val typeConverter = CatalystTypeConverters.createToScalaConverter(schema)
>internalRows.map(ir => typeConverter(ir).asInstanceOf[Row])
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-

[jira] [Commented] (SPARK-26881) Scaling issue with Gramian computation for RowMatrix: too many results sent to driver

2019-02-18 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770881#comment-16770881
 ] 

Marco Gaido commented on SPARK-26881:
-

This may have been fixed/improved by SPARK-26228; could you try on current 
master? cc [~srowen]

> Scaling issue with Gramian computation for RowMatrix: too many results sent 
> to driver
> -
>
> Key: SPARK-26881
> URL: https://issues.apache.org/jira/browse/SPARK-26881
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Rafael RENAUDIN-AVINO
>Priority: Minor
>
> This issue hit me when running PCA on a large dataset (~1 billion rows, ~30k 
> columns).
> Computing the Gramian of a big RowMatrix is enough to reproduce the issue.
>  
> The problem arises in the treeAggregate phase of the Gramian matrix 
> computation: the results sent to the driver are enormous.
> A potential solution could be to replace the hard-coded depth (2) of the tree 
> aggregation with a heuristic computed from the number of partitions, the 
> driver max result size, and the memory size of the dense vectors being 
> aggregated, so that:
> (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size
> I have a potential fix ready (currently testing it at scale), but I'd like to 
> hear the community's opinion on such a fix to know whether it's worth 
> investing my time in a clean pull request.
>  
> Note that I only faced this issue with Spark 2.2, but I suspect it affects 
> later versions as well. 
>  
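For illustration, a rough sketch of the heuristic described in the issue (my own reading of the inequality, not the reporter's actual fix): solving nb_partitions^(1/depth) * dense_vector_size <= driver_max_result_size for depth gives depth >= log(nb_partitions) / log(driver_max_result_size / dense_vector_size).

{code:scala}
// Illustrative only, not the fix from this issue: pick a treeAggregate depth such
// that numPartitions^(1/depth) * vectorSizeBytes <= maxResultSizeBytes.
def gramianTreeDepth(numPartitions: Int, vectorSizeBytes: Long, maxResultSizeBytes: Long): Int = {
  val fanOutLimit = maxResultSizeBytes.toDouble / vectorSizeBytes  // partial results the driver can hold
  if (fanOutLimit <= 1.0 || fanOutLimit >= numPartitions) 2        // fall back to the current hard-coded default
  else math.max(2, math.ceil(math.log(numPartitions) / math.log(fanOutLimit)).toInt)
}

// Example: 1000 partitions, 64 MB dense vectors, 1 GB driver max result size
// => fanOutLimit = 16, depth = ceil(ln(1000) / ln(16)) = 3
{code}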



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26910) Re-release SparkR to CRAN

2019-02-18 Thread Michael Chirico (JIRA)
Michael Chirico created SPARK-26910:
---

 Summary: Re-release SparkR to CRAN
 Key: SPARK-26910
 URL: https://issues.apache.org/jira/browse/SPARK-26910
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Affects Versions: 2.4.0
Reporter: Michael Chirico


The logical successor to https://issues.apache.org/jira/browse/SPARK-15799

I don't see anything specifically tracking re-release in the Jira list. It 
would be helpful to have an issue tracking this to refer to as an outsider, as 
well as to document what the blockers are in case some outside help could be 
useful.
 * Is there a plan to re-release SparkR to CRAN?
 * What are the major blockers to doing so at the moment?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26353) Add typed aggregate functions(max/min) to the example module

2019-02-18 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26353:


Assignee: liuxian

> Add typed aggregate functions(max/min) to the example module
> 
>
> Key: SPARK-26353
> URL: https://issues.apache.org/jira/browse/SPARK-26353
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
>
> Add typed aggregate functions(max/min) to the example module.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26889) Fix timestamp type in Structured Streaming + Kafka Integration Guide

2019-02-18 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26889:


Assignee: Gabor Somogyi

> Fix timestamp type in Structured Streaming + Kafka Integration Guide
> 
>
> Key: SPARK-26889
> URL: https://issues.apache.org/jira/browse/SPARK-26889
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Trivial
>
> {code:java}
> $ spark-shell --packages 
> org.apache.spark:spark-sql-kafka-0-10_2.11:3.0.0-SNAPSHOT
> ...
> scala> val df = spark.read.format("kafka").option("kafka.bootstrap.servers", 
> "foo").option("subscribe", "bar").load().printSchema()
> root
>  |-- key: binary (nullable = true)
>  |-- value: binary (nullable = true)
>  |-- topic: string (nullable = true)
>  |-- partition: integer (nullable = true)
>  |-- offset: long (nullable = true)
>  |-- timestamp: timestamp (nullable = true)
>  |-- timestampType: integer (nullable = true)
> df: Unit = ()
> {code}
> In the doc, the timestamp type is documented as long.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26889) Fix timestamp type in Structured Streaming + Kafka Integration Guide

2019-02-18 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26889.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23796
[https://github.com/apache/spark/pull/23796]

> Fix timestamp type in Structured Streaming + Kafka Integration Guide
> 
>
> Key: SPARK-26889
> URL: https://issues.apache.org/jira/browse/SPARK-26889
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Trivial
> Fix For: 3.0.0
>
>
> {code:java}
> $ spark-shell --packages 
> org.apache.spark:spark-sql-kafka-0-10_2.11:3.0.0-SNAPSHOT
> ...
> scala> val df = spark.read.format("kafka").option("kafka.bootstrap.servers", 
> "foo").option("subscribe", "bar").load().printSchema()
> root
>  |-- key: binary (nullable = true)
>  |-- value: binary (nullable = true)
>  |-- topic: string (nullable = true)
>  |-- partition: integer (nullable = true)
>  |-- offset: long (nullable = true)
>  |-- timestamp: timestamp (nullable = true)
>  |-- timestampType: integer (nullable = true)
> df: Unit = ()
> {code}
> In the doc, the timestamp type is documented as long.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26353) Add typed aggregate functions(max/min) to the example module

2019-02-18 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26353.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23304
[https://github.com/apache/spark/pull/23304]

> Add typed aggregate functions(max/min) to the example module
> 
>
> Key: SPARK-26353
> URL: https://issues.apache.org/jira/browse/SPARK-26353
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 3.0.0
>
>
> Add typed aggregate functions(max/min) to the example module.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26911) Spark do not see column in table

2019-02-18 Thread Vitaly Larchenkov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitaly Larchenkov updated SPARK-26911:
--
Summary: Spark do not see column in table  (was: Spark )

> Spark do not see column in table
> 
>
> Key: SPARK-26911
> URL: https://issues.apache.org/jira/browse/SPARK-26911
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: PySpark (Spark 2.3.1)
>Reporter: Vitaly Larchenkov
>Priority: Major
>
> Spark cannot find a column that actually exists in the list of input columns
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
> columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
> flid.bank_id, flid.wr_id, flid.link_id]; {code}
>  
>  
> {code:java}
> ---
> Py4JJavaError Traceback (most recent call last)
> /usr/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
> /usr/share/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 327 "An error occurred while calling {0}{1}{2}.\n".
> --> 328 format(target_id, ".", name), value)
> 329 else:
> Py4JJavaError: An error occurred while calling o35.sql.
> : org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
> columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
> flid.bank_id, flid.wr_id, flid.link_id]; line 10 pos 98;
> 'Project ['multiples.id, 'multiples.link_id]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26911) Spark do not see column in table

2019-02-18 Thread Vitaly Larchenkov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitaly Larchenkov updated SPARK-26911:
--
Description: 
 

 

Spark cannot find a column that actually exists in the list of input columns
{code:java}
org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
flid.bank_id, flid.wr_id, flid.link_id]; {code}
 

 
{code:java}
---
Py4JJavaError Traceback (most recent call last)
/usr/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:

/usr/share/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:

Py4JJavaError: An error occurred while calling o35.sql.
: org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
flid.bank_id, flid.wr_id, flid.link_id]; line 10 pos 98;
'Project ['multiples.id, 'multiples.link_id]

{code}
 

Query:
{code:java}
q = f"""
with flid as (
select * from flow_log_by_id
)
select multiples.id, multiples.link_id
from (select fl.id, fl.link_id
from (select id from {flow_log_by_id} group by id having count(*) > 1) multiples
join {flow_log_by_id} fl on fl.id = multiples.id) multiples
join {level_link} ll
on multiples.link_id = ll.link_id_old and ll.link_id_new in (select link_id 
from flid where id = multiples.id)
"""
flow_subset_test_result = spark.sql(q)
{code}
`with flid` is used because without it Spark does not find the `flow_log_by_id` 
table, which looks like another issue. In plain SQL the query works without problems.

  was:
Spark cannot find a column that actually exists in the list of input columns
{code:java}
org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
flid.bank_id, flid.wr_id, flid.link_id]; {code}
 

 
{code:java}
---
Py4JJavaError Traceback (most recent call last)
/usr/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:

/usr/share/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:

Py4JJavaError: An error occurred while calling o35.sql.
: org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
flid.bank_id, flid.wr_id, flid.link_id]; line 10 pos 98;
'Project ['multiples.id, 'multiples.link_id]{code}


> Spark do not see column in table
> 
>
> Key: SPARK-26911
> URL: https://issues.apache.org/jira/browse/SPARK-26911
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: PySpark (Spark 2.3.1)
>Reporter: Vitaly Larchenkov
>Priority: Major
>
>  
>  
> Spark cannot find a column that actually exists in the list of input columns
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
> columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
> flid.bank_id, flid.wr_id, flid.link_id]; {code}
>  
>  
> {code:java}
> ---
> Py4JJavaError Traceback (most recent call last)
> /usr/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
> /usr/share/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 327 "An error occurred while calling {0}{1}{2}.\n".
> --> 328 format(target_id, ".", name), value)
> 329 else:
> Py4JJavaError: An error occurred while calling o35.sql.
> : org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
> columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
> flid.bank_id, flid.wr_id, flid.link_id]; line 10 pos 98;
> 'Project ['multiples.id, 'multiples.link_id]
> {code}
>  
> Query:
>

[jira] [Created] (SPARK-26911) Spark

2019-02-18 Thread Vitaly Larchenkov (JIRA)
Vitaly Larchenkov created SPARK-26911:
-

 Summary: Spark 
 Key: SPARK-26911
 URL: https://issues.apache.org/jira/browse/SPARK-26911
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
 Environment: PySpark (Spark 2.3.1)
Reporter: Vitaly Larchenkov


Spark cannot find a column that actually exists in the list of input columns
{code:java}
org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
flid.bank_id, flid.wr_id, flid.link_id]; {code}
 

 
{code:java}
---
Py4JJavaError Traceback (most recent call last)
/usr/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:

/usr/share/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:

Py4JJavaError: An error occurred while calling o35.sql.
: org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
flid.bank_id, flid.wr_id, flid.link_id]; line 10 pos 98;
'Project ['multiples.id, 'multiples.link_id]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26912) Allow set permission for event_log

2019-02-18 Thread Jackey Lee (JIRA)
Jackey Lee created SPARK-26912:
--

 Summary: Allow set permission for event_log
 Key: SPARK-26912
 URL: https://issues.apache.org/jira/browse/SPARK-26912
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Jackey Lee


The Spark event log is written with 770 permissions by default. This is a problem 
in the following scenarios:
1, the user has some UGIs that cannot be added to the same group
2, the user who created the application does not allow others to see or modify 
it, even if they are in the same group
3, the user wants to allow other UGIs to subscribe to the information
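To illustrate the kind of flexibility being requested, a minimal sketch of a post-hoc workaround using the Hadoop FileSystem API (this is not an existing Spark option, and the event-log path below is a made-up example), e.g. run from spark-shell:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

// Hypothetical workaround: relax the permission of a finished application's event
// log so users outside the owning group can read it (owner rw, group/other r).
val fs = FileSystem.get(new Configuration())
val eventLog = new Path("/spark/eventLog/app-20190218120000-0001")  // example path
fs.setPermission(eventLog, new FsPermission(Integer.parseInt("644", 8).toShort))
{code}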



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26912) Allow setting permission for event_log

2019-02-18 Thread Jackey Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-26912:
---
Summary: Allow setting permission for event_log  (was: Allow set permission 
for event_log)

> Allow setting permission for event_log
> --
>
> Key: SPARK-26912
> URL: https://issues.apache.org/jira/browse/SPARK-26912
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jackey Lee
>Priority: Minor
>
> The Spark event log is written with 770 permissions by default. This is a 
> problem in the following scenarios:
> 1, the user has some UGIs that cannot be added to the same group
> 2, the user who created the application does not allow others to see or 
> modify it, even if they are in the same group
> 3, the user wants to allow other UGIs to subscribe to the information



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26911) Spark do not see column in table

2019-02-18 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770964#comment-16770964
 ] 

Marco Gaido commented on SPARK-26911:
-

Could you please check whether current master is still affected? Moreover, can 
you provide a reproducer? Otherwise it is impossible to investigate the issue. 
Thanks.

> Spark do not see column in table
> 
>
> Key: SPARK-26911
> URL: https://issues.apache.org/jira/browse/SPARK-26911
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: PySpark (Spark 2.3.1)
>Reporter: Vitaly Larchenkov
>Priority: Major
>
> Spark cannot find a column that actually exists in the list of input columns
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
> columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
> flid.bank_id, flid.wr_id, flid.link_id]; {code}
>  
>  
> {code:java}
> ---
> Py4JJavaError Traceback (most recent call last)
> /usr/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
> /usr/share/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 327 "An error occurred while calling {0}{1}{2}.\n".
> --> 328 format(target_id, ".", name), value)
> 329 else:
> Py4JJavaError: An error occurred while calling o35.sql.
> : org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
> columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
> flid.bank_id, flid.wr_id, flid.link_id]; line 10 pos 98;
> 'Project ['multiples.id, 'multiples.link_id]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26913) New data source V2 API: SupportsDirectWrite

2019-02-18 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-26913:
--

 Summary: New data source V2 API: SupportsDirectWrite
 Key: SPARK-26913
 URL: https://issues.apache.org/jira/browse/SPARK-26913
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


Spark supports writing to file data sources without fetching and validating the 
table schema.
For example:
{code:scala}
spark.range(10).write.orc(path)
val newDF = spark.range(20).map(id => (id.toDouble, id.toString)).toDF("double", "string")
newDF.write.mode("overwrite").orc(path)
{code}
1. There is no need to get/infer the schema from the table/path.
2. The schema of `newDF` can be different from the original table schema.


However, from https://github.com/apache/spark/pull/23606/files#r255319992 we 
can see that this behavior is missing in data source V2. Currently, data 
source V2 always validates the output query against the table schema. Even after 
the catalog support of DS V2 is implemented, I think it is hard to support 
both behaviors with the current API/framework. 

This PR proposes to create a new mix-in interface `SupportsDirectWrite`. With 
this interface, Spark will write to the table location directly, without schema 
inference and validation, on `DataFrameWriter.save`.
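A purely illustrative sketch of the marker-interface idea (the surrounding types and the check are my assumptions, not the actual Data Source V2 API):

{code:scala}
// Hypothetical mix-in: a table that wants DataFrameWriter.save to write the incoming
// query's output as-is, skipping schema inference and validation.
trait SupportsDirectWrite

// Stand-in for a writable table abstraction (not a real Spark type).
trait WritableTable { def schema: Seq[String] }

// Sketch of how a planner-side check could branch on the marker.
def validateBeforeWrite(table: WritableTable, querySchema: Seq[String]): Unit = table match {
  case _: SupportsDirectWrite => ()  // write directly to the table location, no validation
  case _ => require(querySchema == table.schema,
    s"Query output $querySchema does not match table schema ${table.schema}")
}
{code}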





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26913) New data source V2 API: SupportsDirectWrite

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26913:


Assignee: (was: Apache Spark)

> New data source V2 API: SupportsDirectWrite
> ---
>
> Key: SPARK-26913
> URL: https://issues.apache.org/jira/browse/SPARK-26913
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Spark supports writing to file data sources without fetching and validating 
> the table schema.
> For example:
> {code:scala}
> spark.range(10).write.orc(path)
> val newDF = spark.range(20).map(id => (id.toDouble, id.toString)).toDF("double", "string")
> newDF.write.mode("overwrite").orc(path)
> {code}
> 1. There is no need to get/infer the schema from the table/path.
> 2. The schema of `newDF` can be different from the original table schema.
> However, from https://github.com/apache/spark/pull/23606/files#r255319992 we 
> can see that this behavior is missing in data source V2. Currently, data 
> source V2 always validates the output query against the table schema. Even 
> after the catalog support of DS V2 is implemented, I think it is hard to 
> support both behaviors with the current API/framework. 
> This PR proposes to create a new mix-in interface `SupportsDirectWrite`. With 
> this interface, Spark will write to the table location directly, without 
> schema inference and validation, on `DataFrameWriter.save`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26913) New data source V2 API: SupportsDirectWrite

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26913:


Assignee: Apache Spark

> New data source V2 API: SupportsDirectWrite
> ---
>
> Key: SPARK-26913
> URL: https://issues.apache.org/jira/browse/SPARK-26913
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Spark supports writing to file data sources without fetching and validating 
> the table schema.
> For example:
> {code:scala}
> spark.range(10).write.orc(path)
> val newDF = spark.range(20).map(id => (id.toDouble, id.toString)).toDF("double", "string")
> newDF.write.mode("overwrite").orc(path)
> {code}
> 1. There is no need to get/infer the schema from the table/path.
> 2. The schema of `newDF` can be different from the original table schema.
> However, from https://github.com/apache/spark/pull/23606/files#r255319992 we 
> can see that this behavior is missing in data source V2. Currently, data 
> source V2 always validates the output query against the table schema. Even 
> after the catalog support of DS V2 is implemented, I think it is hard to 
> support both behaviors with the current API/framework. 
> This PR proposes to create a new mix-in interface `SupportsDirectWrite`. With 
> this interface, Spark will write to the table location directly, without 
> schema inference and validation, on `DataFrameWriter.save`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26912) Allow setting permission for event_log

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26912:


Assignee: (was: Apache Spark)

> Allow setting permission for event_log
> --
>
> Key: SPARK-26912
> URL: https://issues.apache.org/jira/browse/SPARK-26912
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jackey Lee
>Priority: Minor
>
> The Spark event log is written with 770 permissions by default. This is a 
> problem in the following scenarios:
> 1, the user has some UGIs that cannot be added to the same group
> 2, the user who created the application does not allow others to see or 
> modify it, even if they are in the same group
> 3, the user wants to allow other UGIs to subscribe to the information



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26912) Allow setting permission for event_log

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26912:


Assignee: Apache Spark

> Allow setting permission for event_log
> --
>
> Key: SPARK-26912
> URL: https://issues.apache.org/jira/browse/SPARK-26912
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jackey Lee
>Assignee: Apache Spark
>Priority: Minor
>
> The Spark event log is written with 770 permissions by default. This is a 
> problem in the following scenarios:
> 1, the user has some UGIs that cannot be added to the same group
> 2, the user who created the application does not allow others to see or 
> modify it, even if they are in the same group
> 3, the user wants to allow other UGIs to subscribe to the information



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit

2019-02-18 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771008#comment-16771008
 ] 

Jungtaek Lim commented on SPARK-15544:
--

I'm interested in this issue, but I guess the point is not just letting the 
Master avoid shutting down when leadership has been revoked, but also handling 
the various H/A situations in the event handler, so it may require an 
understanding of how Spark H/A deals with such situations as of now.
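For context, a minimal sketch (not Spark's actual ZooKeeperLeaderElectionAgent) of how a Curator connection-state listener can treat SUSPENDED, which is what a single bounced ZooKeeper node produces in the log below, differently from LOST:

{code:scala}
import org.apache.curator.framework.CuratorFramework
import org.apache.curator.framework.state.{ConnectionState, ConnectionStateListener}

// Illustrative only: do not give up leadership on SUSPENDED; only LOST means the
// ZooKeeper session is really gone.
class HaAwareListener(onLeadershipLost: () => Unit) extends ConnectionStateListener {
  override def stateChanged(client: CuratorFramework, newState: ConnectionState): Unit =
    newState match {
      case ConnectionState.SUSPENDED =>
        // One server dropped the connection; the session may still recover via
        // another quorum member, so keep the Master alive and wait.
      case ConnectionState.RECONNECTED =>
        // Session survived the bounce; nothing to do.
      case ConnectionState.LOST =>
        onLeadershipLost()  // session expired: leadership is really revoked
      case _ =>
    }
}
{code}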

> Bouncing Zookeeper node causes Active spark master to exit
> --
>
> Key: SPARK-15544
> URL: https://issues.apache.org/jira/browse/SPARK-15544
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: Ubuntu 14.04.  Zookeeper 3.4.6 with 3-node quorum
>Reporter: Steven Lowenthal
>Priority: Major
>
> Shutting Down a single zookeeper node caused spark master to exit.  The 
> master should have connected to a second zookeeper node. 
> {code:title=log output}
> 16/05/25 18:21:28 INFO master.Master: Launching executor 
> app-20160525182128-0006/1 on worker worker-20160524013212-10.16.28.76-59138
> 16/05/25 18:21:28 INFO master.Master: Launching executor 
> app-20160525182128-0006/2 on worker worker-20160524013204-10.16.21.217-47129
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data 
> from server sessionid 0x154dfc0426b0054, likely server has closed socket, 
> closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data 
> from server sessionid 0x254c701f28d0053, likely server has closed socket, 
> closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO master.ZooKeeperLeaderElectionAgent: We have lost 
> leadership
> 16/05/26 00:16:01 ERROR master.Master: Leadership has been revoked -- master 
> shutting down. }}
> {code}
> spark-env.sh: 
> {code:title=spark-env.sh}
> export SPARK_LOCAL_DIRS=/ephemeral/spark/local
> export SPARK_WORKER_DIR=/ephemeral/spark/work
> export SPARK_LOG_DIR=/var/log/spark
> export HADOOP_CONF_DIR=/home/ubuntu/hadoop-2.6.3/etc/hadoop
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER 
> -Dspark.deploy.zookeeper.url=gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181"
> export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26914) ThriftServer scheduler pool may be unpredictably when using fair schedule mode

2019-02-18 Thread zhoukang (JIRA)
zhoukang created SPARK-26914:


 Summary: ThriftServer scheduler pool may be unpredictably when 
using fair schedule mode
 Key: SPARK-26914
 URL: https://issues.apache.org/jira/browse/SPARK-26914
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: zhoukang


When using fair scheduler mode for the Thrift server, we may get unpredictable 
results.

{code:java}
val pool = sessionToActivePool.get(parentSession.getSessionHandle)
if (pool != null) {
  sqlContext.sparkContext.setLocalProperty(SparkContext.SPARK_SCHEDULER_POOL, pool)
}
{code}

Here is an example:
Some queries that should use the default pool are submitted to the 'normal' pool 
instead.

I changed the code, added some logging, and got some strange results.

Then I found out that the localProperties of SparkContext can hold unpredictable 
values when setLocalProperty is called, and since the Thrift server uses a thread 
pool to execute queries, this bug is triggered sometimes.

{code:java}
/**
   * Set a local property that affects jobs submitted from this thread, such as 
the Spark fair
   * scheduler pool. User-defined properties may also be set here. These 
properties are propagated
   * through to worker tasks and can be accessed there via
   * [[org.apache.spark.TaskContext#getLocalProperty]].
   *
   * These properties are inherited by child threads spawned from this thread. 
This
   * may have unexpected consequences when working with thread pools. The 
standard java
   * implementation of thread pools have worker threads spawn other worker 
threads.
   * As a result, local properties may propagate unpredictably.
   */
  def setLocalProperty(key: String, value: String) {
if (value == null) {
  localProperties.get.remove(key)
} else {
  localProperties.get.setProperty(key, value)
}
  }
{code}
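To make the propagation problem above concrete, a standalone sketch (not the Thrift server code; SparkContext.localProperties is an InheritableThreadLocal, so the same mechanism applies): a pool worker thread created while the caller had a value set keeps that inherited value for every later task it runs.

{code:scala}
import java.util.concurrent.Executors

// Standalone illustration of InheritableThreadLocal vs. thread pools.
object LocalPropertyDemo extends App {
  val poolName = new InheritableThreadLocal[String] {
    override def initialValue(): String = "default"
  }
  val executor = Executors.newFixedThreadPool(1)

  poolName.set("normal")                       // caller "selects" the normal pool
  executor.submit(new Runnable {               // worker thread is created here and inherits "normal"
    def run(): Unit = println(s"task1 sees: ${poolName.get}")
  })

  poolName.set("default")                      // caller switches back to default
  executor.submit(new Runnable {               // same worker thread: still sees the old inherited value
    def run(): Unit = println(s"task2 sees: ${poolName.get}")
  }).get()                                     // prints "task1 sees: normal" then "task2 sees: normal"

  executor.shutdown()
}
{code}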
   




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26914) ThriftServer scheduler pool may be unpredictably when using fair schedule mode

2019-02-18 Thread zhoukang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-26914:
-
Attachment: 26914-03.png
26914-02.png
26914-01.png

> ThriftServer scheduler pool may be unpredictably when using fair schedule mode
> --
>
> Key: SPARK-26914
> URL: https://issues.apache.org/jira/browse/SPARK-26914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: zhoukang
>Priority: Major
> Attachments: 26914-01.png, 26914-02.png, 26914-03.png
>
>
> When using fair scheduler mode for the Thrift server, we may get unpredictable 
> results.
> {code:java}
> val pool = sessionToActivePool.get(parentSession.getSessionHandle)
> if (pool != null) {
>   sqlContext.sparkContext.setLocalProperty(SparkContext.SPARK_SCHEDULER_POOL, pool)
> }
> {code}
> Here is an example:
> Some queries that should use the default pool are submitted to the 'normal' 
> pool instead.
> I changed the code, added some logging, and got some strange results.
> Then I found out that the localProperties of SparkContext can hold 
> unpredictable values when setLocalProperty is called, and since the Thrift 
> server uses a thread pool to execute queries, this bug is triggered sometimes.
> {code:java}
> /**
>* Set a local property that affects jobs submitted from this thread, such 
> as the Spark fair
>* scheduler pool. User-defined properties may also be set here. These 
> properties are propagated
>* through to worker tasks and can be accessed there via
>* [[org.apache.spark.TaskContext#getLocalProperty]].
>*
>* These properties are inherited by child threads spawned from this 
> thread. This
>* may have unexpected consequences when working with thread pools. The 
> standard java
>* implementation of thread pools have worker threads spawn other worker 
> threads.
>* As a result, local properties may propagate unpredictably.
>*/
>   def setLocalProperty(key: String, value: String) {
> if (value == null) {
>   localProperties.get.remove(key)
> } else {
>   localProperties.get.setProperty(key, value)
> }
>   }
> {code}
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26914) ThriftServer scheduler pool may be unpredictably when using fair schedule mode

2019-02-18 Thread zhoukang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-26914:
-
Description: 
When using fair scheduler mode for the Thrift server, the scheduler pool a 
query runs in can be unpredictable.

{code:java}
val pool = sessionToActivePool.get(parentSession.getSessionHandle)
if (pool != null) {
  
sqlContext.sparkContext.setLocalProperty(SparkContext.SPARK_SCHEDULER_POOL, 
pool)
}
{code}

Here is an example:
We have a query that should use the default pool, but it is submitted to the 
'normal' pool instead.
 !26914-02.png! 

I changed the code, added some logging, and got some strange results.
 !26914-01.png! 
 !26914-03.png! 

Then I found out that the localProperties of SparkContext can hold an unexpected 
value when setLocalProperty is called, and since the Thrift server uses a thread 
pool to execute queries, this bug is triggered intermittently.

{code:java}
/**
   * Set a local property that affects jobs submitted from this thread, such as 
the Spark fair
   * scheduler pool. User-defined properties may also be set here. These 
properties are propagated
   * through to worker tasks and can be accessed there via
   * [[org.apache.spark.TaskContext#getLocalProperty]].
   *
   * These properties are inherited by child threads spawned from this thread. 
This
   * may have unexpected consequences when working with thread pools. The 
standard java
   * implementation of thread pools have worker threads spawn other worker 
threads.
   * As a result, local properties may propagate unpredictably.
   */
  def setLocalProperty(key: String, value: String) {
if (value == null) {
  localProperties.get.remove(key)
} else {
  localProperties.get.setProperty(key, value)
}
  }
{code}
   


  was:
When using fair scheduler mode for the Thrift server, the scheduler pool a 
query runs in can be unpredictable.

{code:java}
val pool = sessionToActivePool.get(parentSession.getSessionHandle)
if (pool != null) {
  
sqlContext.sparkContext.setLocalProperty(SparkContext.SPARK_SCHEDULER_POOL, 
pool)
}
{code}

Here is an example:
We have a query that should use the default pool, but it is submitted to the 
'normal' pool instead.

I changed the code, added some logging, and got some strange results.

Then I found out that the localProperties of SparkContext can hold an unexpected 
value when setLocalProperty is called, and since the Thrift server uses a thread 
pool to execute queries, this bug is triggered intermittently.

{code:java}
/**
   * Set a local property that affects jobs submitted from this thread, such as 
the Spark fair
   * scheduler pool. User-defined properties may also be set here. These 
properties are propagated
   * through to worker tasks and can be accessed there via
   * [[org.apache.spark.TaskContext#getLocalProperty]].
   *
   * These properties are inherited by child threads spawned from this thread. 
This
   * may have unexpected consequences when working with thread pools. The 
standard java
   * implementation of thread pools have worker threads spawn other worker 
threads.
   * As a result, local properties may propagate unpredictably.
   */
  def setLocalProperty(key: String, value: String) {
if (value == null) {
  localProperties.get.remove(key)
} else {
  localProperties.get.setProperty(key, value)
}
  }
{code}
   



> ThriftServer scheduler pool may be unpredictable when using fair scheduler mode
> --
>
> Key: SPARK-26914
> URL: https://issues.apache.org/jira/browse/SPARK-26914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: zhoukang
>Priority: Major
> Attachments: 26914-01.png, 26914-02.png, 26914-03.png
>
>
> When using fair scheduler mode for the Thrift server, the scheduler pool a 
> query runs in can be unpredictable.
> {code:java}
> val pool = sessionToActivePool.get(parentSession.getSessionHandle)
> if (pool != null) {
>   
> sqlContext.sparkContext.setLocalProperty(SparkContext.SPARK_SCHEDULER_POOL, 
> pool)
> }
> {code}
> Here is an example:
> We have a query that should use the default pool, but it is submitted to the 
> 'normal' pool instead.
>  !26914-02.png! 
> I changed the code, added some logging, and got some strange results.
>  !26914-01.png! 
>  !26914-03.png! 
> Then I found out that the localProperties of SparkContext can hold an 
> unexpected value when setLocalProperty is called, and since the Thrift server 
> uses a thread pool to execute queries, this bug is triggered intermittently.
> {code:java}
> /**
>* Set a local property that affects jobs submitted from this thread, such 
> as the Spark fair
>* scheduler pool. User-defined properties may also be set here. These 
> properties are propagated
>* through to worker tasks and can be accessed there via
>* [[org.apache.spark.TaskContext#getLocalProperty]].
>*
>* These properties are inherited by child threads spawned from this 
> thread. This
>* may have unexpected consequences when working with thread pools. The 
> standard java
>* implementation of thread pools have worker threads spawn other worker 
> threads.
>* As a result, local properties may propagate unpredictably.
>*/
>   def setLocalProperty(key: String, value: String) {
> if (value == null) {
>   localProperties.get.remove(key)
> } else {
>   localProperties.get.setProperty(key, value)
> }
>   }
> {code}
>

[jira] [Assigned] (SPARK-26914) ThriftServer scheduler pool may be unpredictable when using fair scheduler mode

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26914:


Assignee: Apache Spark

> ThriftServer scheduler pool may be unpredictable when using fair scheduler mode
> --
>
> Key: SPARK-26914
> URL: https://issues.apache.org/jira/browse/SPARK-26914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: zhoukang
>Assignee: Apache Spark
>Priority: Major
> Attachments: 26914-01.png, 26914-02.png, 26914-03.png
>
>
> When using fair scheduler mode for the Thrift server, the scheduler pool a 
> query runs in can be unpredictable.
> {code:java}
> val pool = sessionToActivePool.get(parentSession.getSessionHandle)
> if (pool != null) {
>   
> sqlContext.sparkContext.setLocalProperty(SparkContext.SPARK_SCHEDULER_POOL, 
> pool)
> }
> {code}
> Here is an example:
> We have a query that should use the default pool, but it is submitted to the 
> 'normal' pool instead.
>  !26914-02.png! 
> I changed the code, added some logging, and got some strange results.
>  !26914-01.png! 
>  !26914-03.png! 
> Then I found out that the localProperties of SparkContext can hold an 
> unexpected value when setLocalProperty is called, and since the Thrift server 
> uses a thread pool to execute queries, this bug is triggered intermittently.
> {code:java}
> /**
>* Set a local property that affects jobs submitted from this thread, such 
> as the Spark fair
>* scheduler pool. User-defined properties may also be set here. These 
> properties are propagated
>* through to worker tasks and can be accessed there via
>* [[org.apache.spark.TaskContext#getLocalProperty]].
>*
>* These properties are inherited by child threads spawned from this 
> thread. This
>* may have unexpected consequences when working with thread pools. The 
> standard java
>* implementation of thread pools have worker threads spawn other worker 
> threads.
>* As a result, local properties may propagate unpredictably.
>*/
>   def setLocalProperty(key: String, value: String) {
> if (value == null) {
>   localProperties.get.remove(key)
> } else {
>   localProperties.get.setProperty(key, value)
> }
>   }
> {code}
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26914) ThriftServer scheduler pool may be unpredictable when using fair scheduler mode

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26914:


Assignee: (was: Apache Spark)

> ThriftServer scheduler pool may be unpredictable when using fair scheduler mode
> --
>
> Key: SPARK-26914
> URL: https://issues.apache.org/jira/browse/SPARK-26914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: zhoukang
>Priority: Major
> Attachments: 26914-01.png, 26914-02.png, 26914-03.png
>
>
> When using fair scheduler mode for the Thrift server, the scheduler pool a 
> query runs in can be unpredictable.
> {code:java}
> val pool = sessionToActivePool.get(parentSession.getSessionHandle)
> if (pool != null) {
>   
> sqlContext.sparkContext.setLocalProperty(SparkContext.SPARK_SCHEDULER_POOL, 
> pool)
> }
> {code}
> Here is an example:
> We have a query that should use the default pool, but it is submitted to the 
> 'normal' pool instead.
>  !26914-02.png! 
> I changed the code, added some logging, and got some strange results.
>  !26914-01.png! 
>  !26914-03.png! 
> Then I found out that the localProperties of SparkContext can hold an 
> unexpected value when setLocalProperty is called, and since the Thrift server 
> uses a thread pool to execute queries, this bug is triggered intermittently.
> {code:java}
> /**
>* Set a local property that affects jobs submitted from this thread, such 
> as the Spark fair
>* scheduler pool. User-defined properties may also be set here. These 
> properties are propagated
>* through to worker tasks and can be accessed there via
>* [[org.apache.spark.TaskContext#getLocalProperty]].
>*
>* These properties are inherited by child threads spawned from this 
> thread. This
>* may have unexpected consequences when working with thread pools. The 
> standard java
>* implementation of thread pools have worker threads spawn other worker 
> threads.
>* As a result, local properties may propagate unpredictably.
>*/
>   def setLocalProperty(key: String, value: String) {
> if (value == null) {
>   localProperties.get.remove(key)
> } else {
>   localProperties.get.setProperty(key, value)
> }
>   }
> {code}
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit

2019-02-18 Thread Moein Hosseini (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771035#comment-16771035
 ] 

Moein Hosseini commented on SPARK-15544:


[~srowen] I've started to work on it. It seems to come from Curator's 
LeaderLatch.
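
For anyone following along, a minimal Curator sketch of where the behaviour originates (the connection string, latch path and retry settings are placeholders): with Curator's standard connection-state error policy, a merely SUSPENDED ZooKeeper connection already makes the LeaderLatch report notLeader(), which is the event the Spark master turns into "We have lost leadership" and a shutdown.

{code:java}
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.leader.{LeaderLatch, LeaderLatchListener}
import org.apache.curator.retry.ExponentialBackoffRetry

object LeaderLatchDemo {
  def main(args: Array[String]): Unit = {
    // Placeholder quorum and latch path.
    val client = CuratorFrameworkFactory.newClient(
      "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3))
    client.start()

    val latch = new LeaderLatch(client, "/demo/leader-election")
    latch.addListener(new LeaderLatchListener {
      override def isLeader(): Unit = println("acquired leadership")
      // With the standard error policy this also fires when the connection is merely
      // SUSPENDED (e.g. one bounced ZooKeeper node), not only when the session is LOST.
      override def notLeader(): Unit = println("lost leadership")
    })
    latch.start()
  }
}
{code}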

> Bouncing Zookeeper node causes Active spark master to exit
> --
>
> Key: SPARK-15544
> URL: https://issues.apache.org/jira/browse/SPARK-15544
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: Ubuntu 14.04.  Zookeeper 3.4.6 with 3-node quorum
>Reporter: Steven Lowenthal
>Priority: Major
>
> Shutting Down a single zookeeper node caused spark master to exit.  The 
> master should have connected to a second zookeeper node. 
> {code:title=log output}
> 16/05/25 18:21:28 INFO master.Master: Launching executor 
> app-20160525182128-0006/1 on worker worker-20160524013212-10.16.28.76-59138
> 16/05/25 18:21:28 INFO master.Master: Launching executor 
> app-20160525182128-0006/2 on worker worker-20160524013204-10.16.21.217-47129
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data 
> from server sessionid 0x154dfc0426b0054, likely server has closed socket, 
> closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data 
> from server sessionid 0x254c701f28d0053, likely server has closed socket, 
> closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO master.ZooKeeperLeaderElectionAgent: We have lost 
> leadership
> 16/05/26 00:16:01 ERROR master.Master: Leadership has been revoked -- master 
> shutting down. }}
> {code}
> spark-env.sh: 
> {code:title=spark-env.sh}
> export SPARK_LOCAL_DIRS=/ephemeral/spark/local
> export SPARK_WORKER_DIR=/ephemeral/spark/work
> export SPARK_LOG_DIR=/var/log/spark
> export HADOOP_CONF_DIR=/home/ubuntu/hadoop-2.6.3/etc/hadoop
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER 
> -Dspark.deploy.zookeeper.url=gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181"
> export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26880) dataDF.queryExecution.toRdd corrupt rows

2019-02-18 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26880.
--
Resolution: Invalid

> dataDF.queryExecution.toRdd corrupt rows
> 
>
> Key: SPARK-26880
> URL: https://issues.apache.org/jira/browse/SPARK-26880
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Grant Henke
>Priority: Major
>
> I have seen a simple case where InternalRows returned by 
> `queryExecution.toRdd` are corrupt. Some rows are duplicated while others are 
> missing. 
> This simple test illustrates the issue:
> {code}
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> import org.apache.spark.sql.types.IntegerType
> import org.apache.spark.sql.types.StringType
> import org.apache.spark.sql.types.StructField
> import org.apache.spark.sql.types.StructType
> import org.junit.Assert._
> import org.junit.Test
> import org.scalatest.Matchers
> import org.scalatest.junit.JUnitSuite
> import org.slf4j.Logger
> import org.slf4j.LoggerFactory
> class SparkTest extends JUnitSuite with Matchers {
>   val Log: Logger = LoggerFactory.getLogger(getClass)
>   @Test
>   def testSparkRowCorruption(): Unit = {
> val conf = new SparkConf()
>   .setMaster("local[*]")
>   .setAppName("test")
>   .set("spark.ui.enabled", "false")
> val ss = SparkSession.builder().config(conf).getOrCreate()
> // Setup a DataFrame for testing.
> val data = Seq(
>   Row.fromSeq(Seq(0, "0")),
>   Row.fromSeq(Seq(25, "25")),
>   Row.fromSeq(Seq(50, "50")),
>   Row.fromSeq(Seq(75, "75")),
>   Row.fromSeq(Seq(99, "99")),
>   Row.fromSeq(Seq(100, "100")),
>   Row.fromSeq(Seq(101, "101")),
>   Row.fromSeq(Seq(125, "125")),
>   Row.fromSeq(Seq(150, "150")),
>   Row.fromSeq(Seq(175, "175")),
>   Row.fromSeq(Seq(199, "199"))
> )
> val dataRDD = ss.sparkContext.parallelize(data)
> val schema = StructType(
>   Seq(
> StructField("key", IntegerType),
> StructField("value", StringType)
>   ))
> val dataDF = ss.sqlContext.createDataFrame(dataRDD, schema)
> // Convert to an RDD.
> val rdd = dataDF.queryExecution.toRdd
> 
> // Collect the data to compare.
> val resultData = rdd.collect
> resultData.foreach { row =>
>   // Log for visualizing the corruption.
>   Log.error(s"${row.getInt(0)}")
> }
> // Ensure the keys in the original data and resulting data match.
> val dataKeys = data.map(_.getInt(0)).toSet
> val resultKeys = resultData.map(_.getInt(0)).toSet
> assertEquals(dataKeys, resultKeys)
>   }
> }
> {code}
> That test fails with the following:
> {noformat}
> 10:38:26.967 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 0
> 10:38:26.967 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 25
> 10:38:26.967 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 75
> 10:38:26.968 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 75
> 10:38:26.968 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 99
> 10:38:26.968 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 100
> 10:38:26.969 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 125
> 10:38:26.969 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 125
> 10:38:26.969 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 150
> 10:38:26.970 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 199
> 10:38:26.970 [ERROR - Test worker] (PartitionByInternalRowTest.scala:57) 199
> expected:<Set(101, 0, 25, 125, 150, 50, 199, 175, 99, 75, 100)> but 
> was:<Set(0, 25, 125, 150, 199, 99, 75, 100)>
> Expected :Set(101, 0, 25, 125, 150, 50, 199, 175, 99, 75, 100)
> Actual   :Set(0, 25, 125, 150, 199, 99, 75, 100)
> {noformat}
> If I map from and InternalRow to a Row the issue goes away:
> {code}
> val rdd = dataDF.queryExecution.toRdd.mapPartitions { internalRows =>
>val encoder = RowEncoder.apply(schema).resolveAndBind()
>internalRows.map(encoder.fromRow)
> }
> {code}
> Converting with CatalystTypeConverters also appears to resolve the issue:
> {code}
> val rdd = dataDF.queryExecution.toRdd.mapPartitions { internalRows =>
>val typeConverter = CatalystTypeConverters.createToScalaConverter(schema)
>internalRows.map(ir => typeConverter(ir).asInstanceOf[Row])
> }
> {code}
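
A note on the Invalid resolution (this is an inference from the symptoms, not the resolver's explanation): queryExecution.toRdd is an internal API, and the InternalRow/UnsafeRow objects it hands back may be reused across records, so collecting them without copying can show duplicated and missing values exactly as above. Copying each row first avoids it, much like the two workarounds in the description:

{code:java}
// Make defensive copies before collecting, since the internal rows may be reused.
val rdd = dataDF.queryExecution.toRdd.map(_.copy())
val resultData = rdd.collect()
{code}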



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24783) spark.sql.shuffle.partitions=0 should throw exception

2019-02-18 Thread langjingxiang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771070#comment-16771070
 ] 

langjingxiang commented on SPARK-24783:
---

OK, let me add a check for that.

> spark.sql.shuffle.partitions=0 should throw exception
> -
>
> Key: SPARK-24783
> URL: https://issues.apache.org/jira/browse/SPARK-24783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Avi minsky
>Priority: Major
>
> if spark.sql.shuffle.partitions=0 and trying to join tables (not broadcast 
> join)
> *result join will be an empty table.*
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24783) spark.sql.shuffle.partitions=0 should throw exception

2019-02-18 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771085#comment-16771085
 ] 

Jungtaek Lim commented on SPARK-24783:
--

Do we ever have a reason to set the value to 0 or to a negative value 
intentionally? If not, the needed change would be trivial: add a checkValue to 
`SHUFFLE_PARTITIONS` requiring "value > 0".
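
A sketch of the kind of change being described, using the config DSL as it appears in SQLConf; the doc string here is abbreviated and illustrative, and the existing builder chain around SHUFFLE_PARTITIONS may differ in detail:

{code:java}
val SHUFFLE_PARTITIONS = buildConf("spark.sql.shuffle.partitions")
  .doc("The default number of partitions to use when shuffling data for joins or aggregations.")
  .intConf
  .checkValue(_ > 0, "The value of spark.sql.shuffle.partitions must be positive")
  .createWithDefault(200)
{code}

With that in place, setting the value to 0 or a negative number should fail fast with an IllegalArgumentException instead of silently producing an empty join.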

> spark.sql.shuffle.partitions=0 should throw exception
> -
>
> Key: SPARK-24783
> URL: https://issues.apache.org/jira/browse/SPARK-24783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Avi minsky
>Priority: Major
>
> if spark.sql.shuffle.partitions=0 and trying to join tables (not broadcast 
> join)
> *result join will be an empty table.*
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26881) Scaling issue with Gramian computation for RowMatrix: too many results sent to driver

2019-02-18 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771094#comment-16771094
 ] 

Sean Owen commented on SPARK-26881:
---

It's related but not quite the same issue. (You should try 2.3.3 at least, with 
the other fix, for a perf improvement.)

To be clear, this proposes a deeper treeAggregate in some cases? Makes sense. 
The cost is more overall wall-clock time and a little more network traffic. But 
I agree that it needs to be capped by the max result size, as you say. Go 
ahead with that. 


> Scaling issue with Gramian computation for RowMatrix: too many results sent 
> to driver
> -
>
> Key: SPARK-26881
> URL: https://issues.apache.org/jira/browse/SPARK-26881
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Rafael RENAUDIN-AVINO
>Priority: Minor
>
> This issue hit me when running PCA on large dataset (~1Billion rows, ~30k 
> columns).
> Computing Gramian of a big RowMatrix allows to reproduce the issue.
>  
> The problem arises in the treeAggregate phase of the gramian matrix 
> computation: results sent to driver are enormous.
> A potential solution to this could be to replace the hard coded depth (2) of 
> the tree aggregation by a heuristic computed based on the number of 
> partitions, driver max result size, and memory size of the dense vectors that 
> are being aggregated, cf below for more detail:
> (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size
> I have a potential fix ready (currently testing it at scale), but I'd like to 
> hear the community opinion about such a fix to know if it's worth investing 
> my time into a clean pull request.
>  
> Note that I only faced this issue with Spark 2.2, but I suspect it affects 
> later versions as well. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26915) File source should write without schema inference and validation in DataFrameWriter.save()

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26915:


Assignee: Apache Spark

> File source should write without schema inference and validation in 
> DataFrameWriter.save()
> --
>
> Key: SPARK-26915
> URL: https://issues.apache.org/jira/browse/SPARK-26915
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Spark supports writing to file data sources without getting the table schema 
> or validating against it.
> For example, 
> ```
> spark.range(10).write.orc(path)
> val newDF = spark.range(20).map(id => (id.toDouble, 
> id.toString)).toDF("double", "string")
> newDF.write.mode("overwrite").orc(path)
> ```
> 1. There is no need to get/infer the schema from the table/path
> 2. The schema of `newDF` can be different from the original table schema.
> However, from https://github.com/apache/spark/pull/23606/files#r255319992 we 
> can see that the feature above is missing in data source V2. Currently, data 
> source V2 always validates the output query with the table schema. Even after 
> the catalog support of DS V2 is implemented, I think it is hard to support 
> both behaviors with the current API/framework.
> This PR proposes to process file sources as a special case in 
> `DataFrameWriter.save()`, so that we can keep the original behavior for this 
> DataFrame API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26915) File source should write without schema inference and validation in DataFrameWriter.save()

2019-02-18 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-26915:
--

 Summary: File source should write without schema inference and 
validation in DataFrameWriter.save()
 Key: SPARK-26915
 URL: https://issues.apache.org/jira/browse/SPARK-26915
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


Spark supports writing to file data sources without getting the table schema or 
validating against it.
For example, 
```
spark.range(10).write.orc(path)
val newDF = spark.range(20).map(id => (id.toDouble, 
id.toString)).toDF("double", "string")
newDF.write.mode("overwrite").orc(path)
```
1. There is no need to get/infer the schema from the table/path
2. The schema of `newDF` can be different from the original table schema.

However, from https://github.com/apache/spark/pull/23606/files#r255319992 we 
can see that the feature above is missing in data source V2. Currently, data 
source V2 always validates the output query with the table schema. Even after 
the catalog support of DS V2 is implemented, I think it is hard to support both 
behaviors with the current API/framework.

This PR proposes to process file sources as a special case in 
`DataFrameWriter.save()`, so that we can keep the original behavior for this 
DataFrame API.
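
For readers trying this out, a self-contained version of the example above (the local master, app name and output path are placeholders for this sketch); reading the path back shows that the overwrite simply replaces the files with the new, incompatible schema rather than validating against the old one:

{code:java}
import org.apache.spark.sql.SparkSession

object OverwriteSchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("overwrite-schema-demo")
      .getOrCreate()
    import spark.implicits._
    val path = "/tmp/overwrite-schema-demo"  // placeholder output path

    spark.range(10).write.mode("overwrite").orc(path)
    val newDF = spark.range(20).map(id => (id.toDouble, id.toString)).toDF("double", "string")
    newDF.write.mode("overwrite").orc(path)  // no schema inference/validation against the existing files

    spark.read.orc(path).printSchema()       // prints the new (double, string) schema
    spark.stop()
  }
}
{code}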



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26915) File source should write without schema inference and validation in DataFrameWriter.save()

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26915:


Assignee: (was: Apache Spark)

> File source should write without schema inference and validation in 
> DataFrameWriter.save()
> --
>
> Key: SPARK-26915
> URL: https://issues.apache.org/jira/browse/SPARK-26915
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Spark supports writing to file data sources without getting the table schema 
> or validating against it.
> For example, 
> ```
> spark.range(10).write.orc(path)
> val newDF = spark.range(20).map(id => (id.toDouble, 
> id.toString)).toDF("double", "string")
> newDF.write.mode("overwrite").orc(path)
> ```
> 1. There is no need to get/infer the schema from the table/path
> 2. The schema of `newDF` can be different from the original table schema.
> However, from https://github.com/apache/spark/pull/23606/files#r255319992 we 
> can see that the feature above is missing in data source V2. Currently, data 
> source V2 always validates the output query with the table schema. Even after 
> the catalog support of DS V2 is implemented, I think it is hard to support 
> both behaviors with the current API/framework.
> This PR proposes to process file sources as a special case in 
> `DataFrameWriter.save()`, so that we can keep the original behavior for this 
> DataFrame API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26881) Scaling issue with Gramian computation for RowMatrix: too many results sent to driver

2019-02-18 Thread Rafael RENAUDIN-AVINO (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771138#comment-16771138
 ] 

Rafael RENAUDIN-AVINO commented on SPARK-26881:
---

Basically I see two improvements that could be made:
1- Allow the depth of the aggregation to be configurable by the user
2- If not specified by the user, the depth of the treeAggregate can be computed 
with the heuristic in the ticket description.

 

I'm not sure about the value of 2- without integrating 1-: if a framework uses 
a heuristic to determine a parameter without giving the user the option of 
setting the parameter himself, it might be annoying...
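
For concreteness, a small sketch of the heuristic from the ticket description, rearranged to solve for the depth; the parameter names are invented for this example, and the floor of 2 keeps the currently hard-coded depth as a lower bound. From numPartitions^(1/depth) * vectorSizeBytes <= maxResultBytes it follows that depth >= log(numPartitions) / log(maxResultBytes / vectorSizeBytes):

{code:java}
/** Depth such that numPartitions^(1/depth) * vectorSizeBytes <= maxResultBytes,
 *  i.e. the heuristic from the ticket description solved for the depth. */
def suggestedDepth(numPartitions: Int, vectorSizeBytes: Long, maxResultBytes: Long): Int = {
  require(maxResultBytes > vectorSizeBytes, "a single partial result already exceeds the limit")
  val ratio = maxResultBytes.toDouble / vectorSizeBytes
  val depth = math.ceil(math.log(numPartitions) / math.log(ratio)).toInt
  math.max(depth, 2)  // never go below the currently hard-coded depth of 2
}
{code}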

> Scaling issue with Gramian computation for RowMatrix: too many results sent 
> to driver
> -
>
> Key: SPARK-26881
> URL: https://issues.apache.org/jira/browse/SPARK-26881
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Rafael RENAUDIN-AVINO
>Priority: Minor
>
> This issue hit me when running PCA on large dataset (~1Billion rows, ~30k 
> columns).
> Computing Gramian of a big RowMatrix allows to reproduce the issue.
>  
> The problem arises in the treeAggregate phase of the gramian matrix 
> computation: results sent to driver are enormous.
> A potential solution to this could be to replace the hard coded depth (2) of 
> the tree aggregation by a heuristic computed based on the number of 
> partitions, driver max result size, and memory size of the dense vectors that 
> are being aggregated, cf below for more detail:
> (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size
> I have a potential fix ready (currently testing it at scale), but I'd like to 
> hear the community opinion about such a fix to know if it's worth investing 
> my time into a clean pull request.
>  
> Note that I only faced this issue with Spark 2.2, but I suspect it affects 
> later versions as well. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26881) Scaling issue with Gramian computation for RowMatrix: too many results sent to driver

2019-02-18 Thread Rafael RENAUDIN-AVINO (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771138#comment-16771138
 ] 

Rafael RENAUDIN-AVINO edited comment on SPARK-26881 at 2/18/19 2:49 PM:


Basically I see two improvements that could be made:
 1- Allow the depth of the aggregation to be configurable by the user
 2- If not specified by the user, the depth of the treeAggregate can be 
computed with the heuristic in the ticket description.

 

I'm not sure about the value of 2- without integrating 1-: if a framework uses 
a heuristic to determine a parameter without giving the user the option of 
setting the parameter himself, it might be annoying...

As for testing, It worked well for my use case (heuristic yields depth 4 in my 
case). I'll try testing at scale with spark 2.3.3 but I can't guarantee I'll be 
able to, as the cluster I'm working on has limited spark profiles supported. 
Will try though :)


was (Author: gagafunctor):
Basically I see two improvements that could be made:
1- Allow the depth of the aggregation to be configurable by the user
2- If not specified by the user, the depth of the treeAggregate can be computed 
with the heuristic in the ticket description.

 

I'm not sure about the value of 2- without integrating 1-: if a framework uses 
a heuristic to determine a parameter without giving the user the option of 
setting the parameter himself, it might be annoying...

> Scaling issue with Gramian computation for RowMatrix: too many results sent 
> to driver
> -
>
> Key: SPARK-26881
> URL: https://issues.apache.org/jira/browse/SPARK-26881
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Rafael RENAUDIN-AVINO
>Priority: Minor
>
> This issue hit me when running PCA on large dataset (~1Billion rows, ~30k 
> columns).
> Computing Gramian of a big RowMatrix allows to reproduce the issue.
>  
> The problem arises in the treeAggregate phase of the gramian matrix 
> computation: results sent to driver are enormous.
> A potential solution to this could be to replace the hard coded depth (2) of 
> the tree aggregation by a heuristic computed based on the number of 
> partitions, driver max result size, and memory size of the dense vectors that 
> are being aggregated, cf below for more detail:
> (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size
> I have a potential fix ready (currently testing it at scale), but I'd like to 
> hear the community opinion about such a fix to know if it's worth investing 
> my time into a clean pull request.
>  
> Note that I only faced this issue with Spark 2.2, but I suspect it affects 
> later versions as well. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24783) spark.sql.shuffle.partitions=0 should throw exception

2019-02-18 Thread langjingxiang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771141#comment-16771141
 ] 

langjingxiang commented on SPARK-24783:
---

Yes, you are right. It seems there is no use for 0 or negative values at 
present.

> spark.sql.shuffle.partitions=0 should throw exception
> -
>
> Key: SPARK-24783
> URL: https://issues.apache.org/jira/browse/SPARK-24783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Avi minsky
>Priority: Major
>
> if spark.sql.shuffle.partitions=0 and trying to join tables (not broadcast 
> join)
> *result join will be an empty table.*
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26916) Upgrade to Kafka 2.1.1

2019-02-18 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-26916:
-

 Summary: Upgrade to Kafka 2.1.1
 Key: SPARK-26916
 URL: https://issues.apache.org/jira/browse/SPARK-26916
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


Kafka 2.1.1 vote passes. This issue updates Kafka dependency to 2.1.1 to bring 
the following 42 bug fixes.
- https://issues.apache.org/jira/projects/KAFKA/versions/12344250



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26916) Upgrade to Kafka 2.1.1

2019-02-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26916:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-24417

> Upgrade to Kafka 2.1.1
> --
>
> Key: SPARK-26916
> URL: https://issues.apache.org/jira/browse/SPARK-26916
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Kafka 2.1.1 vote passes. This issue updates Kafka dependency to 2.1.1 to 
> bring the following 42 bug fixes.
> - https://issues.apache.org/jira/projects/KAFKA/versions/12344250



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26916) Upgrade to Kafka 2.1.1

2019-02-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26916:
--
Issue Type: Improvement  (was: Bug)

> Upgrade to Kafka 2.1.1
> --
>
> Key: SPARK-26916
> URL: https://issues.apache.org/jira/browse/SPARK-26916
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Kafka 2.1.1 vote passes. This issue updates Kafka dependency to 2.1.1 to 
> bring the following 42 bug fixes.
> - https://issues.apache.org/jira/projects/KAFKA/versions/12344250



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26916) Upgrade to Kafka 2.1.1

2019-02-18 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771216#comment-16771216
 ] 

Dongjoon Hyun commented on SPARK-26916:
---

I'll make a PR shortly.

> Upgrade to Kafka 2.1.1
> --
>
> Key: SPARK-26916
> URL: https://issues.apache.org/jira/browse/SPARK-26916
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Kafka 2.1.1 vote passes. This issue updates Kafka dependency to 2.1.1 to 
> bring the following 42 bug fixes.
> - https://issues.apache.org/jira/projects/KAFKA/versions/12344250



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26916) Upgrade to Kafka 2.1.1

2019-02-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-26916:
-

Assignee: Dongjoon Hyun

> Upgrade to Kafka 2.1.1
> --
>
> Key: SPARK-26916
> URL: https://issues.apache.org/jira/browse/SPARK-26916
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Kafka 2.1.1 vote passes. This issue updates Kafka dependency to 2.1.1 to 
> bring the following 42 bug fixes.
> - https://issues.apache.org/jira/projects/KAFKA/versions/12344250



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26916) Upgrade to Kafka 2.1.1

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26916:


Assignee: Dongjoon Hyun  (was: Apache Spark)

> Upgrade to Kafka 2.1.1
> --
>
> Key: SPARK-26916
> URL: https://issues.apache.org/jira/browse/SPARK-26916
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Kafka 2.1.1 vote passes. This issue updates Kafka dependency to 2.1.1 to 
> bring the following 42 bug fixes.
> - https://issues.apache.org/jira/projects/KAFKA/versions/12344250



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26916) Upgrade to Kafka 2.1.1

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26916:


Assignee: Apache Spark  (was: Dongjoon Hyun)

> Upgrade to Kafka 2.1.1
> --
>
> Key: SPARK-26916
> URL: https://issues.apache.org/jira/browse/SPARK-26916
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> Kafka 2.1.1 vote passes. This issue updates Kafka dependency to 2.1.1 to 
> bring the following 42 bug fixes.
> - https://issues.apache.org/jira/projects/KAFKA/versions/12344250



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26777) SQL worked in 2.3.2 and fails in 2.4.0

2019-02-18 Thread Ilya Peysakhov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771304#comment-16771304
 ] 

Ilya Peysakhov commented on SPARK-26777:


[~hyukjin.kwon] [~yuri.budilov] [~kabhwan]

Hello folks,

I went ahead and ran this in vanilla Scala Spark 2.4 on Windows, on EMR 5.20 
(Scala, Spark 2.4), and on EMR 5.17 (Scala, Spark 2.3.1), and have replicated 
the issue.

Here is the code:

{code:java}
spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' UNION ALL select '2018-01-04' ,'source4' ").write.save("/latest_dates")
val mydatetable = spark.read.load("/latest_dates")
mydatetable.createOrReplaceTempView("latest_dates")

spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, '2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, '2018-01-01' UNION ALL select 600, '2018-08-30' ").write.partitionBy("date").save("/mypartitioneddata")
val source1 = spark.read.load("/mypartitioneddata")
source1.createOrReplaceTempView("source1")

spark.sql("select max(date), 'source1' as category from source1 where date >= (select latest_date from latest_dates where source='source1') ").show
{code}

The error comes up in vanilla Spark 2.4 on Windows and on EMR 5.20, but not in 
vanilla Spark 2.3.1 or EMR 5.17. This is definitely not an AWS/EMR issue.

Let me know if you need more details. Should I post a new issue?

I did notice that if you just use a subquery on hardcoded values, this does not 
happen; it only happens when you subquery an existing table (like in the method 
above, reading data into a view, or in AWS using the Glue catalog). 

 

> SQL worked in 2.3.2 and fails in 2.4.0
> --
>
> Key: SPARK-26777
> URL: https://issues.apache.org/jira/browse/SPARK-26777
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yuri Budilov
>Priority: Major
>
> Following SQL worked in Spark 2.3.2 and now fails on 2.4.0 (AWS EMR Spark)
>  PySpark call below:
> spark.sql("select partition_year_utc,partition_month_utc,partition_day_utc \
> from datalake_reporting.copy_of_leads_notification \
> where partition_year_utc = (select max(partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification) \
> and partition_month_utc = \
>  (select max(partition_month_utc) from 
> datalake_reporting.copy_of_leads_notification as m \
>  where \
>  m.partition_year_utc = (select max(partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification)) \
>  and partition_day_utc = (select max(d.partition_day_utc) from 
> datalake_reporting.copy_of_leads_notification as d \
>  where d.partition_month_utc = \
>  (select max(m1.partition_month_utc) from 
> datalake_reporting.copy_of_leads_notification as m1 \
>  where m1.partition_year_utc = \
>  (select max(y.partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification as y) \
>  ) \
>  ) \
>  order by 1 desc, 2 desc, 3 desc limit 1 ").show(1,False)
> Error: (no need for data, this is syntax).
> py4j.protocol.Py4JJavaError: An error occurred while calling o1326.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#4495 []
>  
> Note: all 3 columns in query are Partitioned columns - see bottom of the 
> schema)
>  
> Hive EMR AWS Schema is:
>  
> CREATE EXTERNAL TABLE `copy_of_leads_notification`(
> `message.environment.siteorigin` string, `dcpheader.dcploaddateutc` string, 
> `message.id` int, `source.properties._country` string, `message.created` 
> string, `dcpheader.generatedmessageid` string, `message.tags` bigint, 
> `source.properties._enqueuedtimeutc` string, `source.properties._leadtype` 
> string, `message.itemid` string, `message.prospect.postcode` string, 
> `message.prospect.email` string, `message.referenceid` string, 
> `message.item.year` string, `message.identifier` string, 
> `dcpheader.dcploadmonthutc` string, `message.processed` string, 
> `source.properties._tenant` string, `message.item.price` string, 
> `message.subscription.confirmresponse` boolean, `message.itemtype` string, 
> `message.prospect.lastname` string, `message.subscription.insurancequote` 
> boolean, `source.exchangename` string, 
> `message.prospect.identificationnumbers` bigint, 
> `message.environment.ipaddress` string, `dcpheader.dcploaddayutc` string, 
> `source.properties._itemtype` string, `source.properties._requesttype` 
> string, `message.item.make` string, `message.prospect.firstname` string, 
> `message.subscription.survey` boolean, `message.prospect.homephone` string, 
> `message.extendedproperties` bigint, `message.subscription.financequote` 
> boolean, `message.uniqueidentifier` string, `source.properties._i

[jira] [Created] (SPARK-26917) CacheManager blocks while traversing plans

2019-02-18 Thread Dave DeCaprio (JIRA)
Dave DeCaprio created SPARK-26917:
-

 Summary: CacheManager blocks while traversing plans
 Key: SPARK-26917
 URL: https://issues.apache.org/jira/browse/SPARK-26917
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
 Environment: We are running on AWS EMR 5.20, so Spark 2.4.0.
Reporter: Dave DeCaprio


This is related to SPARK-26548 and SPARK-26617.  The CacheManager also holds its 
lock during the recacheByCondition operation, and for large plans evaluating the 
condition can take a long time.
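
A minimal illustration of the general pattern that avoids the blocking; the class and field names here are invented for the sketch and this is not Spark's actual CacheManager code. The idea is to snapshot the cached entries under a read lock, evaluate the expensive condition with no lock held, and take the write lock only for the short mutation:

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock
import scala.collection.mutable

final case class CachedEntry(id: Long, planDescription: String)

class NonBlockingCacheIndex {
  private val lock = new ReentrantReadWriteLock()
  private val cached = mutable.ArrayBuffer.empty[CachedEntry]

  def add(entry: CachedEntry): Unit = {
    lock.writeLock().lock()
    try cached += entry finally lock.writeLock().unlock()
  }

  /** Drops entries matching `condition` without holding any lock while the
   *  (potentially expensive) condition is evaluated against large plans. */
  def dropByCondition(condition: CachedEntry => Boolean): Seq[CachedEntry] = {
    // 1. Cheap snapshot under the read lock.
    lock.readLock().lock()
    val snapshot = try cached.toList finally lock.readLock().unlock()

    // 2. Expensive predicate evaluation, lock-free, so other sessions are not blocked.
    //    Entries added concurrently after the snapshot are simply left for a later pass.
    val matched = snapshot.filter(condition)

    // 3. Short critical section for the mutation only.
    lock.writeLock().lock()
    try cached --= matched finally lock.writeLock().unlock()
    matched
  }
}
{code}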



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26917) CacheManager blocks while traversing plans

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26917:


Assignee: Apache Spark

> CacheManager blocks while traversing plans
> --
>
> Key: SPARK-26917
> URL: https://issues.apache.org/jira/browse/SPARK-26917
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: We are running on AWS EMR 5.20, so Spark 2.4.0.
>Reporter: Dave DeCaprio
>Assignee: Apache Spark
>Priority: Minor
>
> This is related to SPARK-26548 and SPARK-26617.  The CacheManager also holds 
> its lock during the recacheByCondition operation, and for large plans 
> evaluating the condition can take a long time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26917) CacheManager blocks while traversing plans

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26917:


Assignee: (was: Apache Spark)

> CacheManager blocks while traversing plans
> --
>
> Key: SPARK-26917
> URL: https://issues.apache.org/jira/browse/SPARK-26917
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: We are running on AWS EMR 5.20, so Spark 2.4.0.
>Reporter: Dave DeCaprio
>Priority: Minor
>
> This is related to SPARK-26548 and SPARK-26617.  The CacheManager also holds 
> its lock during the recacheByCondition operation, and for large plans 
> evaluating the condition can take a long time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26785) data source v2 API refactor: streaming write

2019-02-18 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-26785.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> data source v2 API refactor: streaming write
> 
>
> Key: SPARK-26785
> URL: https://issues.apache.org/jira/browse/SPARK-26785
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26777) SQL worked in 2.3.2 and fails in 2.4.0

2019-02-18 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771486#comment-16771486
 ] 

Hyukjin Kwon commented on SPARK-26777:
--

Please open a separate issue if you're not very sure if it's the same issue or 
not.

> SQL worked in 2.3.2 and fails in 2.4.0
> --
>
> Key: SPARK-26777
> URL: https://issues.apache.org/jira/browse/SPARK-26777
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yuri Budilov
>Priority: Major
>
> Following SQL worked in Spark 2.3.2 and now fails on 2.4.0 (AWS EMR Spark)
>  PySpark call below:
> spark.sql("select partition_year_utc,partition_month_utc,partition_day_utc \
> from datalake_reporting.copy_of_leads_notification \
> where partition_year_utc = (select max(partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification) \
> and partition_month_utc = \
>  (select max(partition_month_utc) from 
> datalake_reporting.copy_of_leads_notification as m \
>  where \
>  m.partition_year_utc = (select max(partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification)) \
>  and partition_day_utc = (select max(d.partition_day_utc) from 
> datalake_reporting.copy_of_leads_notification as d \
>  where d.partition_month_utc = \
>  (select max(m1.partition_month_utc) from 
> datalake_reporting.copy_of_leads_notification as m1 \
>  where m1.partition_year_utc = \
>  (select max(y.partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification as y) \
>  ) \
>  ) \
>  order by 1 desc, 2 desc, 3 desc limit 1 ").show(1,False)
> Error: (no need for data, this is syntax).
> py4j.protocol.Py4JJavaError: An error occurred while calling o1326.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#4495 []
>  
> Note: all 3 columns in query are Partitioned columns - see bottom of the 
> schema)
>  
> Hive EMR AWS Schema is:
>  
> CREATE EXTERNAL TABLE `copy_of_leads_notification`(
> `message.environment.siteorigin` string, `dcpheader.dcploaddateutc` string, 
> `message.id` int, `source.properties._country` string, `message.created` 
> string, `dcpheader.generatedmessageid` string, `message.tags` bigint, 
> `source.properties._enqueuedtimeutc` string, `source.properties._leadtype` 
> string, `message.itemid` string, `message.prospect.postcode` string, 
> `message.prospect.email` string, `message.referenceid` string, 
> `message.item.year` string, `message.identifier` string, 
> `dcpheader.dcploadmonthutc` string, `message.processed` string, 
> `source.properties._tenant` string, `message.item.price` string, 
> `message.subscription.confirmresponse` boolean, `message.itemtype` string, 
> `message.prospect.lastname` string, `message.subscription.insurancequote` 
> boolean, `source.exchangename` string, 
> `message.prospect.identificationnumbers` bigint, 
> `message.environment.ipaddress` string, `dcpheader.dcploaddayutc` string, 
> `source.properties._itemtype` string, `source.properties._requesttype` 
> string, `message.item.make` string, `message.prospect.firstname` string, 
> `message.subscription.survey` boolean, `message.prospect.homephone` string, 
> `message.extendedproperties` bigint, `message.subscription.financequote` 
> boolean, `message.uniqueidentifier` string, `source.properties._id` string, 
> `dcpheader.sourcemessageguid` string, `message.requesttype` string, 
> `source.routingkey` string, `message.service` string, `message.item.model` 
> string, `message.environment.pagesource` string, `source.source` string, 
> `message.sellerid` string, `partition_date_utc` string, 
> `message.selleridentifier` string, `message.subscription.newsletter` boolean, 
> `dcpheader.dcploadyearutc` string, `message.leadtype` string, 
> `message.history` bigint, `message.callconnect.calloutcome` string, 
> `message.callconnect.datecreatedutc` string, 
> `message.callconnect.callrecordingurl` string, 
> `message.callconnect.transferoutcome` string, 
> `message.callconnect.hiderecording` boolean, 
> `message.callconnect.callstartutc` string, `message.callconnect.code` string, 
> `message.callconnect.callduration` string, `message.fraudnetinfo` string, 
> `message.callconnect.answernumber` string, `message.environment.sourcedevice` 
> string, `message.comments` string, `message.fraudinfo.servervariables` 
> bigint, `message.callconnect.servicenumber` string, 
> `message.callconnect.callid` string, `message.callconnect.voicemailurl` 
> string, `message.item.stocknumber` string, 
> `message.callconnect.answerduration` string, `message.callconnect.callendutc` 
> string, `message.item.series` string, `message.item.detailsurl` string, 
> `message.item.pricetype` string, `message.item.description` string, 
> `message.item.colour` string,

[jira] [Commented] (SPARK-26858) Vectorized gapplyCollect, Arrow optimization in native R function execution

2019-02-18 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771494#comment-16771494
 ] 

Felix Cheung commented on SPARK-26858:
--

If I understand correctly, this is the case where Spark doesn't actually care 
much about the schema, but it sounds like Arrow does.

Could we infer the schema from the R data.frame? Is there an equivalent step in 
the Python pandas-to-Arrow path?

> Vectorized gapplyCollect, Arrow optimization in native R function execution
> ---
>
> Key: SPARK-26858
> URL: https://issues.apache.org/jira/browse/SPARK-26858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Unlike gapply, gapplyCollect requires additional ser/de steps because it can 
> omit the schema, and Spark SQL doesn't know the return type before actually 
> execution happens.
> In original code path, it's done via using binary schema. Once gapply is done 
> (SPARK-26761). we can mimic this approach in vectorized gapply to support 
> gapplyCollect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26910) Re-release SparkR to CRAN

2019-02-18 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-26910:


Assignee: Felix Cheung

> Re-release SparkR to CRAN
> -
>
> Key: SPARK-26910
> URL: https://issues.apache.org/jira/browse/SPARK-26910
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Michael Chirico
>Assignee: Felix Cheung
>Priority: Major
>
> The logical successor to https://issues.apache.org/jira/browse/SPARK-15799
> I don't see anything specifically tracking re-release in the Jira list. It 
> would be helpful to have an issue tracking this to refer to as an outsider, 
> as well as to document what the blockers are in case some outside help could 
> be useful.
>  * Is there a plan to re-release SparkR to CRAN?
>  * What are the major blockers to doing so at the moment?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26910) Re-release SparkR to CRAN

2019-02-18 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771498#comment-16771498
 ] 

Felix Cheung commented on SPARK-26910:
--

once that works we should look into 2.4.1

> Re-release SparkR to CRAN
> -
>
> Key: SPARK-26910
> URL: https://issues.apache.org/jira/browse/SPARK-26910
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Michael Chirico
>Assignee: Felix Cheung
>Priority: Major
>
> The logical successor to https://issues.apache.org/jira/browse/SPARK-15799
> I don't see anything specifically tracking re-release in the Jira list. It 
> would be helpful to have an issue tracking this to refer to as an outsider, 
> as well as to document what the blockers are in case some outside help could 
> be useful.
>  * Is there a plan to re-release SparkR to CRAN?
>  * What are the major blockers to doing so at the moment?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26910) Re-release SparkR to CRAN

2019-02-18 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771496#comment-16771496
 ] 

Felix Cheung commented on SPARK-26910:
--

2.3.3 has been submitted to CRAN. we are currently waiting for test result.

> Re-release SparkR to CRAN
> -
>
> Key: SPARK-26910
> URL: https://issues.apache.org/jira/browse/SPARK-26910
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Michael Chirico
>Priority: Major
>
> The logical successor to https://issues.apache.org/jira/browse/SPARK-15799
> I don't see anything specifically tracking re-release in the Jira list. It 
> would be helpful to have an issue tracking this to refer to as an outsider, 
> as well as to document what the blockers are in case some outside help could 
> be useful.
>  * Is there a plan to re-release SparkR to CRAN?
>  * What are the major blockers to doing so at the moment?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26918) All .md should have ASF license header

2019-02-18 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-26918:
-
Description: 
per policy, all md files should have the header, like eg. 
[https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]

 or

[https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]

 

currently it does not

[https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 

  was:
per policy, all md files should have the header, like eg. 
[https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]

 

or

 

[https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]

 

currently it does not

[https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 


> All .md should have ASF license header
> --
>
> Key: SPARK-26918
> URL: https://issues.apache.org/jira/browse/SPARK-26918
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Priority: Major
>
> Per policy, all .md files should have the ASF license header; see e.g. 
> [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]
>  or
> [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]
>  
> Currently this one does not:
> [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26918) All .md should have ASF license header

2019-02-18 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-26918:
-
Description: 
per policy, all md files should have the header, like eg. 
[https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]

 

or

 

[https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]

 

currently it does not

[https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 

  was:
per policy, all md files should have the header, like eg. 
[https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]

currently it does not

[https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 


> All .md should have ASF license header
> --
>
> Key: SPARK-26918
> URL: https://issues.apache.org/jira/browse/SPARK-26918
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Priority: Major
>
> Per policy, all .md files should have the ASF license header; see e.g. 
> [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]
>  
> or
>  
> [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]
>  
> Currently this one does not:
> [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26918) All .md should have ASF license header

2019-02-18 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-26918:


 Summary: All .md should have ASF license header
 Key: SPARK-26918
 URL: https://issues.apache.org/jira/browse/SPARK-26918
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.4.0, 3.0.0
Reporter: Felix Cheung


Per policy, all .md files should have the ASF license header; see e.g. 
[https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]

Currently this one does not:

[https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26858) Vectorized gapplyCollect, Arrow optimization in native R function execution

2019-02-18 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771507#comment-16771507
 ] 

Hyukjin Kwon commented on SPARK-26858:
--

{quote}
If I understand, this is the case where Spark actually doesn't care much about 
the schema but sounds like Arrow does.
{quote}

Yes, correct. Spark uses the {{binary}} type as a container to ship the R data frame, 
but Arrow requires a {{Schema}} to be set.

{quote}
Could we infer the schema from R data.frame?
{quote}

That's possible. I think this is virtually the same as getting the schema 
from the first Arrow batch. The problem is that we don't know the output's schema 
before actually executing the given R native function.
So the things to deal with in this approach are:

1. Somehow execute the R native function only once, and get the schema from the 
result (R data.frame or Arrow batch)
2. Because the current Arrow code path in Apache Spark always uses Spark's 
schema up front, it needs some code to extract the schema from the first batch 
(or R data.frame)

I tried this approach and it looked pretty hacky. It needs to send the schema back 
from the executor to the driver (because the R native function is executed on the 
executor side).

{quote}
Is there an equivalent way for Python Pandas to Arrow?
{quote}

One way on the Python side is the {{pyarrow.Table.from_batches}} API, which allows 
creating an Arrow table from batches directly, meaning from:

{code}
|-|
| Arrow Batch |
|-|
| Arrow Batch |
|-|
   ...
{code}

In this case, we don't need to know the {{Schema}}, but the R side does not seem to 
have this API in Arrow. If the R side had this API, the workaround might be possible.
However, the protocol would still be different since basically we're not able to 
send a complete Arrow streaming format:

{code}
| Schema  |
|-|
| Arrow Batch |
|-|
| Arrow Batch |
|-|
   ...
{code}

If I remember correctly, the Python side in PySpark always complies with this format, 
and the R side in SparkR so far complies with this format too.
In order to skip the {{Schema}}, a different API usage is needed even if an API like 
{{pyarrow.Table.from_batches}} becomes available in the R Arrow API.

([~bryanc], correct me if I am wrong at any point ..)
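
A minimal standalone pyarrow sketch of the {{pyarrow.Table.from_batches}} call mentioned above (plain pyarrow only, not the SparkR/PySpark serialization path itself; the column names are made up for illustration):

{code:python}
import pyarrow as pa

# Each record batch carries its own schema, derived from the arrays it is built from.
batch1 = pa.RecordBatch.from_arrays(
    [pa.array([1, 2]), pa.array(["a", "b"])], names=["id", "name"])
batch2 = pa.RecordBatch.from_arrays(
    [pa.array([3]), pa.array(["c"])], names=["id", "name"])

# Assemble a Table directly from the batches; no separate Schema message has to be
# known up front, it is simply taken from the first batch.
table = pa.Table.from_batches([batch1, batch2])
print(table.schema)
print(table.num_rows)  # 3
{code}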

> Vectorized gapplyCollect, Arrow optimization in native R function execution
> ---
>
> Key: SPARK-26858
> URL: https://issues.apache.org/jira/browse/SPARK-26858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Unlike gapply, gapplyCollect requires additional ser/de steps because it can 
> omit the schema, and Spark SQL doesn't know the return type before actually 
> execution happens.
> In original code path, it's done via using binary schema. Once gapply is done 
> (SPARK-26761). we can mimic this approach in vectorized gapply to support 
> gapplyCollect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26858) Vectorized gapplyCollect, Arrow optimization in native R function execution

2019-02-18 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771507#comment-16771507
 ] 

Hyukjin Kwon edited comment on SPARK-26858 at 2/19/19 2:29 AM:
---

{quote}
If I understand, this is the case where Spark actually doesn't care much about 
the schema but sounds like Arrow does.
{quote}

Yes, correct. Spark uses the {{binary}} type as a container to ship the R data frame, 
but Arrow requires a {{Schema}} to be set.

{quote}
Could we infer the schema from R data.frame?
{quote}

That's possible. I think this is virtually the same as getting the schema 
from the first Arrow batch. The problem is that we don't know the output's schema 
before actually executing the given R native function.
So the things to deal with in this approach are:

1. Somehow execute the R native function only once, and get the schema from the 
result (R data.frame or Arrow batch)
2. Because the current Arrow code path in Apache Spark always uses Spark's 
schema up front, it needs some code to extract the schema from the first batch 
(or R data.frame)

I tried this approach and it looked pretty hacky. It needs to send the schema back 
from the executor to the driver so that it can be used for {{collect}} later during 
{{gapplyCollect}} (because the R native function is executed on the executor side).

{quote}
Is there an equivalent way for Python Pandas to Arrow?
{quote}

One way on the Python side is the {{pyarrow.Table.from_batches}} API, which allows 
creating an Arrow table from batches directly, meaning from:

{code}
|-|
| Arrow Batch |
|-|
| Arrow Batch |
|-|
   ...
{code}

In this case, we don't need to know the {{Schema}}, but the R side does not seem to 
have this API in Arrow. If the R side had this API, the workaround might be possible.
However, the protocol would still be different since basically we're not able to 
send a complete Arrow streaming format:

{code}
| Schema  |
|-|
| Arrow Batch |
|-|
| Arrow Batch |
|-|
   ...
{code}

If I remember correctly, the Python side in PySpark always complies with this format, 
and the R side in SparkR so far complies with this format too.
In order to skip the {{Schema}}, a different API usage is needed even if an API like 
{{pyarrow.Table.from_batches}} becomes available in the R Arrow API.

([~bryanc], correct me if I am wrong at any point ..)


was (Author: hyukjin.kwon):
{quote}
If I understand, this is the case where Spark actually doesn't care much about 
the schema but sounds like Arrow does.
{quote}

Yes, correct. Spark uses {{binary}} type as a container to ship R data frame 
but Arrow requires to set {{Schema}}.

{quote}
Could we infer the schema from R data.frame?
{quote}

That's possible. I think this is virtually similar way of getting the schema 
from the first Arrow batch. The problem is, we don't know the output's schema 
before actually executing the given R native function.
So .. the things to deal with to use this approach:

1. Somehow execute R native function only once, and somehow get the schema from 
(R data frame or Arrow batch)
2. Because the current Arrow code path in Apache spark always uses Spark's 
schema ahead, it needs some codes to extract the schema from the first batch 
(or R data.frame)

I tried this approach and looked pretty hacky. It needs to send back the schema 
from executor to driver (because the R native function is executed at executor 
side).

{quote}
Is there an equivalent way for Python Pandas to Arrow?
{quote}

One way in Python side is {{pyarrow.Table.from_batches}} API which allows to 
create Arrow table from batches directly, meaning from:

{code}
|-|
| Arrow Batch |
|-|
| Arrow Batch |
|-|
   ...
{code}

In this case, we don't need to know the {{Schema}} but R side seems not having 
this API in Arrow. If R side has this API, the workaround might be possible.
However, the protocol will still be different since basically we're not able to 
send a complete Arrow streaming format:

{code}
| Schema  |
|-|
| Arrow Batch |
|-|
| Arrow Batch |
|-|
   ...
{code}

If I remember correctly, Python side in PySpark always complies this format. R 
side in SparkR so far complies this format too.
In order to skip {{Schema}}, it needs a different API usage even if an API like 
{{pyarrow.Table.from_batches}} is possible in R Arrow API.

([~bryanc], correct me if I am wrong at any point ..)

> Vectorized gapplyCollect, Arrow optimization in native R function execution
> ---
>
> Key: SPARK-26858
> URL: https://issues.apache.org/jira/browse/SPARK-26858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> 

[jira] [Updated] (SPARK-26919) change maven default compile java home

2019-02-18 Thread daile (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

daile updated SPARK-26919:
--
Attachment: p1.png

> change maven default compile java home
> --
>
> Key: SPARK-26919
> URL: https://issues.apache.org/jira/browse/SPARK-26919
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.1
>Reporter: daile
>Priority: Critical
> Attachments: p1.png
>
>
> When I use "build/mvn -DskipTests clean package", the default java home 
> configuration is "${java.home}". I tried this on macOS and Windows and found 
> that the default java.home is */jre, but the JRE environment does not have the 
> javac compile command. So I think it can be replaced with the system 
> environment variable, and with that change the build compiles successfully.
> !image-2019-02-19-10-25-02-872.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26919) change maven default compile java home

2019-02-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771514#comment-16771514
 ] 

Apache Spark commented on SPARK-26919:
--

User 'MrDLontheway' has created a pull request for this issue:
https://github.com/apache/spark/pull/23834

> change maven default compile java home
> --
>
> Key: SPARK-26919
> URL: https://issues.apache.org/jira/browse/SPARK-26919
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.1
>Reporter: daile
>Priority: Critical
> Attachments: p1.png
>
>
> When I use "build/mvn -DskipTests clean package", the default java home 
> configuration is "${java.home}". I tried this on macOS and Windows and found 
> that the default java.home is */jre, but the JRE environment does not have the 
> javac compile command. So I think it can be replaced with the system 
> environment variable, and with that change the build compiles successfully.
> !image-2019-02-19-10-25-02-872.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26919) change maven default compile java home

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26919:


Assignee: (was: Apache Spark)

> change maven default compile java home
> --
>
> Key: SPARK-26919
> URL: https://issues.apache.org/jira/browse/SPARK-26919
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.1
>Reporter: daile
>Priority: Critical
> Attachments: p1.png
>
>
> When I use "build/mvn -DskipTests clean package", the default java home 
> configuration is "${java.home}". I tried this on macOS and Windows and found 
> that the default java.home is */jre, but the JRE environment does not have the 
> javac compile command. So I think it can be replaced with the system 
> environment variable, and with that change the build compiles successfully.
> !image-2019-02-19-10-25-02-872.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26919) change maven default compile java home

2019-02-18 Thread daile (JIRA)
daile created SPARK-26919:
-

 Summary: change maven default compile java home
 Key: SPARK-26919
 URL: https://issues.apache.org/jira/browse/SPARK-26919
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.1
Reporter: daile
 Attachments: p1.png

When I use "build/mvn -DskipTests clean package", the default java home 
configuration is "${java.home}". I tried this on macOS and Windows and found 
that the default java.home is */jre, but the JRE environment does not have the 
javac compile command. So I think it can be replaced with the system 
environment variable, and with that change the build compiles successfully.

!image-2019-02-19-10-25-02-872.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26919) change maven default compile java home

2019-02-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771515#comment-16771515
 ] 

Apache Spark commented on SPARK-26919:
--

User 'MrDLontheway' has created a pull request for this issue:
https://github.com/apache/spark/pull/23834

> change maven default compile java home
> --
>
> Key: SPARK-26919
> URL: https://issues.apache.org/jira/browse/SPARK-26919
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.1
>Reporter: daile
>Priority: Critical
> Attachments: p1.png
>
>
> When I use "build/mvn -DskipTests clean package", the default java home 
> configuration is "${java.home}". I tried this on macOS and Windows and found 
> that the default java.home is */jre, but the JRE environment does not have the 
> javac compile command. So I think it can be replaced with the system 
> environment variable, and with that change the build compiles successfully.
> !image-2019-02-19-10-25-02-872.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26919) change maven default compile java home

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26919:


Assignee: Apache Spark

> change maven default compile java home
> --
>
> Key: SPARK-26919
> URL: https://issues.apache.org/jira/browse/SPARK-26919
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.1
>Reporter: daile
>Assignee: Apache Spark
>Priority: Critical
> Attachments: p1.png
>
>
> When I use "build/mvn -DskipTests clean package", the default java home 
> configuration is "${java.home}". I tried this on macOS and Windows and found 
> that the default java.home is */jre, but the JRE environment does not have the 
> javac compile command. So I think it can be replaced with the system 
> environment variable, and with that change the build compiles successfully.
> !image-2019-02-19-10-25-02-872.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26920) Deduplicate type checking across Arrow optimization and vectorized APIs in SparkR

2019-02-18 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-26920:


 Summary: Deduplicate type checking across Arrow optimization and 
vectorized APIs in SparkR
 Key: SPARK-26920
 URL: https://issues.apache.org/jira/browse/SPARK-26920
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


There is duplication in type checking across the Arrow <> SparkR code paths. For 
instance,

https://github.com/apache/spark/blob/8126d09fb5b969c1e293f1f8c41bec35357f74b5/R/pkg/R/group.R#L229-L253

Struct type and map type should also be restricted.

We should pull the type checking out into a separate function and add deduplicated 
tests separately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26920) Deduplicate type checking across Arrow optimization and vectorized APIs in SparkR

2019-02-18 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26920:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-26759

> Deduplicate type checking across Arrow optimization and vectorized APIs in 
> SparkR
> -
>
> Key: SPARK-26920
> URL: https://issues.apache.org/jira/browse/SPARK-26920
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> There is duplication in type checking across the Arrow <> SparkR code paths. For 
> instance,
> https://github.com/apache/spark/blob/8126d09fb5b969c1e293f1f8c41bec35357f74b5/R/pkg/R/group.R#L229-L253
> Struct type and map type should also be restricted.
> We should pull the type checking out into a separate function and add deduplicated 
> tests separately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26921) Fix CRAN hack as soon as Arrow is available on CRAN

2019-02-18 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-26921:


 Summary: Fix CRAN hack as soon as Arrow is available on CRAN
 Key: SPARK-26921
 URL: https://issues.apache.org/jira/browse/SPARK-26921
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


Arrow optimization was added, but Arrow is not available on CRAN.

So some hacks had to be added on the SparkR side to avoid the CRAN check. For example, 
see 

https://github.com/apache/spark/search?q=requireNamespace1&unscoped_q=requireNamespace1

These should be removed so that the CRAN check can run properly for SparkR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26922) Set socket timeout consistently in Arrow optimization

2019-02-18 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-26922:


 Summary: Set socket timeout consistently in Arrow optimization
 Key: SPARK-26922
 URL: https://issues.apache.org/jira/browse/SPARK-26922
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


For instance, see 

https://github.com/apache/spark/blob/e8982ca7ad94e98d907babf2d6f1068b7cd064c6/R/pkg/R/context.R#L184

It should set the timeout from {{SPARKR_BACKEND_CONNECTION_TIMEOUT}}. Or maybe 
we need another environment variable.

This could be fixed together when the code around there is touched.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24783) spark.sql.shuffle.partitions=0 should throw exception

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24783:


Assignee: (was: Apache Spark)

> spark.sql.shuffle.partitions=0 should throw exception
> -
>
> Key: SPARK-24783
> URL: https://issues.apache.org/jira/browse/SPARK-24783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Avi minsky
>Priority: Major
>
> If spark.sql.shuffle.partitions=0 and you try to join tables (not a broadcast 
> join),
> *the resulting join will be an empty table.*
>  
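
A minimal PySpark sketch of the reported setup (local session; the broadcast threshold is disabled to force a shuffle join; whether the join really comes back empty is exactly the claim of this report):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "0")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  # force a shuffle join

left = spark.range(100).withColumnRenamed("id", "k")
right = spark.range(100).withColumnRenamed("id", "k")

# Expected: 100 matching rows. Per this report, the join silently returns an empty
# result instead of failing fast on the invalid partition count.
print(left.join(right, "k").count())
{code}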



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24783) spark.sql.shuffle.partitions=0 should throw exception

2019-02-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24783:


Assignee: Apache Spark

> spark.sql.shuffle.partitions=0 should throw exception
> -
>
> Key: SPARK-24783
> URL: https://issues.apache.org/jira/browse/SPARK-24783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Avi minsky
>Assignee: Apache Spark
>Priority: Major
>
> If spark.sql.shuffle.partitions=0 and you try to join tables (not a broadcast 
> join),
> *the resulting join will be an empty table.*
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26923) Refactor ArrowRRunner and RRunner to deduplicate codes

2019-02-18 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-26923:


 Summary: Refactor ArrowRRunner and RRunner to deduplicate codes
 Key: SPARK-26923
 URL: https://issues.apache.org/jira/browse/SPARK-26923
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR, SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


ArrowRRunner and RRunner already have duplicated code. We should refactor and 
deduplicate them. Also, ArrowRRunner happens to have some rather hacky code (see 
https://github.com/apache/spark/pull/23787/files#diff-a0b6a11cc2e2299455c795fe3c96b823R61
).

We might even be able to deduplicate some code with the PythonRunners.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26924) Document Arrow optimization and vectorized R APIs

2019-02-18 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-26924:


 Summary: Document Arrow optimization and vectorized R APIs
 Key: SPARK-26924
 URL: https://issues.apache.org/jira/browse/SPARK-26924
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR, SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


We should update the SparkR guide documentation, and some related documents and 
comments (like those in {{SQLConf.scala}}), once most of the tasks are finished.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26859) Fix field writer index bug in non-vectorized ORC deserializer

2019-02-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26859:
--
Summary: Fix field writer index bug in non-vectorized ORC deserializer  
(was: Reading ORC files with explicit schema can result in wrong data)

> Fix field writer index bug in non-vectorized ORC deserializer
> -
>
> Key: SPARK-26859
> URL: https://issues.apache.org/jira/browse/SPARK-26859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ivan Vergiliev
>Priority: Major
>  Labels: correctness
>
> There is a bug in the ORC deserialization code that, when triggered, results 
> in completely wrong data being read. I've marked this as a Blocker as per the 
> docs in https://spark.apache.org/contributing.html as it's a data correctness 
> issue.
> The bug is triggered when the following set of conditions are all met:
> - the non-vectorized ORC reader is being used;
> - a schema is explicitly specified when reading the ORC file
> - the provided schema has columns not present in the ORC file, and these 
> columns are in the middle of the schema
> - the ORC file being read contains null values in the columns after the ones 
> added by the schema.
> When all of these are met:
> - the internal state of the ORC deserializer gets messed up, and, as a result
> - the null values from the ORC file end up being set on wrong columns, not 
> the one they're in, and
> - the old values from the null columns don't get cleared from the previous 
> record.
> Here's a concrete example. Let's consider the following DataFrame:
> {code:scala}
> val rdd = sparkContext.parallelize(Seq((1, 2, "abc"), (4, 5, "def"), 
> (8, 9, null)))
> val df = rdd.toDF("col1", "col2", "col3")
> {code}
> and the following schema:
> {code:scala}
> col1 int, col4 int, col2 int, col3 string
> {code}
> Notice the `col4 int` added in the middle that doesn't exist in the dataframe.
> Saving this dataframe to ORC and then reading it back with the specified 
> schema should result in reading the same values, with nulls for `col4`. 
> Instead, we get the following back:
> {code:java}
> [1,null,2,abc]
> [4,null,5,def]
> [8,null,null,def]
> {code}
> Notice how the `def` from the second record doesn't get properly cleared and 
> ends up in the third record as well; also, instead of `col2 = 9` in the last 
> record as expected, we get the null that should've been in column 3 instead.
> *Impact*
> When this issue is triggered, it results in completely wrong results being 
> read from the ORC file. The set of conditions under which it gets triggered 
> is somewhat narrow so the set of affected users is probably limited. There 
> are possibly also people that are affected but haven't realized it because 
> the conditions are so obscure.
> *Bug details*
> The issue is caused by calling `setNullAt` with a wrong index in 
> `OrcDeserializer.scala:deserialize()`. I have a fix that I'll send out for 
> review shortly.
> *Workaround*
> This bug is currently only triggered when new columns are added to the middle 
> of the schema. This means that it can be worked around by only adding new 
> columns at the end.
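
A rough PySpark transcription of the Scala repro and the workaround above (the output path is made up, the shown behaviour is the one reported here on affected versions, and per the conditions above the non-vectorized ORC reader has to be in use):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
# Per the report, the bug only affects the non-vectorized ORC reader code path.
spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")

path = "/tmp/spark-26859-repro"  # made-up location
spark.createDataFrame([(1, 2, "abc"), (4, 5, "def"), (8, 9, None)],
                      ["col1", "col2", "col3"]).write.mode("overwrite").orc(path)

# Reported trigger: the extra column sits in the middle of the explicit schema.
spark.read.schema("col1 int, col4 int, col2 int, col3 string").orc(path).show()

# Reported workaround: append the extra column at the end instead.
spark.read.schema("col1 int, col2 int, col3 string, col4 int").orc(path).show()
{code}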



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26740) Statistics for date and timestamp columns depend on system time zone

2019-02-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-26740:

Fix Version/s: 2.4.1

> Statistics for date and timestamp columns depend on system time zone
> 
>
> Key: SPARK-26740
> URL: https://issues.apache.org/jira/browse/SPARK-26740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> While saving statistics for timestamp/date columns, the default time zone is used 
> to convert the internal type (microseconds or days since epoch) to a textual 
> representation. The textual representation doesn't contain a time zone. So, 
> when it is converted back to the internal types (Long for TimestampType or 
> DateType), Timestamp.valueOf and Date.valueOf are used in the conversions. 
> These methods use the current system time zone.
> If the system time zone differs between saving and retrieving the statistics for 
> timestamp/date columns, the restored microseconds/days since epoch will be 
> different.
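
A rough Python analogy of that round trip (the actual Spark code path uses java.sql.Timestamp.valueOf / Date.valueOf, not Python; time.tzset is POSIX-only):

{code:python}
import os
import time

text = "2019-02-18 12:00:00"  # textual form of the statistic, no time zone attached
for tz in ("UTC", "America/Los_Angeles"):
    os.environ["TZ"] = tz
    time.tzset()  # re-read the TZ environment variable (POSIX only)
    # Converting the same text back to seconds since epoch depends on the process
    # time zone, so the recovered value drifts when the zone changes.
    seconds = time.mktime(time.strptime(text, "%Y-%m-%d %H:%M:%S"))
    print(tz, int(seconds))
{code}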



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26909) use unsafeRow.hashCode() as hash value in HashAggregate

2019-02-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26909:
---

Assignee: yucai

> use unsafeRow.hashCode() as hash value in HashAggregate
> ---
>
> Key: SPARK-26909
> URL: https://issues.apache.org/jira/browse/SPARK-26909
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: yucai
>Assignee: yucai
>Priority: Major
>
> This is a followup PR for #21149.
> The new way uses unsafeRow.hashCode() as the hash value in HashAggregate.
> The unsafe row has the [null bit set] etc., so the result should be different and 
> we don't need the weird `48`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26909) use unsafeRow.hashCode() as hash value in HashAggregate

2019-02-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26909.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23821
[https://github.com/apache/spark/pull/23821]

> use unsafeRow.hashCode() as hash value in HashAggregate
> ---
>
> Key: SPARK-26909
> URL: https://issues.apache.org/jira/browse/SPARK-26909
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: yucai
>Assignee: yucai
>Priority: Major
> Fix For: 3.0.0
>
>
> This is a followup PR for #21149.
> The new way uses unsafeRow.hashCode() as the hash value in HashAggregate.
> The unsafe row has the [null bit set] etc., so the result should be different and 
> we don't need the weird `48`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26854) Support ANY/SOME subquery

2019-02-18 Thread Mingcong Han (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingcong Han updated SPARK-26854:
-
Summary: Support ANY/SOME subquery  (was: Support ANY subquery)

> Support ANY/SOME subquery
> -
>
> Key: SPARK-26854
> URL: https://issues.apache.org/jira/browse/SPARK-26854
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mingcong Han
>Priority: Minor
>
> ANY syntax
> {quote}
> SELECT column(s)
> FROM table
> WHERE column(s) operator ANY
> (SELECT column(s) FROM table WHERE condition);
> {quote}
> An `ANY` subquery can be regarded as a generalization of the `IN` subquery, and the 
> `IN` subquery is the special case of an `ANY` subquery whose operator is "=". 
> The expression evaluates to `true` if the comparison between `column(s)` and 
> any row in the subquery's result set returns `true`.
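
A rough PySpark illustration of these semantics using constructs Spark already supports (ignoring NULL and empty-subquery edge cases; this is not the proposed ANY syntax itself):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
spark.range(10).createOrReplaceTempView("t")
spark.range(3, 6).createOrReplaceTempView("s")

# `id > ANY (SELECT id FROM s)` holds when id exceeds the smallest value in s,
# so today it can be written with an aggregate scalar subquery:
spark.sql("SELECT id FROM t WHERE id > (SELECT MIN(id) FROM s)").show()

# `id = ANY (SELECT id FROM s)` is exactly the existing IN subquery:
spark.sql("SELECT id FROM t WHERE id IN (SELECT id FROM s)").show()
{code}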



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23682) Memory issue with Spark structured streaming

2019-02-18 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771613#comment-16771613
 ] 

Jungtaek Lim commented on SPARK-23682:
--

Ping [~bondyk] to see whether SPARK-24717 resolves this. If so, we can close 
this; otherwise we can track it down further.


> Memory issue with Spark structured streaming
> 
>
> Key: SPARK-23682
> URL: https://issues.apache.org/jira/browse/SPARK-23682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
> Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3
> |spark.blacklist.decommissioning.enabled|true|
> |spark.blacklist.decommissioning.timeout|1h|
> |spark.cleaner.periodicGC.interval|10min|
> |spark.default.parallelism|18|
> |spark.dynamicAllocation.enabled|false|
> |spark.eventLog.enabled|true|
> |spark.executor.cores|3|
> |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'|
> |spark.executor.id|driver|
> |spark.executor.instances|3|
> |spark.executor.memory|22G|
> |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2|
> |spark.hadoop.parquet.enable.summary-metadata|false|
> |spark.hadoop.yarn.timeline-service.enabled|false|
> |spark.jars| |
> |spark.master|yarn|
> |spark.memory.fraction|0.9|
> |spark.memory.storageFraction|0.3|
> |spark.memory.useLegacyMode|false|
> |spark.rdd.compress|true|
> |spark.resourceManager.cleanupExpiredHost|true|
> |spark.scheduler.mode|FIFO|
> |spark.serializer|org.apache.spark.serializer.KryoSerializer|
> |spark.shuffle.service.enabled|true|
> |spark.speculation|false|
> |spark.sql.parquet.filterPushdown|true|
> |spark.sql.parquet.mergeSchema|false|
> |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse|
> |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true|
> |spark.submit.deployMode|client|
> |spark.yarn.am.cores|1|
> |spark.yarn.am.memory|2G|
> |spark.yarn.am.memoryOverhead|1G|
> |spark.yarn.executor.memoryOverhead|3G|
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Memory, memory, memory-leak
> Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot 
> 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen 
> Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, 
> Spark executors GC time.png, image-2018-03-22-14-46-31-960.png, 
> screen_shot_2018-03-20_at_15.23.29.png
>
>
> It seems like there is an issue with memory in Structured Streaming. A stream 
> with aggregation (dropDuplicates()) and data partitioning constantly 
> increases memory usage, and finally executors fail with exit code 137:
> {quote}ExecutorLostFailure (executor 2 exited caused by one of the running 
> tasks) Reason: Container marked as failed: 
> container_1520214726510_0001_01_03 on host: 
> ip-10-0-1-153.us-west-2.compute.internal. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal{quote}
> Stream creation looks something like this:
> {quote}session
> .readStream()
> .schema(inputSchema)
> .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
> .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
> .csv("s3://test-bucket/input")
> .as(Encoders.bean(TestRecord.class))
> .flatMap(mf, Encoders.bean(TestRecord.class))
> .dropDuplicates("testId", "testName")
> .withColumn("year", 
> functions.date_format(dataset.col("testTimestamp").cast(DataTypes.DateType), 
> ""))
> .writeStream()
> .option("path", "s3://test-bucket/output")
> .option("checkpointLocation", "s3://test-bucket/checkpoint")
> .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS))
> .partitionBy("year")
> .format("parquet")
> .outputMode(OutputMode.Append())
> .queryName("test-stream")
> .start();{quote}
> Analyzing the heap dump, I found that most of the memory is used by 
> {{org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider}}, 
> which is referenced from 
> [StateStore|https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L196].
>  
> At first glance this looks normal, since that is how Spark keeps aggregation 
> keys in memory. However, I did my testing by renaming files in the source folder 
> so that they would be picked up by Spark again. Since the input records are the 
> same, all further rows should be rejected as duplicates and memory consumption 
> shouldn't increase, but that is not the case. Moreover, GC time took more than 30

[jira] [Resolved] (SPARK-26886) Proper termination of external processes launched by the worker

2019-02-18 Thread luzengxiang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luzengxiang resolved SPARK-26886.
-
Resolution: Won't Do

> Proper termination of external processes launched by the worker
> ---
>
> Key: SPARK-26886
> URL: https://issues.apache.org/jira/browse/SPARK-26886
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: luzengxiang
>Priority: Minor
>
> When embedding a deep learning framework in Spark, the Spark worker has to launch 
> an external process (e.g. an MPI task) in some cases.
> {code:scala}
> val nothing = inputData.barrier().mapPartitions { _ =>
>   val barrierTask = BarrierTaskContext.get()
>   // save data to disk
>   barrierTask.barrier()
>   // launch external process, e.g. an MPI task + TensorFlow
>   Iterator.empty
> }
> {code}
> The problem is that the external process keeps running when the Spark task is 
> killed manually. This Jira is the place to talk about properly terminating 
> external processes launched by the Spark worker when a Spark task is killed or 
> interrupted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26922) Set socket timeout consistently in Arrow optimization

2019-02-18 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26922:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-26759

> Set socket timeout consistently in Arrow optimization
> -
>
> Key: SPARK-26922
> URL: https://issues.apache.org/jira/browse/SPARK-26922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> For instance, see 
> https://github.com/apache/spark/blob/e8982ca7ad94e98d907babf2d6f1068b7cd064c6/R/pkg/R/context.R#L184
> It should set the timeout from {{SPARKR_BACKEND_CONNECTION_TIMEOUT}}. Or 
> maybe we need another environment variable.
> This could be fixed together when the code around there is 
> touched.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26911) Spark does not see column in table

2019-02-18 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771628#comment-16771628
 ] 

Hyukjin Kwon commented on SPARK-26911:
--

Can you make the reproducer self-runnable and narrow down the problem? This sounds 
more like a request for investigation than an issue report. I am resolving this until 
sufficient information is provided for other people to investigate further. If 
you have a fix, reopen this and make a PR right away.

> Spark does not see column in table
> 
>
> Key: SPARK-26911
> URL: https://issues.apache.org/jira/browse/SPARK-26911
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: PySpark (Spark 2.3.1)
>Reporter: Vitaly Larchenkov
>Priority: Major
>
>  
>  
> Spark cannot find a column that actually exists in the input columns array
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
> columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
> flid.bank_id, flid.wr_id, flid.link_id]; {code}
>  
>  
> {code:java}
> ---
> Py4JJavaError Traceback (most recent call last)
> /usr/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
> /usr/share/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 327 "An error occurred while calling {0}{1}{2}.\n".
> --> 328 format(target_id, ".", name), value)
> 329 else:
> Py4JJavaError: An error occurred while calling o35.sql.
> : org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
> columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
> flid.bank_id, flid.wr_id, flid.link_id]; line 10 pos 98;
> 'Project ['multiples.id, 'multiples.link_id]
> {code}
>  
> Query:
> {code:java}
> q = f"""
> with flid as (
> select * from flow_log_by_id
> )
> select multiples.id, multiples.link_id
> from (select fl.id, fl.link_id
> from (select id from {flow_log_by_id} group by id having count(*) > 1) 
> multiples
> join {flow_log_by_id} fl on fl.id = multiples.id) multiples
> join {level_link} ll
> on multiples.link_id = ll.link_id_old and ll.link_id_new in (select link_id 
> from flid where id = multiples.id)
> """
> flow_subset_test_result = spark.sql(q)
> {code}
>  `with flid` is used because without it Spark does not find the `flow_log_by_id` 
> table, which looks like another issue. In plain SQL it works without problems.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24744) Structured Streaming set SparkSession configuration with the value in the metadata if there is not a option set by user.

2019-02-18 Thread Jungtaek Lim (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-24744.
--
Resolution: Information Provided

We explained in the PR why such a change is restricted, and also added the 
relevant information as a note to the official documentation: 
https://github.com/apache/spark/blob/master/docs/structured-streaming-programming-guide.md#additional-information

> Structured Streaming set SparkSession configuration with the value in the 
> metadata if there is not a option set by user.   
> ---
>
> Key: SPARK-24744
> URL: https://issues.apache.org/jira/browse/SPARK-24744
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: bjkonglu
>Priority: Minor
>
> h3. Background
> When I use Structured Streaming to construct my application, something odd 
> happens: the application always sets the option 
> [spark.sql.shuffle.partitions] to the default value [200]. Even though I set 
> [spark.sql.shuffle.partitions] to another value via SparkConf or --conf 
> spark.sql.shuffle.partitions=100, it doesn't work; the option keeps the 
> default value as before.
> h3. Analysis
> I reviewed the relevant code, which is in 
> [org.apache.spark.sql.execution.streaming.OffsetSeqMetadata].
> {code:scala}
> /** Set the SparkSession configuration with the values in the metadata */
>   def setSessionConf(metadata: OffsetSeqMetadata, sessionConf: 
> RuntimeConfig): Unit = {
> OffsetSeqMetadata.relevantSQLConfs.map(_.key).foreach { confKey =>
>   metadata.conf.get(confKey) match {
> case Some(valueInMetadata) =>
>   // Config value exists in the metadata, update the session config 
> with this value
>   val optionalValueInSession = sessionConf.getOption(confKey)
>   if (optionalValueInSession.isDefined && optionalValueInSession.get 
> != valueInMetadata) {
> logWarning(s"Updating the value of conf '$confKey' in current 
> session from " +
>   s"'${optionalValueInSession.get}' to '$valueInMetadata'.")
>   }
>   sessionConf.set(confKey, valueInMetadata)
> case None =>
>   // For backward compatibility, if a config was not recorded in the 
> offset log,
>   // then log it, and let the existing conf value in SparkSession 
> prevail.
>   logWarning (s"Conf '$confKey' was not found in the offset log, 
> using existing value")
>   }
> }
>   }
> {code}
> In this code, we can see that it always sets these options from the metadata 
> values. But as users, we want those options to be settable by the user.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26907) Does ShuffledRDD Replication Work With External Shuffle Service

2019-02-18 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771630#comment-16771630
 ] 

Hyukjin Kwon commented on SPARK-26907:
--

Questions should go to the mailing list rather than being filed as an issue here. See 
https://spark.apache.org/community.html

> Does ShuffledRDD Replication Work With External Shuffle Service
> ---
>
> Key: SPARK-26907
> URL: https://issues.apache.org/jira/browse/SPARK-26907
> Project: Spark
>  Issue Type: Question
>  Components: Block Manager, YARN
>Affects Versions: 2.3.2
>Reporter: Han Altae-Tran
>Priority: Major
>
> I am interested in working with high replication environments for extreme 
> fault tolerance (e.g. 10x replication), but have noticed that when using 
> groupBy or groupWith followed by persist (with 10x replication), even if one 
> node fails, the entire stage can fail with FetchFailedException.
>  
> Is this because the External Shuffle Service writes and services intermediate 
> shuffle data only to/from the local disk attached to the executor that 
> generated it, causing spark to ignore possible replicated shuffle data (from 
> the persist) that may be serviced elsewhere? If so, is there any way to 
> increase the replication factor of the External Shuffle Service to make it 
> fault tolerant?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26907) Does ShuffledRDD Replication Work With External Shuffle Service

2019-02-18 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26907.
--
Resolution: Invalid

> Does ShuffledRDD Replication Work With External Shuffle Service
> ---
>
> Key: SPARK-26907
> URL: https://issues.apache.org/jira/browse/SPARK-26907
> Project: Spark
>  Issue Type: Question
>  Components: Block Manager, YARN
>Affects Versions: 2.3.2
>Reporter: Han Altae-Tran
>Priority: Major
>
> I am interested in working with high replication environments for extreme 
> fault tolerance (e.g. 10x replication), but have noticed that when using 
> groupBy or groupWith followed by persist (with 10x replication), even if one 
> node fails, the entire stage can fail with FetchFailedException.
>  
> Is this because the External Shuffle Service writes and services intermediate 
> shuffle data only to/from the local disk attached to the executor that 
> generated it, causing spark to ignore possible replicated shuffle data (from 
> the persist) that may be serviced elsewhere? If so, is there any way to 
> increase the replication factor of the External Shuffle Service to make it 
> fault tolerant?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26911) Spark does not see column in table

2019-02-18 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26911.
--
Resolution: Incomplete

> Spark does not see column in table
> 
>
> Key: SPARK-26911
> URL: https://issues.apache.org/jira/browse/SPARK-26911
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: PySpark (Spark 2.3.1)
>Reporter: Vitaly Larchenkov
>Priority: Major
>
>  
>  
> Spark cannot find a column that actually exists in the input columns array
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
> columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
> flid.bank_id, flid.wr_id, flid.link_id]; {code}
>  
>  
> {code:java}
> ---
> Py4JJavaError Traceback (most recent call last)
> /usr/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
> /usr/share/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 327 "An error occurred while calling {0}{1}{2}.\n".
> --> 328 format(target_id, ".", name), value)
> 329 else:
> Py4JJavaError: An error occurred while calling o35.sql.
> : org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
> columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
> flid.bank_id, flid.wr_id, flid.link_id]; line 10 pos 98;
> 'Project ['multiples.id, 'multiples.link_id]
> {code}
>  
> Query:
> {code:java}
> q = f"""
> with flid as (
> select * from flow_log_by_id
> )
> select multiples.id, multiples.link_id
> from (select fl.id, fl.link_id
> from (select id from {flow_log_by_id} group by id having count(*) > 1) 
> multiples
> join {flow_log_by_id} fl on fl.id = multiples.id) multiples
> join {level_link} ll
> on multiples.link_id = ll.link_id_old and ll.link_id_new in (select link_id 
> from flid where id = multiples.id)
> """
> flow_subset_test_result = spark.sql(q)
> {code}
>  `with flid` is used because without it Spark does not find the `flow_log_by_id` 
> table, which looks like another issue. In plain SQL it works without problems.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26911) Spark does not see column in table

2019-02-18 Thread Vitaly Larchenkov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771633#comment-16771633
 ] 

Vitaly Larchenkov commented on SPARK-26911:
---

Yeah, I will do that in a few days.

> Spark does not see column in table
> 
>
> Key: SPARK-26911
> URL: https://issues.apache.org/jira/browse/SPARK-26911
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: PySpark (Spark 2.3.1)
>Reporter: Vitaly Larchenkov
>Priority: Major
>
>  
>  
> Spark cannot find a column that actually exists in the input columns array
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
> columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
> flid.bank_id, flid.wr_id, flid.link_id]; {code}
>  
>  
> {code:java}
> ---
> Py4JJavaError Traceback (most recent call last)
> /usr/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
> /usr/share/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 327 "An error occurred while calling {0}{1}{2}.\n".
> --> 328 format(target_id, ".", name), value)
> 329 else:
> Py4JJavaError: An error occurred while calling o35.sql.
> : org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
> columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
> flid.bank_id, flid.wr_id, flid.link_id]; line 10 pos 98;
> 'Project ['multiples.id, 'multiples.link_id]
> {code}
>  
> Query:
> {code:java}
> q = f"""
> with flid as (
> select * from flow_log_by_id
> )
> select multiples.id, multiples.link_id
> from (select fl.id, fl.link_id
> from (select id from {flow_log_by_id} group by id having count(*) > 1) 
> multiples
> join {flow_log_by_id} fl on fl.id = multiples.id) multiples
> join {level_link} ll
> on multiples.link_id = ll.link_id_old and ll.link_id_new in (select link_id 
> from flid where id = multiples.id)
> """
> flow_subset_test_result = spark.sql(q)
> {code}
>  `with flid` is used because without it Spark does not find the `flow_log_by_id` 
> table, which looks like another issue. In plain SQL it works without problems.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26925) How to get statistics when reading from or writing to another database via DataSourceV2

2019-02-18 Thread webber (JIRA)
webber created SPARK-26925:
--

 Summary: How to get statistics when reading from or writing to 
another database via DataSourceV2
 Key: SPARK-26925
 URL: https://issues.apache.org/jira/browse/SPARK-26925
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 2.4.0
Reporter: webber


How can I get statistics when reading from or writing to another database via 
DataSourceV2? For example, when I read data from RocksDB via DataSourceV2, I 
get a Dataset, but I also need the number of records. If I want to know the count, 
I have to call dataset.count(). In fact, I have already counted the records while 
reading from RocksDB, but I can't feed that information back. Can you help me?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org