[jira] [Updated] (SPARK-24689) java.io.NotSerializableException: org.apache.spark.mllib.clustering.DistributedLDAModel

2018-06-28 Thread konglingbo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

konglingbo updated SPARK-24689:
---
Attachment: @CLZ98635A644[_edx...@e.png

> java.io.NotSerializableException: 
> org.apache.spark.mllib.clustering.DistributedLDAModel
> ---
>
> Key: SPARK-24689
> URL: https://issues.apache.org/jira/browse/SPARK-24689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.1
> Environment: !image-2018-06-29-13-42-30-255.png!
>Reporter: konglingbo
>Priority: Blocker
> Attachments: @CLZ98635A644[_edx...@e.png
>
>
> scala> val predictionAndLabels=testing.map{case LabeledPoint(label,features)=>
>  | val prediction = model.predict(features)
>  | (prediction, label)
>  | }
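
For context, a minimal sketch (not from the reporter) of the failure mode the subject describes: referencing a non-serializable driver-side object such as a DistributedLDAModel inside an RDD closure forces Spark to serialize it with the task, which fails with java.io.NotSerializableException. The class, data and master URL below are illustrative only.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative stand-in for a non-serializable driver-side model
// (DistributedLDAModel in the report); note it does NOT extend Serializable.
class DriverSideModel {
  def predict(x: Double): Double = x * 2.0
}

object NotSerializableRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repro").setMaster("local[2]"))
    val model = new DriverSideModel
    val testing = sc.parallelize(Seq(1.0, 2.0, 3.0))

    // The closure captures `model`, so Spark's closure/task serialization fails
    // with java.io.NotSerializableException, as in the reported snippet.
    val predictionAndLabels = testing.map(x => (model.predict(x), x))
    predictionAndLabels.count()
  }
}
{code}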






[jira] [Updated] (SPARK-24689) java.io.NotSerializableException: org.apache.spark.mllib.clustering.DistributedLDAModel

2018-06-28 Thread konglingbo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

konglingbo updated SPARK-24689:
---
Description: 
scala> val predictionAndLabels=testing.map{case LabeledPoint(label,features)=>
 | val prediction = model.predict(features)
 | (prediction, label)
 | }

  was:!image-2018-06-29-13-41-55-635.png!


> java.io.NotSerializableException: 
> org.apache.spark.mllib.clustering.DistributedLDAModel
> ---
>
> Key: SPARK-24689
> URL: https://issues.apache.org/jira/browse/SPARK-24689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.1
> Environment: !image-2018-06-29-13-42-30-255.png!
>Reporter: konglingbo
>Priority: Blocker
>
> scala> val predictionAndLabels=testing.map{case LabeledPoint(label,features)=>
>  | val prediction = model.predict(features)
>  | (prediction, label)
>  | }






[jira] [Created] (SPARK-24689) java.io.NotSerializableException: org.apache.spark.mllib.clustering.DistributedLDAModel

2018-06-28 Thread konglingbo (JIRA)
konglingbo created SPARK-24689:
--

 Summary: java.io.NotSerializableException: 
org.apache.spark.mllib.clustering.DistributedLDAModel
 Key: SPARK-24689
 URL: https://issues.apache.org/jira/browse/SPARK-24689
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.3.1
 Environment: !image-2018-06-29-13-42-30-255.png!
Reporter: konglingbo


!image-2018-06-29-13-41-55-635.png!






[jira] [Commented] (SPARK-23636) [SPARK 2.2] | Kafka Consumer | KafkaUtils.createRDD throws Exception - java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access

2018-06-28 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527204#comment-16527204
 ] 

Ted Yu commented on SPARK-23636:


It seems that in KafkaDataConsumer#close:
{code}
 def close(): Unit = consumer.close()
{code}
The code should catch ConcurrentModificationException and retry closing the 
consumer.
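
A minimal sketch of that retry idea (illustrative only; the real KafkaDataConsumer wraps an org.apache.kafka.clients.consumer.KafkaConsumer, and the helper name here is hypothetical):

{code:scala}
import java.util.ConcurrentModificationException

// Sketch of the suggested handling: if close() races with another thread and
// throws ConcurrentModificationException, attempt the close once more.
def closeWithRetry(close: () => Unit, retries: Int = 1): Unit =
  try close()
  catch {
    case _: ConcurrentModificationException if retries > 0 =>
      closeWithRetry(close, retries - 1)
  }

// e.g. closeWithRetry(() => consumer.close())
{code}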

> [SPARK 2.2] | Kafka Consumer | KafkaUtils.createRDD throws Exception - 
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
> -
>
> Key: SPARK-23636
> URL: https://issues.apache.org/jira/browse/SPARK-23636
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Deepak
>Priority: Major
>  Labels: performance
>
> h2. Summary
>  
> While using the KafkaUtils.createRDD API, we receive the error listed below, 
> specifically when 1 executor connects to 1 Kafka topic-partition but with 
> more than 1 core and fetches an Array of OffsetRanges.
>  
> _I've tagged this issue as "Structured Streaming", as I could not find a 
> more appropriate component._ 
>  
> 
> h2. Error Faced
> {noformat}
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access{noformat}
>  Stack Trace
> {noformat}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 5 in stage 1.0 failed 4 times, most recent failure: Lost task 5.3 in 
> stage 1.0 (TID 17, host, executor 16): 
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1629)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1528)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1508)
> at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.close(CachedKafkaConsumer.scala:59)
> at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer$.remove(CachedKafkaConsumer.scala:185)
> at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.<init>(KafkaRDD.scala:204)
> at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:181)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323){noformat}
>  
> 
> h2. Config Used to simulate the error
> A session with : 
>  * Executors - 1
>  * Cores - 2 or More
>  * Kafka Topic - has only 1 partition
>  * While fetching - More than one Array of Offset Range , Example 
> {noformat}
> Array(OffsetRange("kafka_topic",0,608954201,608954202),
> OffsetRange("kafka_topic",0,608954202,608954203)
> ){noformat}
>  
> 
> h2. Was this approach working before?
>  
> This was working in Spark 1.6.2.
> However, from Spark 2.1 onwards the approach throws the exception.
>  
> 
> h2. Why are we fetching from Kafka as mentioned above?
>  
> This gives us the ability to establish a connection to the Kafka broker for 
> every Spark executor core, so that each core can fetch/process its own set of 
> messages based on the specified offset ranges.
>  
>  
> 
> h2. Sample Code
>  
> {quote}scala snippet - on versions spark 2.2.0 or 2.1.0
> // Bunch of imports
> import kafka.serializer.\{DefaultDecoder, StringDecoder}
>  import org.apache.avro.generic.GenericRecord
>  import org.apache.kafka.clients.consumer.ConsumerRecord
>  import org.apache.kafka.common.serialization._
>  import org.apache.spark.rdd.RDD
>  import org.apache.spark.sql.\{DataFrame, Row, SQLContext}
>  import org.apache.spark.sql.Row
>  import org.apache.spark.sql.hive.HiveContext
>  import org.apache.spark.sql.types.\{StringType, StructField, StructType}
>  import org.apache.spark.streaming.kafka010._
>  import org.apache.spark.streaming.kafka010.KafkaUtils._
> {quote}
> {quote}// This forces two connections - from a single executor - to 
> topic-partition .
> // And with 2 cores assigned to 1 executor : each core has a task - pulling 
> respective offsets : OffsetRange("kafka_topic",0,1,2) & 
> OffsetRange("kafka_topic",0,2,3)
> val parallelizedRanges = Array(OffsetRange("kafka_topic",0,1,2), // Fetching 
> sample 2 records 
>  OffsetRange("kafka_topic",0,2,3) // Fetching sample 2 records 
>  )
>  
> // Initiate kafka properties
> val kafkaParams1: java.util.Map[String, Object] = new java.util.HashMap()
> // kafkaParams1.put("key","val") - add all the parameters such as broker and 
> topic; not listing every property here.
>  
> // Create RDD
> val rDDConsumerRec: RDD[ConsumerRecord[String, String]] =
>  createRDD[String, String](sparkContext
>  , kafkaParams1, parallelizedRanges, LocationStrategies.PreferConsistent)
>  

[jira] [Updated] (SPARK-24535) Fix java version parsing in SparkR

2018-06-28 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24535:
-
Target Version/s: 2.3.2

> Fix java version parsing in SparkR
> --
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>Assignee: Felix Cheung
>Priority: Blocker
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log






[jira] [Updated] (SPARK-24535) Fix java version parsing in SparkR

2018-06-28 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24535:
-
Priority: Blocker  (was: Major)

> Fix java version parsing in SparkR
> --
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>Priority: Blocker
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log






[jira] [Assigned] (SPARK-24535) Fix java version parsing in SparkR

2018-06-28 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-24535:


Assignee: Felix Cheung

> Fix java version parsing in SparkR
> --
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>Assignee: Felix Cheung
>Priority: Blocker
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log






[jira] [Assigned] (SPARK-24688) Comments of the example code have some typos

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24688:


Assignee: Apache Spark

> Comments of the example code have some typos
> 
>
> Key: SPARK-24688
> URL: https://issues.apache.org/jira/browse/SPARK-24688
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.3.1
>Reporter: Weizhe Huang
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Assigned] (SPARK-24688) Comments of the example code have some typos

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24688:


Assignee: (was: Apache Spark)

> Comments of the example code have some typos
> 
>
> Key: SPARK-24688
> URL: https://issues.apache.org/jira/browse/SPARK-24688
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.3.1
>Reporter: Weizhe Huang
>Priority: Minor
>







[jira] [Issue Comment Deleted] (SPARK-24144) monotonically_increasing_id on streaming dataFrames

2018-06-28 Thread Richard Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Yu updated SPARK-24144:
---
Comment: was deleted

(was: So do you propose to send the information regarding 
monotonically_increasing_id to checkpoint data storage which could later be 
retrieved?)

> monotonically_increasing_id on streaming dataFrames
> ---
>
> Key: SPARK-24144
> URL: https://issues.apache.org/jira/browse/SPARK-24144
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Hemant Bhanawat
>Priority: Major
>
> For our use case, we want to assign snapshot ids (incrementing counters) to 
> the incoming records. In case of failures, the same record should get the 
> same id after failure so that the downstream DB can handle the records in a 
> correct manner. 
> We were trying to do this by zipping the streaming rdds with that counter 
> using a modified version of ZippedWithIndexRDD. There are other ways to do 
> that but it turns out all ways are cumbersome and error prone in failure 
> scenarios.
> As suggested on the spark user dev list, one way to do this would be to 
> support monotonically_increasing_id on streaming dataFrames in Spark code 
> base. This would ensure that counters are incrementing for the records of the 
> stream. Also, since the counter can be checkpointed, it would work well in 
> case of failure scenarios. Last but not least, doing this in Spark would 
> be the most performance-efficient way.
>  






[jira] [Commented] (SPARK-24688) Comments of the example code have some typos

2018-06-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527134#comment-16527134
 ] 

Apache Spark commented on SPARK-24688:
--

User 'uzmijnlm' has created a pull request for this issue:
https://github.com/apache/spark/pull/21665

> Comments of the example code have some typos
> 
>
> Key: SPARK-24688
> URL: https://issues.apache.org/jira/browse/SPARK-24688
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.3.1
>Reporter: Weizhe Huang
>Priority: Minor
>







[jira] [Reopened] (SPARK-24672) No job is running but there are active tasks

2018-06-28 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue reopened SPARK-24672:


Conditions for this issue:

When the amount of data I select is larger than spark.driver.maxResultSize, 
the job returns the info below and fails automatically.

!image4.png!

After that, several active tasks remain and keep occupying executors.

!image2.png!

Thanks a lot for your comment.
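
For reference, a minimal repro sketch (not from the reporter) of the condition described above: collecting a result larger than spark.driver.maxResultSize aborts the job on the driver. The session settings, sizes and master URL are illustrative.

{code:scala}
import org.apache.spark.sql.SparkSession

// Deliberately small spark.driver.maxResultSize so that a large collect()
// fails with "Total size of serialized results ... is bigger than
// spark.driver.maxResultSize".
val spark = SparkSession.builder()
  .appName("maxResultSize-repro")
  .master("local[2]")
  .config("spark.driver.maxResultSize", "10m")
  .getOrCreate()

// Collecting far more than 10 MB of rows to the driver aborts the job.
val rows = spark.range(0, 50000000L).collect()
{code}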

> No job is running but there are active tasks
> 
>
> Key: SPARK-24672
> URL: https://issues.apache.org/jira/browse/SPARK-24672
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: hadoop 2.6.0
> spark 2.2.1
> CDH 5.8.3
> java 1.8.0
>  
> More info :
> image1.png in Attachments
>Reporter: Deegue
>Priority: Major
> Attachments: image1.png, image2.png, image3.png, image4.png
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Issue:
>  
> There are active tasks even though no job is running.
> These active tasks occupy the executors (resources), and I don't know why 
> they haven't been killed or stopped after their jobs failed.
>  
> More info:
> image2.png & image3.png in Attachments
>  
> I'd really appreciate it if anyone could help me...






[jira] [Updated] (SPARK-24672) No job is running but there are active tasks

2018-06-28 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-24672:
---
Attachment: image4.png

> No job is running but there are active tasks
> 
>
> Key: SPARK-24672
> URL: https://issues.apache.org/jira/browse/SPARK-24672
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: hadoop 2.6.0
> spark 2.2.1
> CDH 5.8.3
> java 1.8.0
>  
> More info :
> image1.png in Attachments
>Reporter: Deegue
>Priority: Major
> Attachments: image1.png, image2.png, image3.png, image4.png
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Issue:
>  
> There are active tasks even though no job is running.
> These active tasks occupy the executors (resources), and I don't know why 
> they haven't been killed or stopped after their jobs failed.
>  
> More info:
> image2.png & image3.png in Attachments
>  
> I'd really appreciate it if anyone could help me...






[jira] [Commented] (SPARK-8659) Spark SQL Thrift Server does NOT honour hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory

2018-06-28 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527087#comment-16527087
 ] 

Takeshi Yamamuro commented on SPARK-8659:
-

I don't think Spark supports GRANT/REVOKE at the moment.

> Spark SQL Thrift Server does NOT honour 
> hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
>  
> ---
>
> Key: SPARK-8659
> URL: https://issues.apache.org/jira/browse/SPARK-8659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: Linux
>Reporter: Premchandra Preetham Kukillaya
>Priority: Major
>
> It seems that when pointing a JDBC/ODBC driver at the Spark SQL Thrift Service, 
> Hive's SQL-based authorization feature is not working. It ignores 
> the security settings passed on the command line. The command-line arguments 
> are given below for reference.
> The problem is that user X can select from a table belonging to user Y, even though 
> permission for the table is explicitly defined, and this is a data security risk.
> I am using Hive 0.13.1 and Spark 1.3.1, and here is the list of arguments passed 
> to the Spark SQL Thrift Server.
> ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
> hostname.compute.amazonaws.com --hiveconf 
> hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
>  --hiveconf 
> hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
>  --hiveconf hive.server2.enable.doAs=false --hiveconf 
> hive.security.authorization.enabled=true --hiveconf 
> javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
>  --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver 
> --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf 
> javax.jdo.option.ConnectionPassword=hive






[jira] [Commented] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming

2018-06-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527084#comment-16527084
 ] 

Apache Spark commented on SPARK-24678:
--

User 'caneGuy' has created a pull request for this issue:
https://github.com/apache/spark/pull/21664

> We should use 'PROCESS_LOCAL' first for Spark-Streaming
> ---
>
> Key: SPARK-24678
> URL: https://issues.apache.org/jira/browse/SPARK-24678
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.3.1
>Reporter: sharkd tu
>Priority: Major
>
> Currently, `BlockRDD.getPreferredLocations` only returns the host info of blocks, 
> so the resulting schedule level is never better than 'NODE_LOCAL'. 
> With a small change, the schedule level can be improved to 'PROCESS_LOCAL'.
>  
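
For context, a hedged sketch (not from the pull request) of the idea in the description above: preferred locations given as bare hostnames can only yield NODE_LOCAL, while an executor-qualified location lets the scheduler place the task on the exact executor holding the block and reach PROCESS_LOCAL. The "executor_<host>_<executorId>" string format reflects Spark's TaskLocation convention as I understand it and should be treated as an assumption.

{code:scala}
// Illustrative only: building preferred-location strings for a block.
// A bare hostname ("host-1") allows at best NODE_LOCAL; the
// "executor_<host>_<executorId>" form maps to an executor-cached location
// and can be scheduled PROCESS_LOCAL.
def preferredLocations(host: String, executorId: String): Seq[String] =
  Seq(s"executor_${host}_$executorId", host)

// e.g. preferredLocations("host-1", "3") == Seq("executor_host-1_3", "host-1")
{code}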






[jira] [Created] (SPARK-24688) Comments of the example code have some typos

2018-06-28 Thread Weizhe Huang (JIRA)
Weizhe Huang created SPARK-24688:


 Summary: Comments of the example code have some typos
 Key: SPARK-24688
 URL: https://issues.apache.org/jira/browse/SPARK-24688
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Affects Versions: 2.3.1
Reporter: Weizhe Huang









[jira] [Updated] (SPARK-24687) When NoClassDefError thrown during task serialization will cause job hang

2018-06-28 Thread zhoukang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-24687:
-
Description: 
When the exception below is thrown:

{code:java}
Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: 
Lcom/xxx/data/recommend/aggregator/queue/QueueName;
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2436)
at java.lang.Class.getDeclaredField(Class.java:1946)
at 
java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectOutputStream.writeClass(ObjectOutputStream.java:1212)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1119)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
{code}
The stage will hang forever instead of aborting.
 !hanging-960.png! 
This is because the NoClassDefFoundError is not caught in the code below.
{code}
var taskBinary: Broadcast[Array[Byte]] = null
try {
  // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
  // For ResultTask, serialize and broadcast (rdd, func).
  val taskBinaryBytes: Array[Byte] = stage match {
case stage: ShuffleMapStage =>
  JavaUtils.bufferToArray(
closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
case stage: ResultStage =>
  JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, 
stage.func): AnyRef))
  }

  taskBinary = 

[jira] [Created] (SPARK-24687) When NoClassDefError thrown during task serialization will cause job hang

2018-06-28 Thread zhoukang (JIRA)
zhoukang created SPARK-24687:


 Summary: When NoClassDefError thrown during task serialization 
will cause job hang
 Key: SPARK-24687
 URL: https://issues.apache.org/jira/browse/SPARK-24687
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1, 2.1.0
Reporter: zhoukang
 Attachments: hanging-960.png

When the exception below is thrown:

{code:java}
Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: 
Lcom/xxx/data/recommend/aggregator/queue/QueueName;
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2436)
at java.lang.Class.getDeclaredField(Class.java:1946)
at 
java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectOutputStream.writeClass(ObjectOutputStream.java:1212)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1119)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
{code}
The stage will hang forever instead of aborting.








[jira] [Updated] (SPARK-24687) When NoClassDefError thrown during task serialization will cause job hang

2018-06-28 Thread zhoukang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-24687:
-
Description: 
When the exception below is thrown:

{code:java}
Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: 
Lcom/xxx/data/recommend/aggregator/queue/QueueName;
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2436)
at java.lang.Class.getDeclaredField(Class.java:1946)
at 
java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectOutputStream.writeClass(ObjectOutputStream.java:1212)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1119)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
{code}
The stage will hang forever instead of aborting.
 !hanging-960.png! 


  was:
When below exception thrown:

{code:java}
Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: 
Lcom/xxx/data/recommend/aggregator/queue/QueueName;
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2436)
at java.lang.Class.getDeclaredField(Class.java:1946)
at 
java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
at 

[jira] [Updated] (SPARK-24687) When NoClassDefError thrown during task serialization will cause job hang

2018-06-28 Thread zhoukang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-24687:
-
Attachment: hanging-960.png

> When NoClassDefError thrown during task serialization will cause job hang
> -
>
> Key: SPARK-24687
> URL: https://issues.apache.org/jira/browse/SPARK-24687
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: zhoukang
>Priority: Major
> Attachments: hanging-960.png
>
>
> When the exception below is thrown:
> {code:java}
> Exception in thread "dag-scheduler-event-loop" 
> java.lang.NoClassDefFoundError: 
> Lcom/xxx/data/recommend/aggregator/queue/QueueName;
>   at java.lang.Class.getDeclaredFields0(Native Method)
>   at java.lang.Class.privateGetDeclaredFields(Class.java:2436)
>   at java.lang.Class.getDeclaredField(Class.java:1946)
>   at 
> java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)
>   at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
>   at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
>   at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
>   at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
>   at java.io.ObjectOutputStream.writeClass(ObjectOutputStream.java:1212)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1119)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> {code}
> The stage will hang forever instead of aborting.
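
For context, a hedged sketch (illustrative names only, not the actual DAGScheduler code) of why the stage hangs rather than aborts: scala.util.control.NonFatal does not match java.lang.NoClassDefFoundError (a LinkageError), so if the scheduler's catch around task serialization is built on NonFatal, as the report suggests, the error escapes and the stage is never aborted. Catching Throwable at that point would let the scheduler abort instead of hanging.

{code:scala}
// Illustrative sketch: abort the stage on any failure during task
// serialization, including fatal errors such as NoClassDefFoundError
// that scala.util.control.NonFatal would not match.
def serializeOrAbort(
    serialize: () => Array[Byte],
    abortStage: String => Unit): Option[Array[Byte]] =
  try Some(serialize())
  catch {
    case t: Throwable =>
      abortStage(s"Task serialization failed: $t")
      None
  }
{code}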





[jira] [Assigned] (SPARK-24680) spark.executorEnv.JAVA_HOME does not take effect in Standalone mode

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24680:


Assignee: (was: Apache Spark)

> spark.executorEnv.JAVA_HOME does not take effect in Standalone mode
> ---
>
> Key: SPARK-24680
> URL: https://issues.apache.org/jira/browse/SPARK-24680
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1, 2.2.1, 2.3.1
>Reporter: StanZhai
>Priority: Minor
>
> spark.executorEnv.JAVA_HOME does not take effect when a Worker starts an 
> Executor process in Standalone mode.
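
For reference, a minimal sketch (master URL and path are assumptions) of the configuration the report refers to; in Standalone mode the Worker reportedly launches the Executor JVM without honouring it:

{code:scala}
import org.apache.spark.SparkConf

// Illustrative only: setting an executor environment variable via the
// spark.executorEnv.* prefix. According to the report, JAVA_HOME set this
// way is ignored when a Standalone Worker launches the Executor.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")                // Standalone mode (hypothetical URL)
  .setAppName("executor-env-java-home")
  .set("spark.executorEnv.JAVA_HOME", "/opt/jdk1.8.0")  // hypothetical path
{code}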






[jira] [Commented] (SPARK-24680) spark.executorEnv.JAVA_HOME does not take effect in Standalone mode

2018-06-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527044#comment-16527044
 ] 

Apache Spark commented on SPARK-24680:
--

User 'stanzhai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21663

> spark.executorEnv.JAVA_HOME does not take effect in Standalone mode
> ---
>
> Key: SPARK-24680
> URL: https://issues.apache.org/jira/browse/SPARK-24680
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1, 2.2.1, 2.3.1
>Reporter: StanZhai
>Priority: Minor
>
> spark.executorEnv.JAVA_HOME does not take effect when a Worker starts an 
> Executor process in Standalone mode.






[jira] [Assigned] (SPARK-24680) spark.executorEnv.JAVA_HOME does not take effect in Standalone mode

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24680:


Assignee: Apache Spark

> spark.executorEnv.JAVA_HOME does not take effect in Standalone mode
> ---
>
> Key: SPARK-24680
> URL: https://issues.apache.org/jira/browse/SPARK-24680
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1, 2.2.1, 2.3.1
>Reporter: StanZhai
>Assignee: Apache Spark
>Priority: Minor
>
> spark.executorEnv.JAVA_HOME does not take effect when a Worker starts an 
> Executor process in Standalone mode.






[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-06-28 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527036#comment-16527036
 ] 

Takeshi Yamamuro commented on SPARK-24498:
--

I added some benchmark results under different conditions: sf=5 and 
splitConsumeFuncByOperator=false (I still didn't see any benefit, though).

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less 
> compilation time than Janino. However, in some cases, Janino is better. 
> We should support both for our runtime codegen. Janino will still be our 
> default runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696






[jira] [Commented] (SPARK-24673) scala sql function from_utc_timestamp second argument could be Column instead of String

2018-06-28 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527020#comment-16527020
 ] 

Takeshi Yamamuro commented on SPARK-24673:
--

I'm not 100% sure, but we probably cannot touch the existing 
signature. So we would need to add a new entry `from_utc_timestamp(ts: Column, tz: 
Column)` there.
However, user-facing API issues are more sensitive, so you should ask 
qualified committers first before making a PR: [~smilegator] [~ueshin]
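
For reference, a hedged sketch of how the more powerful SQL form from the description can be reached from Scala today via expr(); the DataFrame and column names match the schema in the issue and are otherwise assumptions:

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

// Workaround sketch until a from_utc_timestamp(ts: Column, tz: Column)
// overload exists: route through the SQL expression, which already accepts
// a column-valued time zone. `df` is assumed to have the ts/tz columns.
def withLocalTimestamp(df: DataFrame): DataFrame =
  df.withColumn("local_ts", expr("from_utc_timestamp(ts, tz)"))
{code}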

> scala sql function from_utc_timestamp second argument could be Column instead 
> of String
> ---
>
> Key: SPARK-24673
> URL: https://issues.apache.org/jira/browse/SPARK-24673
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Antonio Murgia
>Priority: Minor
>
> As of 2.3.1 the Scala API for the built-in function from_utc_timestamp 
> (org.apache.spark.sql.functions#from_utc_timestamp) is less powerful than its 
> SQL counterpart. In particular, given a dataset/dataframe with the following 
> schema:
> {code:java}
> CREATE TABLE MY_TABLE (
>   ts TIMESTAMP,
>   tz STRING
> ){code}
> from the SQL api I can do something like:
> {code:java}
> SELECT FROM_UTC_TIMESTAMP(TS, TZ){code}
> while from the programmatic API I simply cannot, because in
> {code:java}
> functions.from_utc_timestamp(ts: Column, tz: String){code}
> the second argument is a String.






[jira] [Commented] (SPARK-24643) from_json should accept an aggregate function as schema

2018-06-28 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527013#comment-16527013
 ] 

Hyukjin Kwon commented on SPARK-24643:
--

[~maxgekk] shall we leave this closed for now?

> from_json should accept an aggregate function as schema
> ---
>
> Key: SPARK-24643
> URL: https://issues.apache.org/jira/browse/SPARK-24643
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the *from_json()* function accepts only string literals as schema:
>  - Checking of schema argument inside of JsonToStructs: 
> [https://github.com/apache/spark/blob/b8f27ae3b34134a01998b77db4b7935e7f82a4fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L530]
>  - Accepting only string literal: 
> [https://github.com/apache/spark/blob/b8f27ae3b34134a01998b77db4b7935e7f82a4fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L749-L752]
> JsonToStructs should be modified to accept results of aggregate functions 
> like *infer_schema* (see SPARK-24642). It should be possible to write SQL 
> like:
> {code:sql}
> select from_json(json_col, infer_schema(json_col)) from json_table
> {code}
> Here is a test case with existing aggregate function - *first()*:
> {code:sql}
> create temporary view schemas(schema) as select * from values
>   ('struct'),
>   ('map');
> select from_json('{"a":1}', first(schema)) from schemas;
> {code}






[jira] [Created] (SPARK-24686) Provide spark distributions for hadoop-2.8 rather than hadoop-2.7 as releases on apache mirrors

2018-06-28 Thread t oo (JIRA)
t oo created SPARK-24686:


 Summary: Provide spark distributions for hadoop-2.8 rather than 
hadoop-2.7 as releases on apache mirrors
 Key: SPARK-24686
 URL: https://issues.apache.org/jira/browse/SPARK-24686
 Project: Spark
  Issue Type: Improvement
  Components: Build, Spark Core
Affects Versions: 2.3.1
Reporter: t oo


Hadoop 2.7 is quite old; it's time to move on to providing released Spark builds for 
the Hadoop 2.8 distribution on the Apache mirrors.






[jira] [Commented] (SPARK-18258) Sinks need access to offset representation

2018-06-28 Thread Richard Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526936#comment-16526936
 ] 

Richard Yu commented on SPARK-18258:


{quote} * We need agreement on whether it is worth making a change to the 
public Sink api (probably not any time soon, judging from the spark 3.0 vs 2.4 
discussion), or whether there is a different way to accomplish the goal. - Cody
{quote}
cc [~rxin] [~lwlin] [~marmbrus] Hi all, I would like to poll your thoughts on 
this issue.

> Sinks need access to offset representation
> --
>
> Key: SPARK-18258
> URL: https://issues.apache.org/jira/browse/SPARK-18258
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> Transactional "exactly-once" semantics for output require storing an offset 
> identifier in the same transaction as results.
> The Sink.addBatch method currently only has access to batchId and data, not 
> the actual offset representation.
> I want to store the actual offsets, so that they are recoverable as long as 
> the results are and I'm not locked in to a particular streaming engine.
> I could see this being accomplished by adding parameters to Sink.addBatch for 
> the starting and ending offsets (either the offsets themselves, or the 
> SPARK-17829 string/json representation).  That would be an API change, but if 
> there's another way to map batch ids to offset representations without 
> changing the Sink api that would work as well.  
> I'm assuming we don't need the same level of access to offsets throughout a 
> job as e.g. the Kafka dstream gives, because Sinks are the main place that 
> should need them.
> After SPARK-17829 is complete and offsets have a .json method, an api for 
> this ticket might look like
> {code}
> trait Sink {
>   def addBatch(batchId: Long, data: DataFrame, start: OffsetSeq, end: 
> OffsetSeq): Unit
> {code}
> where start and end were provided by StreamExecution.runBatch using 
> committedOffsets and availableOffsets.  
> I'm not 100% certain that the offsets in the seq could always be mapped back 
> to the correct source when restarting complicated multi-source jobs, but I 
> think it'd be sufficient.  Passing the string/json representation of the seq 
> instead of the seq itself would probably be sufficient as well, but the 
> convention of rendering a None as "-" in the json is maybe a little 
> idiosyncratic to parse, and the constant defining that is private.






[jira] [Resolved] (SPARK-24386) implement continuous processing coalesce(1)

2018-06-28 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-24386.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21560
[https://github.com/apache/spark/pull/21560]

> implement continuous processing coalesce(1)
> ---
>
> Key: SPARK-24386
> URL: https://issues.apache.org/jira/browse/SPARK-24386
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 3.0.0
>
>
> [~marmbrus] suggested this as a good implementation checkpoint. If we do the 
> shuffle reader and writer correctly, it should be easy to make a custom 
> coalesce(1) execution for continuous processing using them, without having to 
> implement the logic for shuffle writers finding out where shuffle readers are 
> located. (The coalesce(1) can just get the RpcEndpointRef directly from the 
> reader and pass it to the writers.)






[jira] [Assigned] (SPARK-24386) implement continuous processing coalesce(1)

2018-06-28 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-24386:
-

Assignee: Jose Torres

> implement continuous processing coalesce(1)
> ---
>
> Key: SPARK-24386
> URL: https://issues.apache.org/jira/browse/SPARK-24386
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
>
> [~marmbrus] suggested this as a good implementation checkpoint. If we do the 
> shuffle reader and writer correctly, it should be easy to make a custom 
> coalesce(1) execution for continuous processing using them, without having to 
> implement the logic for shuffle writers finding out where shuffle readers are 
> located. (The coalesce(1) can just get the RpcEndpointRef directly from the 
> reader and pass it to the writers.)






[jira] [Commented] (SPARK-24662) Structured Streaming should support LIMIT

2018-06-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526859#comment-16526859
 ] 

Apache Spark commented on SPARK-24662:
--

User 'mukulmurthy' has created a pull request for this issue:
https://github.com/apache/spark/pull/21662

> Structured Streaming should support LIMIT
> -
>
> Key: SPARK-24662
> URL: https://issues.apache.org/jira/browse/SPARK-24662
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mukul Murthy
>Priority: Major
>
> Make structured streams support the LIMIT operator. 
> This will undo SPARK-24525 as the limit operator would be a superior solution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24662) Structured Streaming should support LIMIT

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24662:


Assignee: (was: Apache Spark)

> Structured Streaming should support LIMIT
> -
>
> Key: SPARK-24662
> URL: https://issues.apache.org/jira/browse/SPARK-24662
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mukul Murthy
>Priority: Major
>
> Make structured streams support the LIMIT operator. 
> This will undo SPARK-24525 as the limit operator would be a superior solution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24662) Structured Streaming should support LIMIT

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24662:


Assignee: Apache Spark

> Structured Streaming should support LIMIT
> -
>
> Key: SPARK-24662
> URL: https://issues.apache.org/jira/browse/SPARK-24662
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mukul Murthy
>Assignee: Apache Spark
>Priority: Major
>
> Make structured streams support the LIMIT operator. 
> This will undo SPARK-24525 as the limit operator would be a superior solution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-06-28 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24530:
-

Assignee: Hyukjin Kwon

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached the 
> screenshot from Spark 2.3 docs and master docs generated on my local. Not 
> sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is our released doc status. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24662) Structured Streaming should support LIMIT

2018-06-28 Thread Mukul Murthy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526848#comment-16526848
 ] 

Mukul Murthy commented on SPARK-24662:
--

Calling .limit(n) on a DataFrame (or in SQL, SELECT ... LIMIT n) returns only n 
rows from that query. limit is currently not supported on streaming dataframes; 
my fix is going to support it for streams writing in append and complete output 
modes. It's still going to be unsupported for streams in update output mode, 
because for updating streams, limit doesn't make sense.
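
For illustration, a minimal sketch of the intended usage once this lands, assuming the existing batch-style limit(n) call simply becomes legal on streaming DataFrames; the source, sink and query name below are placeholders:

{code:scala}
// Sketch of streaming LIMIT usage proposed in SPARK-24662; today .limit() on a
// streaming DataFrame is rejected at analysis time.
import org.apache.spark.sql.SparkSession

object StreamingLimitSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("streaming-limit-sketch").getOrCreate()

    val limited = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 10)
      .load()
      .limit(5)                            // keep at most 5 rows of the result

    val query = limited.writeStream
      .format("memory")                    // queryable in-memory sink for demos
      .queryName("limited_rate")
      .outputMode("append")                // one of the modes the fix targets (plus complete)
      .start()

    query.awaitTermination()
  }
}
{code}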

> Structured Streaming should support LIMIT
> -
>
> Key: SPARK-24662
> URL: https://issues.apache.org/jira/browse/SPARK-24662
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mukul Murthy
>Priority: Major
>
> Make structured streams support the LIMIT operator. 
> This will undo SPARK-24525 as the limit operator would be a superior solution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24684) DAGScheduler reports the wrong attempt number to the commit coordinator

2018-06-28 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526834#comment-16526834
 ] 

Ryan Blue commented on SPARK-24684:
---

Yeah, I just backported this wrong and moved to using unique ids in the 
canCommit calls. I don't think it affects master or the on-going patch 
releases. Sorry for the false alarm.

> DAGScheduler reports the wrong attempt number to the commit coordinator
> ---
>
> Key: SPARK-24684
> URL: https://issues.apache.org/jira/browse/SPARK-24684
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.3, 2.3.2
>Reporter: Ryan Blue
>Priority: Major
>
> SPARK-24552 changes writers to pass the task ID to the output coordinator so 
> that the coordinator tracks each task uniquely because attempt numbers can be 
> reused across stage attempts. However, the DAGScheduler still passes the 
> attempt number when notifying the coordinator that a task has finished. The 
> result is that when a task is authorized and then fails due to OOM or a 
> similar error, the scheduler is notified but doesn't remove the commit 
> authorization because the attempt number doesn't match. This causes infinite 
> task retries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24679) Download page should not link to unreleased code

2018-06-28 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24679.

Resolution: Fixed
  Assignee: Luciano Resende

> Download page should not link to unreleased code
> 
>
> Key: SPARK-24679
> URL: https://issues.apache.org/jira/browse/SPARK-24679
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.1
>Reporter: Luciano Resende
>Assignee: Luciano Resende
>Priority: Major
>
> The download pages currently link to the git code repository.
> Whilst the instructions show how to check out master or a particular release 
> branch, this also gives access to the rest of the repo, i.e. to non-released 
> code.
> Links to code repos should only be published on pages intended for developers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23940) High-order function: transform_values(map, function) → map

2018-06-28 Thread Neha Patil (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526808#comment-16526808
 ] 

Neha Patil commented on SPARK-23940:


I can work on this one.

> High-order function: transform_values(map, function) → 
> map
> ---
>
> Key: SPARK-23940
> URL: https://issues.apache.org/jira/browse/SPARK-23940
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map that applies function to each entry of map and transforms the 
> values.
> {noformat}
> SELECT transform_values(MAP(ARRAY[], ARRAY[]), (k, v) -> v + 1); -- {}
> SELECT transform_values(MAP(ARRAY [1, 2, 3], ARRAY [10, 20, 30]), (k, v) -> v 
> + k); -- {1 -> 11, 2 -> 22, 3 -> 33}
> SELECT transform_values(MAP(ARRAY [1, 2, 3], ARRAY ['a', 'b', 'c']), (k, v) 
> -> k * k); -- {1 -> 1, 2 -> 4, 3 -> 9}
> SELECT transform_values(MAP(ARRAY ['a', 'b'], ARRAY [1, 2]), (k, v) -> k || 
> CAST(v as VARCHAR)); -- {a -> a1, b -> b2}
> SELECT transform_values(MAP(ARRAY [1, 2], ARRAY [1.0, 1.4]), -- {1 -> 
> one_1.0, 2 -> two_1.4}
> (k, v) -> MAP(ARRAY[1, 2], ARRAY['one', 'two'])[k] || 
> '_' || CAST(v AS VARCHAR));
> {noformat}
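
For reference, the same semantics in plain Scala (not Spark code; it only shows the entry-wise transformation the function is expected to perform):

{code:scala}
// Rebuild a map by applying a (key, value) => newValue function to every entry.
object TransformValuesSemantics extends App {
  val m = Map(1 -> 10, 2 -> 20, 3 -> 30)
  val transformed = m.map { case (k, v) => k -> (v + k) }
  println(transformed)  // Map(1 -> 11, 2 -> 22, 3 -> 33), matching the second Presto example
}
{code}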



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24439) Add distanceMeasure to BisectingKMeans in PySpark

2018-06-28 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-24439.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21557
[https://github.com/apache/spark/pull/21557]

> Add distanceMeasure to BisectingKMeans in PySpark
> -
>
> Key: SPARK-24439
> URL: https://issues.apache.org/jira/browse/SPARK-24439
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.0
>
>
> https://issues.apache.org/jira/browse/SPARK-23412 added distanceMeasure to 
> BisectingKMeans. We will do the same for PySpark.
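
For reference, a hedged sketch of the Scala-side usage that SPARK-23412 enabled, which the PySpark API mirrors; the toy feature vectors are placeholders:

{code:scala}
// Sketch: setting the distanceMeasure parameter on BisectingKMeans in Scala.
import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object BisectingKMeansDistanceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bkm-distance-sketch").getOrCreate()
    import spark.implicits._

    val data = Seq(Vectors.dense(1.0, 0.1), Vectors.dense(1.1, 0.2), Vectors.dense(0.1, 9.0))
      .map(Tuple1.apply)
      .toDF("features")

    val bkm = new BisectingKMeans()
      .setK(2)
      .setDistanceMeasure("cosine")        // the parameter introduced by SPARK-23412
    val model = bkm.fit(data)
    model.clusterCenters.foreach(println)
  }
}
{code}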



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24439) Add distanceMeasure to BisectingKMeans in PySpark

2018-06-28 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-24439:


Assignee: Huaxin Gao

> Add distanceMeasure to BisectingKMeans in PySpark
> -
>
> Key: SPARK-24439
> URL: https://issues.apache.org/jira/browse/SPARK-24439
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-23412 added distanceMeasure to 
> BisectingKMeans. We will do the same for PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24685) Adjust release scripts to build all versions for older releases

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24685:


Assignee: (was: Apache Spark)

> Adjust release scripts to build all versions for older releases
> ---
>
> Key: SPARK-24685
> URL: https://issues.apache.org/jira/browse/SPARK-24685
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> If using the current release scripts to build old branches (e.g. 2.1), some 
> hadoop versions are missing, since they were removed in newer versions of 
> Spark.
> To keep the RM job consistent for multiple branches, let's keep support for 
> older versions in the scripts until we really decide they're EOL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24685) Adjust release scripts to build all versions for older releases

2018-06-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526764#comment-16526764
 ] 

Apache Spark commented on SPARK-24685:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21661

> Adjust release scripts to build all versions for older releases
> ---
>
> Key: SPARK-24685
> URL: https://issues.apache.org/jira/browse/SPARK-24685
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> If using the current release scripts to build old branches (e.g. 2.1), some 
> hadoop versions are missing, since they were removed in newer versions of 
> Spark.
> To keep the RM job consistent for multiple branches, let's keep support for 
> older versions in the scripts until we really decide they're EOL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24685) Adjust release scripts to build all versions for older releases

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24685:


Assignee: Apache Spark

> Adjust release scripts to build all versions for older releases
> ---
>
> Key: SPARK-24685
> URL: https://issues.apache.org/jira/browse/SPARK-24685
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> If using the current release scripts to build old branches (e.g. 2.1), some 
> hadoop versions are missing, since they were removed in newer versions of 
> Spark.
> To keep the RM job consistent for multiple branches, let's keep support for 
> older versions in the scripts until we really decide they're EOL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24685) Adjust release scripts to build all versions for older releases

2018-06-28 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-24685:
--

 Summary: Adjust release scripts to build all versions for older 
releases
 Key: SPARK-24685
 URL: https://issues.apache.org/jira/browse/SPARK-24685
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


If using the current release scripts to build old branches (e.g. 2.1), some 
hadoop versions are missing, since they were removed in newer versions of Spark.

To keep the RM job consistent for multiple branches, let's keep support for 
older versions in the scripts until we really decide they're EOL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24670) How to stream only newer files from a folder in Apache Spark?

2018-06-28 Thread Mahbub Murshed (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahbub Murshed resolved SPARK-24670.

Resolution: Fixed

The problem with the count difference was ultimately solved by setting the 
maxFilesPerTrigger option to a low number. Initially, it was set to 1000, 
meaning Spark would try to process 1000 files at a time. For the given data, 
that means processing about 50 days of data, which would take about 2 days to 
complete, before everything is written to disk. Setting it to a lower number 
solves the problem.
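
For reference, a minimal sketch of the configuration described above; the paths and schema are placeholders, not taken from the issue:

{code:scala}
// Cap how many new files each micro-batch reads, so output is flushed incrementally
// instead of only after one enormous first batch.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

object SmallBatchFileStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("file-stream-sketch").getOrCreate()

    // File sources require an explicit schema when streaming.
    val hitsSchema = new StructType()
      .add("user_id", StringType)
      .add("event_time", TimestampType)

    val hits = spark.readStream
      .schema(hitsSchema)
      .option("maxFilesPerTrigger", 10)    // read at most 10 new files per micro-batch
      .csv("gs://example-bucket/hits/")    // hypothetical input folder

    val query = hits.writeStream
      .format("parquet")
      .option("path", "s3a://example-bucket/hits_parquet/")                    // hypothetical output
      .option("checkpointLocation", "s3a://example-bucket/checkpoints/hits/")  // required for file sink
      .start()

    query.awaitTermination()
  }
}
{code}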

> How to stream only newer files from a folder in Apache Spark?
> -
>
> Key: SPARK-24670
> URL: https://issues.apache.org/jira/browse/SPARK-24670
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Mahbub Murshed
>Priority: Major
>
> Background:
> I have a directory in Google Cloud Storage containing files for 1.5 years of 
> data. The files are named as hits__.csv. For example, for June 
> 24, say there are three files, hits_20180624_000.csv, hits_20180624_001.csv, 
> hits_20180624_002.csv. etc. The folder has files since January 2017. New 
> files are dropped in the folder every day.
> I am reading the files using Spark streaming and writing to AWS S3. 
> Problem:
> For the first batch Spark processes ALL files in the folder. It will take 
> about a month to complete the entire set.
> Moreover, when writing out the data, Spark isn't completely writing out each 
> day's data until the entire folder is complete.
> Example:
> Say each input file contains 100,000 records.
> Input:
> hits_20180624_000.csv
> hits_20180624_001.csv
> hits_20180624_002.csv
> hits_20180623_000.csv
> hits_20180623_001.csv
> ...
> hits_20170101_000.csv
> hits_20170101_001.csv
> Processing:
> Drops half the records (say). Each output file should contain 50,000 records per 
> day.
> Output Expected (number of file may be different):
> year=2018/month=6/day=24/hash0.parquet
> year=2018/month=6/day=24/hash1.parquet
> year=2018/month=6/day=24/hash2.parquet
> year=2018/month=6/day=23/hash0.parquet
> year=2018/month=6/day=23/hash1.parquet
> ...
> Problem:
> Each day contains fewer than 50,000 records, unless the entire batch is complete. 
> In a test with a small subset this behavior was reproduced.
> Question:
> Is there a way to configure Spark to not load older files, even for the first 
> load? Why is Spark not writing out the remaining records?
> Things I tried:
> 1. A trigger of 1 hr
> 2. Watermarking based on eventtime
> [1]: 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamOptions.scala



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24408) Move abs function to math_funcs group

2018-06-28 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-24408.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Thanks for helping improve our docs :)

> Move abs function to math_funcs group
> -
>
> Key: SPARK-24408
> URL: https://issues.apache.org/jira/browse/SPARK-24408
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Assignee: Jacek Laskowski
>Priority: Trivial
> Fix For: 2.4.0
>
>
> The math function {{abs}} is not in the {{math_funcs}} group. It should 
> really be.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24408) Move abs function to math_funcs group

2018-06-28 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-24408:
---

Assignee: Jacek Laskowski

> Move abs function to math_funcs group
> -
>
> Key: SPARK-24408
> URL: https://issues.apache.org/jira/browse/SPARK-24408
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Assignee: Jacek Laskowski
>Priority: Trivial
>
> The math function {{abs}} is not in the {{math_funcs}} group. It should 
> really be.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24408) Move abs function to math_funcs group

2018-06-28 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-24408:

Description: A few math function ( {{abs}} )  are is in {{math_funcs}} 
group. It should really be.  (was: A few math functions ( {{abs}} , 
{{bitwiseNOT}}, {{isnan}}, {{nanvl}}) are not in {{math_funcs}} group. They 
should really be.)

> Move abs function to math_funcs group
> -
>
> Key: SPARK-24408
> URL: https://issues.apache.org/jira/browse/SPARK-24408
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> The math function {{abs}} is not in the {{math_funcs}} group. It should 
> really be.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24408) Move abs function to math_funcs group

2018-06-28 Thread Jacek Laskowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-24408:

Summary: Move abs function to math_funcs group  (was: Move abs, bitwiseNOT, 
isnan, nanvl functions to math_funcs group)

> Move abs function to math_funcs group
> -
>
> Key: SPARK-24408
> URL: https://issues.apache.org/jira/browse/SPARK-24408
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> A few math functions ( {{abs}} , {{bitwiseNOT}}, {{isnan}}, {{nanvl}}) are 
> not in {{math_funcs}} group. They should really be.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23120) Add PMML pipeline export support to PySpark

2018-06-28 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-23120.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Add PMML pipeline export support to PySpark
> ---
>
> Key: SPARK-23120
> URL: https://issues.apache.org/jira/browse/SPARK-23120
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Major
> Fix For: 2.4.0
>
>
> Once we have Scala support go back and fill in Python.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24684) DAGScheduler reports the wrong attempt number to the commit coordinator

2018-06-28 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526714#comment-16526714
 ] 

Marcelo Vanzin commented on SPARK-24684:


The code still uses the attempt number currently (and there are unit tests for 
the case you're talking about).

SPARK-24611 was filed to track switching to task IDs and other enhancements 
which we'll only do in master.

> DAGScheduler reports the wrong attempt number to the commit coordinator
> ---
>
> Key: SPARK-24684
> URL: https://issues.apache.org/jira/browse/SPARK-24684
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.3, 2.3.2
>Reporter: Ryan Blue
>Priority: Major
>
> SPARK-24552 changes writers to pass the task ID to the output coordinator so 
> that the coordinator tracks each task uniquely because attempt numbers can be 
> reused across stage attempts. However, the DAGScheduler still passes the 
> attempt number when notifying the coordinator that a task has finished. The 
> result is that when a task is authorized and then fails due to OOM or a 
> similar error, the scheduler is notified but doesn't remove the commit 
> authorization because the attempt number doesn't match. This causes infinite 
> task retries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14712) spark.ml LogisticRegressionModel.toString should summarize model

2018-06-28 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-14712:
---

Assignee: Bravo Zhang

> spark.ml LogisticRegressionModel.toString should summarize model
> 
>
> Key: SPARK-14712
> URL: https://issues.apache.org/jira/browse/SPARK-14712
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Bravo Zhang
>Priority: Trivial
>  Labels: starter
> Fix For: 2.4.0
>
>
> spark.mllib LogisticRegressionModel overrides toString to print a little 
> model info.  We should do the same in spark.ml.  I'd recommend:
> * super.toString
> * numClasses
> * numFeatures
> We should also override {{__repr__}} in pyspark to do the same.
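
For illustration, a toy example of the kind of summary suggested above; this is not the merged spark.ml implementation, just the shape of the output:

{code:scala}
// Toy sketch: a toString that summarizes a model with its uid, numClasses and numFeatures.
object ToStringSketch extends App {
  class ToyClassifierModel(val uid: String, val numClasses: Int, val numFeatures: Int) {
    override def toString: String =
      s"ToyClassifierModel: uid=$uid, numClasses=$numClasses, numFeatures=$numFeatures"
  }

  println(new ToyClassifierModel("logreg_1234", 2, 20))
  // prints: ToyClassifierModel: uid=logreg_1234, numClasses=2, numFeatures=20
}
{code}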



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14712) spark.ml LogisticRegressionModel.toString should summarize model

2018-06-28 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-14712.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> spark.ml LogisticRegressionModel.toString should summarize model
> 
>
> Key: SPARK-14712
> URL: https://issues.apache.org/jira/browse/SPARK-14712
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Bravo Zhang
>Priority: Trivial
>  Labels: starter
> Fix For: 2.4.0
>
>
> spark.mllib LogisticRegressionModel overrides toString to print a little 
> model info.  We should do the same in spark.ml.  I'd recommend:
> * super.toString
> * numClasses
> * numFeatures
> We should also override {{__repr__}} in pyspark to do the same.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24682) from_json / to_json do not handle java.sql.Date inside Maps correctly

2018-06-28 Thread Patrick McGloin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526687#comment-16526687
 ] 

Patrick McGloin commented on SPARK-24682:
-

I would like to work on this.

> from_json / to_json do not handle java.sql.Date inside Maps correctly
> -
>
> Key: SPARK-24682
> URL: https://issues.apache.org/jira/browse/SPARK-24682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Patrick McGloin
>Priority: Minor
>
> When a date is one of the types inside a Map, the from_json and to_json 
> functions do not handle it correctly.  A date would be persisted like this: 
> "17710".  And the from_json tries to read it in it complains with the error:  
> java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
> be cast to java.lang.Integer
> Consider the following test, which will fail on the final show: 
> case class UnitTestCaseClassWithDateInsideMap(map: Map[Date, Int])
> 
> test("Test a Date as key in a Map") {
>   val map = UnitTestCaseClassWithDateInsideMap(Map(Date.valueOf("2018-06-28") -> 1))
>   val options = Map("timestampFormat" -> "/MM/dd HH:mm:ss.SSS", "dateFormat" -> "/MM/dd")
>   val schema = Encoders.product[UnitTestCaseClassWithDateInsideMap].schema
> 
>   val mapDF = Seq(map).toDF()
>   val jsonDF = mapDF.select(to_json(struct(mapDF.columns.head, mapDF.columns.tail: _*), options))
>   jsonDF.show()
> 
>   val jsonString = jsonDF.map(_.getString(0)).collect().head
> 
>   val stringDF = Seq(jsonString).toDF("json")
>   val parsedDF = stringDF.select(from_json($"json", schema, options))
>   parsedDF.show()
> }
> The result of the line "jsonDF.show()" is as follows: 
> +---+ 
> |structstojson(named_struct(NamePlaceholder(), map))| 
> +---+ 
> |                                \{"map":{"17710":1}}| 
> +---+ 
> As can be seen the date is not formatted correctly.  The error with  
> "parsedDF.show()" is: 
> java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
> be cast to java.lang.Integer 
> As you can see, I have added options with date formats, but that does not 
> have any impact.  If I do the same test with a Date outside a Map, it works 
> as expected.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24684) DAGScheduler reports the wrong attempt number to the commit coordinator

2018-06-28 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-24684.
---
Resolution: Not A Problem

Closing this. In master, the attempt number is still used. Looks like this was 
just backported incorrectly by me.

> DAGScheduler reports the wrong attempt number to the commit coordinator
> ---
>
> Key: SPARK-24684
> URL: https://issues.apache.org/jira/browse/SPARK-24684
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.3, 2.3.2
>Reporter: Ryan Blue
>Priority: Major
>
> SPARK-24552 changes writers to pass the task ID to the output coordinator so 
> that the coordinator tracks each task uniquely because attempt numbers can be 
> reused across stage attempts. However, the DAGScheduler still passes the 
> attempt number when notifying the coordinator that a task has finished. The 
> result is that when a task is authorized and then fails due to OOM or a 
> similar error, the scheduler is notified but doesn't remove the commit 
> authorization because the attempt number doesn't match. This causes infinite 
> task retries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24684) DAGScheduler reports the wrong attempt number to the commit coordinator

2018-06-28 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-24684:
-

 Summary: DAGScheduler reports the wrong attempt number to the 
commit coordinator
 Key: SPARK-24684
 URL: https://issues.apache.org/jira/browse/SPARK-24684
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.1.3, 2.3.2
Reporter: Ryan Blue


SPARK-24552 changes writers to pass the task ID to the output coordinator so 
that the coordinator tracks each task uniquely because attempt numbers can be 
reused across stage attempts. However, the DAGScheduler still passes the 
attempt number when notifying the coordinator that a task has finished. The 
result is that when a task is authorized and then fails due to OOM or a similar 
error, the scheduler is notified but doesn't remove the commit authorization 
because the attempt number doesn't match. This causes infinite task retries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24683) SparkLauncher.NO_RESOURCE doesn't work with Java applications

2018-06-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526628#comment-16526628
 ] 

Apache Spark commented on SPARK-24683:
--

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/21660

> SparkLauncher.NO_RESOURCE doesn't work with Java applications
> -
>
> Key: SPARK-24683
> URL: https://issues.apache.org/jira/browse/SPARK-24683
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> On the tip of master, after we merged the Python bindings support, Spark 
> Submit on Kubernetes no longer works with {{SparkLauncher.NO_RESOURCE}}. 
> This is because spark-submit will not set the main class as a child argument.
> See here: 
> https://github.com/apache/spark/blob/1a644afbac35c204f9ad55f86999319a9ab458c6/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L694



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24683) SparkLauncher.NO_RESOURCE doesn't work with Java applications

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24683:


Assignee: Apache Spark

> SparkLauncher.NO_RESOURCE doesn't work with Java applications
> -
>
> Key: SPARK-24683
> URL: https://issues.apache.org/jira/browse/SPARK-24683
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Assignee: Apache Spark
>Priority: Critical
>
> On the tip of master, after we merged the Python bindings support, Spark 
> Submit on Kubernetes no longer works with {{SparkLauncher.NO_RESOURCE}}. 
> This is because spark-submit will not set the main class as a child argument.
> See here: 
> https://github.com/apache/spark/blob/1a644afbac35c204f9ad55f86999319a9ab458c6/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L694



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24683) SparkLauncher.NO_RESOURCE doesn't work with Java applications

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24683:


Assignee: (was: Apache Spark)

> SparkLauncher.NO_RESOURCE doesn't work with Java applications
> -
>
> Key: SPARK-24683
> URL: https://issues.apache.org/jira/browse/SPARK-24683
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> On the tip of master, after we merged the Python bindings support, Spark 
> Submit on Kubernetes no longer works with {{SparkLauncher.NO_RESOURCE}}. 
> This is because spark-submit will not set the main class as a child argument.
> See here: 
> https://github.com/apache/spark/blob/1a644afbac35c204f9ad55f86999319a9ab458c6/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L694



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24683) SparkLauncher.NO_RESOURCE doesn't work with Java applications

2018-06-28 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-24683:
--

 Summary: SparkLauncher.NO_RESOURCE doesn't work with Java 
applications
 Key: SPARK-24683
 URL: https://issues.apache.org/jira/browse/SPARK-24683
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Matt Cheah


On the tip of master, after we merged the Python bindings support, Spark Submit 
on Kubernetes no longer works with {{SparkLauncher.NO_RESOURCE}}. This is 
because spark-submit will not set the main class as a child argument.

See here: 
https://github.com/apache/spark/blob/1a644afbac35c204f9ad55f86999319a9ab458c6/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L694
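
For context, a hedged sketch of the launch pattern this affects: submitting a JVM main class with no primary resource jar. The master URL, container image and class name are placeholders:

{code:scala}
// Sketch: launching an application via SparkLauncher with NO_RESOURCE as the app resource.
import org.apache.spark.launcher.SparkLauncher

object NoResourceLaunchSketch {
  def main(args: Array[String]): Unit = {
    val handle = new SparkLauncher()
      .setMaster("k8s://https://example-cluster:6443")      // hypothetical cluster
      .setDeployMode("cluster")
      .setAppResource(SparkLauncher.NO_RESOURCE)            // no primary jar
      .setMainClass("com.example.MySparkApp")               // hypothetical main class
      .setConf("spark.kubernetes.container.image", "example/spark:latest")
      .startApplication()

    // Poll until the launched application reaches a terminal state (simplified).
    while (!handle.getState.isFinal) Thread.sleep(1000)
  }
}
{code}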



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24530:


Assignee: (was: Apache Spark)

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached the 
> screenshot from Spark 2.3 docs and master docs generated on my local. Not 
> sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is our released doc status. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-06-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526560#comment-16526560
 ] 

Apache Spark commented on SPARK-24530:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/21659

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached the 
> screenshot from Spark 2.3 docs and master docs generated on my local. Not 
> sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is our released doc status. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24530:


Assignee: Apache Spark

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached the 
> screenshot from Spark 2.3 docs and master docs generated on my local. Not 
> sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is our released doc status. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24642) Add a function which infers schema from a JSON column

2018-06-28 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526547#comment-16526547
 ] 

Maxim Gekk commented on SPARK-24642:


> I think this is too complicated and unpredictable.

ok. I will close the PR: https://github.com/apache/spark/pull/21626

> Add a function which infers schema from a JSON column
> -
>
> Key: SPARK-24642
> URL: https://issues.apache.org/jira/browse/SPARK-24642
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Need to add new aggregate function - *infer_schema()*. The function should 
> infer schema for set of JSON strings. The result of the function is a schema 
> in DDL format (or JSON format).
> One of the use cases is passing output of *infer_schema()* to *from_json()*. 
> Currently, the from_json() function requires a schema as a mandatory 
> argument. It is possible to infer schema programmatically in Scala/Python and 
> pass it as the second argument but in SQL it is not possible. An user has to 
> pass schema as string literal in SQL. The new function should allow to use it 
> in SQL like in the example:
> {code:sql}
> select from_json(json_col, infer_schema(json_col))
> from json_table;
> {code}
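
For reference, a sketch of the programmatic workaround referred to above: infer the schema of a JSON string column via the JSON reader, then pass it to from_json. The column and table names are illustrative:

{code:scala}
// Infer a schema from the JSON strings themselves, then parse them with from_json.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json

object InferJsonSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("infer-schema-sketch").getOrCreate()
    import spark.implicits._

    val jsonTable = Seq("""{"a": 1, "b": "x"}""", """{"a": 2, "b": "y"}""").toDF("json_col")

    // Infer a schema from the JSON strings themselves.
    val inferredSchema = spark.read.json(jsonTable.select($"json_col").as[String]).schema

    // Use it where the proposed infer_schema() would appear in SQL.
    jsonTable.select(from_json($"json_col", inferredSchema).as("parsed")).show()
  }
}
{code}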



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24682) from_json / to_json do not handle java.sql.Date inside Maps correctly

2018-06-28 Thread Patrick McGloin (JIRA)
Patrick McGloin created SPARK-24682:
---

 Summary: from_json / to_json do not handle java.sql.Date inside 
Maps correctly
 Key: SPARK-24682
 URL: https://issues.apache.org/jira/browse/SPARK-24682
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Patrick McGloin


When a date is one of the types inside a Map, the from_json and to_json 
functions do not handle it correctly.  A date would be persisted like this: 
"17710".  And the from_json tries to read it in it complains with the error:  

java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
be cast to java.lang.Integer

Consider the following test, which will fail on the final show: 
case class UnitTestCaseClassWithDateInsideMap(map: Map[Date, Int])

test("Test a Date as key in a Map") {
  val map = UnitTestCaseClassWithDateInsideMap(Map(Date.valueOf("2018-06-28") -> 1))
  val options = Map("timestampFormat" -> "/MM/dd HH:mm:ss.SSS", "dateFormat" -> "/MM/dd")
  val schema = Encoders.product[UnitTestCaseClassWithDateInsideMap].schema

  val mapDF = Seq(map).toDF()
  val jsonDF = mapDF.select(to_json(struct(mapDF.columns.head, mapDF.columns.tail: _*), options))
  jsonDF.show()

  val jsonString = jsonDF.map(_.getString(0)).collect().head

  val stringDF = Seq(jsonString).toDF("json")
  val parsedDF = stringDF.select(from_json($"json", schema, options))
  parsedDF.show()
}


The result of the line "jsonDF.show()" is as follows: 

+---+ 
|structstojson(named_struct(NamePlaceholder(), map))| 
+---+ 
|                                \{"map":{"17710":1}}| 
+---+ 

As can be seen the date is not formatted correctly.  The error with  
"parsedDF.show()" is: 

java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
be cast to java.lang.Integer 

As you can see, I have added options with date formats, but that does not have 
any impact.  If I do the same test with a Date outside a Map, it works as 
expected.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24681) Cannot create a view from a table when a nested column name contains ':'

2018-06-28 Thread Adrian Ionescu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Ionescu updated SPARK-24681:
---
Description: 
Here's a patch that reproduces the issue: 
{code:java}
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
index 09c1547..29bb3db 100644 
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive 
 
import org.apache.spark.sql.{QueryTest, Row} 
import org.apache.spark.sql.execution.datasources.parquet.ParquetTest 
+import org.apache.spark.sql.functions.{lit, struct} 
import org.apache.spark.sql.hive.test.TestHiveSingleton 
 
case class Cases(lower: String, UPPER: String) 
@@ -76,4 +77,21 @@ class HiveParquetSuite extends QueryTest with ParquetTest 
with TestHiveSingleton 
  } 
} 
  } 
+ 
+  test("column names including ':' characters") { 
+    withTempPath { path => 
+  withTable("test_table") { 
+    spark.range(0) 
+  .select(struct(lit(0).as("nested:column")).as("toplevel:column")) 
+  .write.format("parquet") 
+  .option("path", path.getCanonicalPath) 
+  .saveAsTable("test_table") 
+ 
+    sql("CREATE VIEW test_view_1 AS SELECT `toplevel:column`.* FROM 
test_table") 
+    sql("CREATE VIEW test_view_2 AS SELECT * FROM test_table") 
+ 
+  } 
+    } 
+  } 
}{code}
The first "CREATE VIEW" statement succeeds, but the second one fails with:
{code:java}
org.apache.spark.SparkException: Cannot recognize hive type string: 
struct
{code}

  was:
Here's a patch that reproduces the issue: 
{code:java}
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
index 09c1547..29bb3db 100644 
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive 
 
import org.apache.spark.sql.{QueryTest, Row} 
import org.apache.spark.sql.execution.datasources.parquet.ParquetTest 
+import org.apache.spark.sql.functions.{lit, struct} 
import org.apache.spark.sql.hive.test.TestHiveSingleton 
 
case class Cases(lower: String, UPPER: String) 
@@ -76,4 +77,21 @@ class HiveParquetSuite extends QueryTest with ParquetTest 
with TestHiveSingleton 
  } 
} 
  } 
+ 
+  test("column names including ':' characters") { 
+    withTempPath { path => 
+  withTable("test_table") { 
+    spark.range(0) 
+  .select(struct(lit(0).as("nested:column")).as("toplevel:column")) 
+  .write.format("parquet") 
+  .option("path", path.getCanonicalPath) 
+  .saveAsTable("test_table") 
+ 
+    sql("CREATE VIEW test_view_1 AS SELECT `toplevel:column`.* FROM 
test_table") 
+    sql("CREATE VIEW test_view_2 AS SELECT * FROM test_table") 
+ 
+  } 
+    } 
+  } 
}{code}
 The last sql statement in there fails with:
{code:java}
org.apache.spark.SparkException: Cannot recognize hive type string: 
struct
{code}


> Cannot create a view from a table when a nested column name contains ':'
> 
>
> Key: SPARK-24681
> URL: https://issues.apache.org/jira/browse/SPARK-24681
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Adrian Ionescu
>Priority: Major
>
> Here's a patch that reproduces the issue: 
> {code:java}
> diff --git 
> a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
> b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
> index 09c1547..29bb3db 100644 
> --- 
> a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
> +++ 
> b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
> @@ -19,6 +19,7 @@ package org.apache.spark.sql.hive 
>  
> import org.apache.spark.sql.{QueryTest, Row} 
> import org.apache.spark.sql.execution.datasources.parquet.ParquetTest 
> +import org.apache.spark.sql.functions.{lit, struct} 
> import org.apache.spark.sql.hive.test.TestHiveSingleton 
>  
> case class Cases(lower: String, UPPER: String) 
> @@ -76,4 +77,21 @@ class HiveParquetSuite extends QueryTest with ParquetTest 
> with TestHiveSingleton 
>   } 
> } 
>   } 
> + 
> +  test("column names including ':' characters") { 
> +    withTempPath { path => 
> +  withTable("test_table") { 
> +    spark.range(0) 
> +  .select(struct(lit(0).as("nested:column")).as("toplevel:column")) 
> +  .write.format("parquet") 

[jira] [Created] (SPARK-24681) Cannot create a view from a table when a nested column name contains ':'

2018-06-28 Thread Adrian Ionescu (JIRA)
Adrian Ionescu created SPARK-24681:
--

 Summary: Cannot create a view from a table when a nested column 
name contains ':'
 Key: SPARK-24681
 URL: https://issues.apache.org/jira/browse/SPARK-24681
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.2.0, 2.4.0
Reporter: Adrian Ionescu


Here's a patch that reproduces the issue: 
{code:java}
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
index 09c1547..29bb3db 100644 
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive 
 
import org.apache.spark.sql.{QueryTest, Row} 
import org.apache.spark.sql.execution.datasources.parquet.ParquetTest 
+import org.apache.spark.sql.functions.{lit, struct} 
import org.apache.spark.sql.hive.test.TestHiveSingleton 
 
case class Cases(lower: String, UPPER: String) 
@@ -76,4 +77,21 @@ class HiveParquetSuite extends QueryTest with ParquetTest 
with TestHiveSingleton 
  } 
} 
  } 
+ 
+  test("column names including ':' characters") { 
+    withTempPath { path => 
+  withTable("test_table") { 
+    spark.range(0) 
+  .select(struct(lit(0).as("nested:column")).as("toplevel:column")) 
+  .write.format("parquet") 
+  .option("path", path.getCanonicalPath) 
+  .saveAsTable("test_table") 
+ 
+    sql("CREATE VIEW test_view_1 AS SELECT `toplevel:column`.* FROM 
test_table") 
+    sql("CREATE VIEW test_view_2 AS SELECT * FROM test_table") 
+ 
+  } 
+    } 
+  } 
}{code}
 The last sql statement in there fails with:
{code:java}
org.apache.spark.SparkException: Cannot recognize hive type string: 
struct
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization

2018-06-28 Thread Petar Zecevic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Petar Zecevic updated SPARK-24020:
--
Description: 
The problem we are solving is the case where you have two big tables 
partitioned by X column, but also sorted within partitions by Y column and you 
need to calculate an expensive function on the joined rows, which reduces the 
number of output rows (e.g. condition based on a spatial distance calculation). 
But you could theoretically reduce the number of joined rows for which the 
calculation itself is performed by using a range condition on the Y column. 
Something like this:

{{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }}

However, during a sort-merge join with this range condition specified, Spark 
will first cross-join all the rows with the same X value and only then try to 
apply the range condition and any function calculations. This happens because, 
inside the generated sort-merge join (SMJ) code, these extra conditions are put 
in the same block with the function being calculated and there is no way to 
evaluate these conditions before reading all the rows to be checked into memory 
(into an {{ExternalAppendOnlyUnsafeRowArray}}). If the two tables have a large 
number of rows per X, this can result in a huge number of calculations and a 
huge number of rows in executor memory, which can be unfeasible.
h3. The solution implementation

We therefore propose a change to the sort-merge join so that, when these extra 
conditions are specified, a queue is used instead of the 
ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving 
window through the values from the right relation as the left row changes. You 
could call this a combination of an equi-join and a theta join; in literature 
it is sometimes called an “epsilon join”. We call it a "sort-merge inner range 
join".
This design uses much less memory (not all rows with the same values of X need 
to be loaded into memory at once) and requires a much lower number of 
comparisons (the validity of this statement depends on the actual data and 
conditions used).
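As a rough illustration of this moving-window idea, here is a hedged sketch on plain (x, y) tuples; it only stands in for the proposed _InMemoryUnsafeRowQueue_ and the generated join code, and it assumes both inputs are already sorted by (X, Y):
{code:java}
// Sketch only: joins left and right on x with |leftY - rightY| <= d, keeping in memory
// just the right rows that fall inside the current range window.
import scala.collection.mutable

def innerRangeJoin(
    left: Iterator[(Long, Long)],          // (x, y), sorted by (x, y)
    rightInput: Iterator[(Long, Long)],    // (x, y), sorted by (x, y)
    d: Long): Iterator[((Long, Long), (Long, Long))] = {
  val right = rightInput.buffered
  val window = mutable.Queue.empty[(Long, Long)]
  left.flatMap { case (x, y) =>
    // Evict right rows that can no longer match this or any later left row.
    while (window.nonEmpty &&
        (window.head._1 < x || (window.head._1 == x && window.head._2 < y - d))) {
      window.dequeue()
    }
    // Skip right rows below the current window, then fill it up to y + d.
    while (right.hasNext &&
        (right.head._1 < x || (right.head._1 == x && right.head._2 < y - d))) {
      right.next()
    }
    while (right.hasNext && right.head._1 == x && right.head._2 <= y + d) {
      window.enqueue(right.next())
    }
    window.toList.map(r => ((x, y), r))
  }
}
{code}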
h3. The classes that need to be changed 

For implementing the described change we propose changes to these classes:
* _ExtractEquiJoinKeys_ – a pattern that needs to be extended to be able to 
recognize the case where a simple range condition with lower and upper limits 
is used on a secondary column (a column not included in the equi-join 
condition). The pattern also needs to extract the information later required 
for code generation etc.
* _InMemoryUnsafeRowQueue_ – the moving window implementation to be used 
instead of the _ExternalAppendOnlyUnsafeRowArray_ class. The rows need to be 
removed and added to/from the structure as the left key (X) changes, or the 
left secondary value (Y) changes, so the structure needs to be a queue. To make 
the change as unintrusive as possible, we propose to implement 
_InMemoryUnsafeRowQueue_ as a subclass of _ExternalAppendOnlyUnsafeRowArray_
* _JoinSelection_ – a strategy that uses _ExtractEquiJoinKeys_ and needs to be 
aware of the extracted range conditions
* _SortMergeJoinExec_ – the main implementation of the optimization. Needs to 
support two code paths: 
** when whole-stage code generation is turned off (method doExecute, which uses 
sortMergeJoinInnerRangeScanner) 
** when whole-stage code generation is turned on (methods doProduce and 
genScanner)
* _SortMergeJoinInnerRangeScanner_ – implements the SMJ with inner-range 
optimization in the case when whole-stage codegen is turned off
* _InnerJoinSuite_ – functional tests
* _JoinBenchmark_ – performance tests

h3. Triggering the optimization

The optimization should be triggered automatically when an equi-join expression 
is present AND lower and upper range conditions on a secondary column are 
specified. If the tables aren't sorted by both columns, appropriate sorts 
should be added.

To limit the impact of this change we also propose adding a new parameter 
(tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could 
be used to switch off the optimization entirely.
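
For example, a query of the following shape is the kind of statement the optimization is meant to recognize (a sketch; the table names t1/t2, the columns x/y and the distance 10 are purely illustrative), while the parameter above would allow falling back to the current behaviour:
{code:java}
// Tentative switch name taken from the proposal above; set to false to disable the optimization.
spark.conf.set("spark.sql.join.smj.useInnerRangeOptimization", "true")

val joined = spark.sql("""
  SELECT *
  FROM t1 JOIN t2
    ON t1.x = t2.x
   AND t1.y BETWEEN t2.y - 10 AND t2.y + 10
""")
{code}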

h2. Applicable use cases

Potential use-cases for this are joins based on spatial or temporal distance 
calculations.

 

  was:
The problem we are solving is the case where you have two big tables 
partitioned by X column, but also sorted within partitions by Y column and you 
need to calculate an expensive function on the joined rows, which reduces the 
number of output rows (e.g. condition based on a spatial distance calculation). 
But you could theoretically reduce the number of joined rows for which the 
calculation itself is performed by using a range condition on the Y column. 
Something like this:

{{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }}

However, during a sort-merge join with this range condition specified, 

[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization

2018-06-28 Thread Petar Zecevic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Petar Zecevic updated SPARK-24020:
--
Description: 
The problem we are solving is the case where you have two big tables 
partitioned by X column, but also sorted within partitions by Y column and you 
need to calculate an expensive function on the joined rows, which reduces the 
number of output rows (e.g. condition based on a spatial distance calculation). 
But you could theoretically reduce the number of joined rows for which the 
calculation itself is performed by using a range condition on the Y column. 
Something like this:

{{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }}

However, during a sort-merge join with this range condition specified, Spark 
will first cross-join all the rows with the same X value and only then try to 
apply the range condition and any function calculations. This happens because, 
inside the generated sort-merge join (SMJ) code, these extra conditions are put 
in the same block with the function being calculated and there is no way to 
evaluate these conditions before reading all the rows to be checked into memory 
(into an {{ExternalAppendOnlyUnsafeRowArray}}). If the two tables have a large 
number of rows per X, this can result in a huge number of calculations and a 
huge number of rows in executor memory, which can be unfeasible.
h3. The solution implementation

We therefore propose a change to the sort-merge join so that, when these extra 
conditions are specified, a queue is used instead of the 
ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving 
window through the values from the right relation as the left row changes. You 
could call this a combination of an equi-join and a theta join; in literature 
it is sometimes called an “epsilon join”. We call it a "sort-merge inner range 
join".
This design uses much less memory (not all rows with the same values of X need 
to be loaded into memory at once) and requires a much lower number of 
comparisons (the validity of this statement depends on the actual data and 
conditions used).
h3. The classes that need to be changed 

For implementing the described change we propose changes to these classes:
* _ExtractEquiJoinKeys_ – a pattern that needs to be extended to be able to 
recognize the case where a simple range condition with lower and upper limits 
is used on a secondary column (a column not included in the equi-join 
condition). The pattern also needs to extract the information later required 
for code generation etc.
* _InMemoryUnsafeRowQueue_ – the moving window implementation to be used 
instead of the _ExternalAppendOnlyUnsafeRowArray_ class. The rows need to be 
removed and added to/from the structure as the left key (X) changes, or the 
left secondary value (Y) changes, so the structure needs to be a queue. To make 
the change as unintrusive as possible, we propose to implement 
_InMemoryUnsafeRowQueue_ as a subclass of _ExternalAppendOnlyUnsafeRowArray_
* _JoinSelection_ – a strategy that uses _ExtractEquiJoinKeys_ and needs to be 
aware of the extracted range conditions
* _SortMergeJoinExec_ – the main implementation of the optimization. Needs to 
support two code paths: 
** when whole-stage code generation is turned off (method doExecute, which uses 
sortMergeJoinInnerRangeScanner) 
** when whole-stage code generation is turned on (methods doProduce and 
genScanner)
* _SortMergeJoinInnerRangeScanner_ – implements the SMJ with inner-range 
optimization in the case when whole-stage codegen is turned off
* _InnerJoinSuite_ – functional tests
* _JoinBenchmark_ – performance tests

h3. Triggering the optimization

The optimization should be triggered automatically when an equi-join expression 
is present AND lower and upper range conditions on a secondary column are 
specified. If the tables aren't sorted by both columns, appropriate sorts 
should be added.

To limit the impact of this change we also propose adding a new parameter 
(tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could 
be used to switch off the optimization entirely.

h3. Applicable use cases

Potential use-cases for this are joins based on spatial or temporal distance 
calculations.

 

  was:
The problem we are solving is the case where you have two big tables 
partitioned by X column, but also sorted within partitions by Y column and you 
need to calculate an expensive function on the joined rows, which reduces the 
number of output rows (e.g. condition based on a spatial distance calculation). 
But you could theoretically reduce the number of joined rows for which the 
calculation itself is performed by using a range condition on the Y column. 
Something like this:

{{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }}

However, during a sort-merge join with this range condition specified, 

[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization

2018-06-28 Thread Petar Zecevic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Petar Zecevic updated SPARK-24020:
--
Description: 
The problem we are solving is the case where you have two big tables 
partitioned by X column, but also sorted within partitions by Y column and you 
need to calculate an expensive function on the joined rows, which reduces the 
number of output rows (e.g. condition based on a spatial distance calculation). 
But you could theoretically reduce the number of joined rows for which the 
calculation itself is performed by using a range condition on the Y column. 
Something like this:

{{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }}

However, during a sort-merge join with this range condition specified, Spark 
will first cross-join all the rows with the same X value and only then try to 
apply the range condition and any function calculations. This happens because, 
inside the generated sort-merge join (SMJ) code, these extra conditions are put 
in the same block with the function being calculated and there is no way to 
evaluate these conditions before reading all the rows to be checked into memory 
(into an {{ExternalAppendOnlyUnsafeRowArray}}). If the two tables have a large 
number of rows per X, this can result in a huge number of calculations and a 
huge number of rows in executor memory, which can be unfeasible.
h2. The solution implementation

We therefore propose a change to the sort-merge join so that, when these extra 
conditions are specified, a queue is used instead of the 
ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving 
window through the values from the right relation as the left row changes. You 
could call this a combination of an equi-join and a theta join; in literature 
it is sometimes called an “epsilon join”. We call it a "sort-merge inner range 
join".
This design uses much less memory (not all rows with the same values of X need 
to be loaded into memory at once) and requires a much lower number of 
comparisons (the validity of this statement depends on the actual data and 
conditions used).
h3. The classes that need to be changed 

For implementing the described change we propose changes to these classes:
* _ExtractEquiJoinKeys_ – a pattern that needs to be extended to be able to 
recognize the case where a simple range condition with lower and upper limits 
is used on a secondary column (a column not included in the equi-join 
condition). The pattern also needs to extract the information later required 
for code generation etc.
* _InMemoryUnsafeRowQueue_ – the moving window implementation to be used 
instead of the _ExternalAppendOnlyUnsafeRowArray_ class. The rows need to be 
removed and added to/from the structure as the left key (X) changes, or the 
left secondary value (Y) changes, so the structure needs to be a queue. To make 
the change as unintrusive as possible, we propose to implement 
_InMemoryUnsafeRowQueue_ as a subclass of _ExternalAppendOnlyUnsafeRowArray_
* _JoinSelection_ – a strategy that uses _ExtractEquiJoinKeys_ and needs to be 
aware of the extracted range conditions
* _SortMergeJoinExec_ – the main implementation of the optimization. Needs to 
support two code paths: 
** when whole-stage code generation is turned off (method doExecute, which uses 
sortMergeJoinInnerRangeScanner) 
** when whole-stage code generation is turned on (methods doProduce and 
genScanner)
* _SortMergeJoinInnerRangeScanner_ – implements the SMJ with inner-range 
optimization in the case when whole-stage codegen is turned off
* _InnerJoinSuite_ – functional tests
* _JoinBenchmark_ – performance tests

h2. Triggering the optimization

The optimization should be triggered automatically when an equi-join expression 
is present AND lower and upper range conditions on a secondary column are 
specified. If the tables aren't sorted by both columns, appropriate sorts 
should be added.

To limit the impact of this change we also propose adding a new parameter 
(tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could 
be used to switch off the optimization entirely.

h2. Applicable use cases

Potential use-cases for this are joins based on spatial or temporal distance 
calculations.

 

  was:
The problem we are solving is the case where you have two big tables 
partitioned by X column, but also sorted within partitions by Y column and you 
need to calculate an expensive function on the joined rows, which reduces the 
number of output rows (e.g. condition based on a spatial distance calculation). 
But you could theoretically reduce the number of joined rows for which the 
calculation itself is performed by using a range condition on the Y column. 
Something like this:

{{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }}

However, during a sort-merge join with this range condition specified, 

[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization

2018-06-28 Thread Petar Zecevic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Petar Zecevic updated SPARK-24020:
--
Description: 
The problem we are solving is the case where you have two big tables 
partitioned by X column, but also sorted within partitions by Y column and you 
need to calculate an expensive function on the joined rows, which reduces the 
number of output rows (e.g. condition based on a spatial distance calculation). 
But you could theoretically reduce the number of joined rows for which the 
calculation itself is performed by using a range condition on the Y column. 
Something like this:

{{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }}

However, during a sort-merge join with this range condition specified, Spark 
will first cross-join all the rows with the same X value and only then try to 
apply the range condition and any function calculations. This happens because, 
inside the generated sort-merge join (SMJ) code, these extra conditions are put 
in the same block with the function being calculated and there is no way to 
evaluate these conditions before reading all the rows to be checked into memory 
(into an {{ExternalAppendOnlyUnsafeRowArray}}). If the two tables have a large 
number of rows per X, this can result in a huge number of calculations and a 
huge number of rows in executor memory, which can be unfeasible.
h2. The solution implementation

We therefore propose a change to the sort-merge join so that, when these extra 
conditions are specified, a queue is used instead of the 
ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving 
window through the values from the right relation as the left row changes. You 
could call this a combination of an equi-join and a theta join; in literature 
it is sometimes called an “epsilon join”. We call it a "sort-merge inner range 
join".
This design uses much less memory (not all rows with the same values of X need 
to be loaded into memory at once) and requires a much lower number of 
comparisons (the validity of this statement depends on the actual data and 
conditions used).
h3. 

For implementing the described change we propose changes to these classes:
* _ExtractEquiJoinKeys_ – a pattern that needs to be extended to be able to 
recognize the case where a simple range condition with lower and upper limits 
is used on a secondary column (a column not included in the equi-join 
condition). The pattern also needs to extract the information later required 
for code generation etc.
* _InMemoryUnsafeRowQueue_ – the moving window implementation to be used 
instead of the _ExternalAppendOnlyUnsafeRowArray_ class. The rows need to be 
removed and added to/from the structure as the left key (X) changes, or the 
left secondary value (Y) changes, so the structure needs to be a queue. To make 
the change as unintrusive as possible, we propose to implement 
_InMemoryUnsafeRowQueue_ as a subclass of _ExternalAppendOnlyUnsafeRowArray_
* _JoinSelection_ – a strategy that uses _ExtractEquiJoinKeys_ and needs to be 
aware of the extracted range conditions
* _SortMergeJoinExec_ – the main implementation of the optimization. Needs to 
support two code paths: 
** when whole-stage code generation is turned off (method doExecute, which uses 
sortMergeJoinInnerRangeScanner) 
** when whole-stage code generation is turned on (methods doProduce and 
genScanner)
* _SortMergeJoinInnerRangeScanner_ – implements the SMJ with inner-range 
optimization in the case when whole-stage codegen is turned off
* _InnerJoinSuite_ – functional tests
* _JoinBenchmark_ – performance tests

The optimization should be triggered automatically when an equi-join expression 
is present AND lower and upper range conditions on a secondary column are 
specified. If the tables aren't sorted by both columns, appropriate sorts 
should be added.

To limit the impact of this change we also propose adding a new parameter 
(tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could 
be used to switch off the optimization entirely.

Potential use-cases for this are joins based on spatial or temporal distance 
calculations.

 

  was:
The problem we are solving is the case where you have two big tables 
partitioned by X column, but also sorted by Y column (within partitions) and 
you need to calculate an expensive function on the joined rows. During a 
sort-merge join, Spark will do cross-joins of all rows that have the same X 
values and calculate the function's value on all of them. If the two tables 
have a large number of rows per X, this can result in a huge number of 
calculations.

We hereby propose an optimization that would allow you to reduce the number of 
matching rows per X using a range condition on Y columns of the two tables. 
Something like:

... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d

The way SMJ is currently 

[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization

2018-06-28 Thread Petar Zecevic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Petar Zecevic updated SPARK-24020:
--
Attachment: SMJ-innerRange-PR24020-designDoc.pdf

> Sort-merge join inner range optimization
> 
>
> Key: SPARK-24020
> URL: https://issues.apache.org/jira/browse/SPARK-24020
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Petar Zecevic
>Priority: Major
> Attachments: SMJ-innerRange-PR24020-designDoc.pdf
>
>
> The problem we are solving is the case where you have two big tables 
> partitioned by X column, but also sorted by Y column (within partitions) and 
> you need to calculate an expensive function on the joined rows. During a 
> sort-merge join, Spark will do cross-joins of all rows that have the same X 
> values and calculate the function's value on all of them. If the two tables 
> have a large number of rows per X, this can result in a huge number of 
> calculations.
> We hereby propose an optimization that would allow you to reduce the number 
> of matching rows per X using a range condition on Y columns of the two 
> tables. Something like:
> ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d
> The way SMJ is currently implemented, these extra conditions have no 
> influence on the number of rows (per X) being checked because these extra 
> conditions are put in the same block with the function being calculated.
> Here we propose a change to the sort-merge join so that, when these extra 
> conditions are specified, a queue is used instead of the 
> ExternalAppendOnlyUnsafeRowArray class. This queue would then be used as a 
> moving window across the values from the right relation as the left row 
> changes. You could call this a combination of an equi-join and a theta join 
> (we call it "sort-merge inner range join").
> Potential use-cases for this are joins based on spatial or temporal distance 
> calculations.
> The optimization should be triggered automatically when an equi-join 
> expression is present AND lower and upper range conditions on a secondary 
> column are specified. If the tables aren't sorted by both columns, 
> appropriate sorts should be added.
> To limit the impact of this change we also propose adding a new parameter 
> (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which 
> could be used to switch off the optimization entirely.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24680) spark.executorEnv.JAVA_HOME does not take effect in Standalone mode

2018-06-28 Thread StanZhai (JIRA)
StanZhai created SPARK-24680:


 Summary: spark.executorEnv.JAVA_HOME does not take effect in 
Standalone mode
 Key: SPARK-24680
 URL: https://issues.apache.org/jira/browse/SPARK-24680
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.3.1, 2.2.1, 2.1.1
Reporter: StanZhai


spark.executorEnv.JAVA_HOME does not take effect when a Worker starts an 
Executor process in Standalone mode.
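
For reference, a sketch of the configuration in question (the master URL and JDK path are placeholders): per this report, the executor JVM launched by the Worker does not pick up the configured JAVA_HOME.
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("spark://master:7077")                       // placeholder Standalone master
  .config("spark.executorEnv.JAVA_HOME", "/opt/jdk8")  // reportedly ignored by the Worker
  .appName("executor-env-java-home-check")
  .getOrCreate()
{code}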



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24673) scala sql function from_utc_timestamp second argument could be Column instead of String

2018-06-28 Thread Antonio Murgia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526307#comment-16526307
 ] 

Antonio Murgia commented on SPARK-24673:


Looks doable. Should I go with a method overload, resulting in:
{code:java}
functions.from_utc_timestamp(ts: Column, tz: String)

functions.from_utc_timestamp(ts: Column, tz: Column)
{code}
Or is there some limitation I am not aware of?

Also do you think
{code:java}
to_utc_timestamp{code}
should receive the same treatment?
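
In the meantime, here is a workaround sketch based on the SQL behaviour described in the issue (MY_TABLE and its columns follow the example schema below; this is not a proposal for the final API):
{code:java}
// expr() goes through the SQL parser, which already accepts a column for the timezone argument.
import org.apache.spark.sql.functions.expr

val withLocalTs = spark.table("MY_TABLE")
  .withColumn("local_ts", expr("from_utc_timestamp(ts, tz)"))
{code}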

> scala sql function from_utc_timestamp second argument could be Column instead 
> of String
> ---
>
> Key: SPARK-24673
> URL: https://issues.apache.org/jira/browse/SPARK-24673
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Antonio Murgia
>Priority: Minor
>
> As of 2.3.1 the scala API for the built-in function from_utc_timestamp 
> (org.apache.spark.sql.functions#from_utc_timestamp) is less powerful than its 
> SQL counter part. In particular, given a dataset/dataframe with the following 
> schema:
> {code:java}
> CREATE TABLE MY_TABLE (
>   ts TIMESTAMP,
>   tz STRING
> ){code}
> from the SQL api I can do something like:
> {code:java}
> SELECT FROM_UTC_TIMESTAMP(TS, TZ){code}
> while from the programmatic api I simply cannot because
> {code:java}
> functions.from_utc_timestamp(ts: Column, tz: String){code}
> second argument is a String.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24679) Download page should not link to unreleased code

2018-06-28 Thread Luciano Resende (JIRA)
Luciano Resende created SPARK-24679:
---

 Summary: Download page should not link to unreleased code
 Key: SPARK-24679
 URL: https://issues.apache.org/jira/browse/SPARK-24679
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.3.1
Reporter: Luciano Resende


The download pages currently link to the git code repository.

Whilst the instructions show how to check out master or a particular release 
branch, this also gives access to the rest of the repo, i.e. to non-released 
code.

Links to code repos should only be published on pages intended for developers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-06-28 Thread Jackey Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526264#comment-16526264
 ] 

Jackey Lee commented on SPARK-24630:


Main Goal:
* SQL API for StructStreaming

Benefits:
* Users who are unfamiliar with streaming can easily use SQL to run 
StructStreaming, especially when migrating offline tasks to real-time processing 
tasks.
* Supporting a SQL API in StructStreaming also makes it possible to combine 
StructStreaming with Hive. Users can store the source/sink metadata in a table and 
manage it with the Hive metastore; users who want to read this data can simply 
create a stream by accessing the table, which greatly reduces the 
development and maintenance costs of StructStreaming.
* It becomes easier to achieve unified management and access control of sources and sinks, 
and to keep tighter control over private data, especially in 
financial or security areas.

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a stream-type SQL interface, with which users 
> with little knowledge about streaming can easily develop a stream 
> processing model. In Spark, we can also support a SQL API based on 
> StructStreaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24673) scala sql function from_utc_timestamp second argument could be Column instead of String

2018-06-28 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526118#comment-16526118
 ] 

Takeshi Yamamuro commented on SPARK-24673:
--

It makes sense. Can you make a pr?

> scala sql function from_utc_timestamp second argument could be Column instead 
> of String
> ---
>
> Key: SPARK-24673
> URL: https://issues.apache.org/jira/browse/SPARK-24673
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Antonio Murgia
>Priority: Minor
>
> As of 2.3.1 the scala API for the built-in function from_utc_timestamp 
> (org.apache.spark.sql.functions#from_utc_timestamp) is less powerful than its 
> SQL counter part. In particular, given a dataset/dataframe with the following 
> schema:
> {code:java}
> CREATE TABLE MY_TABLE (
>   ts TIMESTAMP,
>   tz STRING
> ){code}
> from the SQL api I can do something like:
> {code:java}
> SELECT FROM_UTC_TIMESTAMP(TS, TZ){code}
> while from the programmatic api I simply cannot because
> {code:java}
> functions.from_utc_timestamp(ts: Column, tz: String){code}
> second argument is a String.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14834) Force adding doc for new api in pyspark with @since annotation

2018-06-28 Thread Alexander Gorokhov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526102#comment-16526102
 ] 

Alexander Gorokhov commented on SPARK-14834:


So, basically, this is about making the "since" decorator require docs to be 
written for the underlying function?


> Force adding doc for new api in pyspark with @since annotation
> --
>
> Key: SPARK-14834
> URL: https://issues.apache.org/jira/browse/SPARK-14834
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>
> This is for enforcing users to add a Python doc when adding a new Python API with 
> the @since annotation. Thinking about it again, this is only suitable when adding 
> a new API to an existing Python module. If it is a new Python module migrated 
> from the Scala API, a Python doc is not mandatory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24674) Spark on Kubernetes BLAS performance

2018-06-28 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-24674.
--
Resolution: Invalid

> Spark on Kubernetes BLAS performance
> 
>
> Key: SPARK-24674
> URL: https://issues.apache.org/jira/browse/SPARK-24674
> Project: Spark
>  Issue Type: Question
>  Components: Build, Kubernetes, MLlib
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 SNAPSHOT (as of June 25th)
> Kubernetes version 1.7.5
> Kubernetes cluster, consisting of 4 Nodes with 16 GB RAM, 8 core Intel 
> processors.
>Reporter: Dennis Aumiller
>Priority: Minor
>  Labels: performance
>
>  
> Usually native BLAS libraries speed up the execution time of CPU-heavy 
> operations as for example in MLlib quite significantly.
>  Of course, the initial error
> {code:java}
> WARN  BLAS:61 - Failed to load implementation from: 
> com.github.fommil.netlib.NativeSystemBLAS
> {code}
> is not so easy to resolve, since, as reported 
> [here|https://github.com/apache/spark/pull/19717/files/7d2b30373b2e4d8d5311e10c3f9a62a2d900d568], 
> this seems to be caused by the base image used by the Spark 
> Dockerfile.
>  Re-building spark with
> {code:java}
> -Pnetlib-lgpl
> {code}
> also does not solve the problem, but I managed to build BLAS and LAPACK into 
> Alpine, with a lot of tricks involved.
> Interestingly, I noticed that the performance of PCA in my case dropped quite 
> significantly (with BLAS support, compared to the netlib-java fallback). I am 
> aware of [#SPARK-21305] as well, but that did not help my case, either.
>  Furthermore, calling SVD on a matrix of only size 5000x5000 (density 1%) 
> already throws an error when trying to use native ARPACK, but runs perfectly 
> fine with the fallback version.
> The question would be whether there has been some investigation in that 
> direction already.
>  Or, if not, whether it would be interesting for the Spark community to 
> provide a
>  * more detailed report with respect to timings/configurations/test setup
>  * a provided Dockerfile to build Spark with BLAS/LAPACK/ARPACK using the 
> shipped Dockerfile as a basis
>   
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24674) Spark on Kubernetes BLAS performance

2018-06-28 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526101#comment-16526101
 ] 

Takeshi Yamamuro commented on SPARK-24674:
--

You should first ask in the spark-user mailing list. If you still hit an actual 
problem, feel free to reopen this.

(Also, I think this issue is not related to kubernetes...)

> Spark on Kubernetes BLAS performance
> 
>
> Key: SPARK-24674
> URL: https://issues.apache.org/jira/browse/SPARK-24674
> Project: Spark
>  Issue Type: Question
>  Components: Build, Kubernetes, MLlib
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 SNAPSHOT (as of June 25th)
> Kubernetes version 1.7.5
> Kubernetes cluster, consisting of 4 Nodes with 16 GB RAM, 8 core Intel 
> processors.
>Reporter: Dennis Aumiller
>Priority: Minor
>  Labels: performance
>
>  
> Usually native BLAS libraries speed up the execution time of CPU-heavy 
> operations as for example in MLlib quite significantly.
>  Of course, the initial error
> {code:java}
> WARN  BLAS:61 - Failed to load implementation from: 
> com.github.fommil.netlib.NativeSystemBLAS
> {code}
> is not so easy to resolve, since, as reported 
> [here|https://github.com/apache/spark/pull/19717/files/7d2b30373b2e4d8d5311e10c3f9a62a2d900d568], 
> this seems to be caused by the base image used by the Spark 
> Dockerfile.
>  Re-building spark with
> {code:java}
> -Pnetlib-lgpl
> {code}
> also does not solve the problem, but I managed to build BLAS and LAPACK into 
> Alpine, with a lot of tricks involved.
> Interestingly, I noticed that the performance of PCA in my case dropped quite 
> significantly (with BLAS support, compared to the netlib-java fallback). I am 
> aware of [#SPARK-21305] as well, but that did not help my case, either.
>  Furthermore, calling SVD on a matrix of only size 5000x5000 (density 1%) 
> already throws an error when trying to use native ARPACK, but runs perfectly 
> fine with the fallback version.
> The question would be whether there has been some investigation in that 
> direction already.
>  Or, if not, whether it would be interesting for the Spark community to 
> provide a
>  * more detailed report with respect to timings/configurations/test setup
>  * a provided Dockerfile to build Spark with BLAS/LAPACK/ARPACK using the 
> shipped Dockerfile as a basis
>   
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24678:


Assignee: (was: Apache Spark)

> We should use 'PROCESS_LOCAL' first for Spark-Streaming
> ---
>
> Key: SPARK-24678
> URL: https://issues.apache.org/jira/browse/SPARK-24678
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.3.1
>Reporter: sharkd tu
>Priority: Major
>
> Currently, `BlockRDD.getPreferredLocations` only returns host-level info for 
> blocks, so the subsequent scheduling locality level is never better than 
> 'NODE_LOCAL'. With a small change, the scheduling level can be improved to 
> 'PROCESS_LOCAL'
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24678:


Assignee: Apache Spark

> We should use 'PROCESS_LOCAL' first for Spark-Streaming
> ---
>
> Key: SPARK-24678
> URL: https://issues.apache.org/jira/browse/SPARK-24678
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.3.1
>Reporter: sharkd tu
>Assignee: Apache Spark
>Priority: Major
>
> Currently, `BlockRDD.getPreferredLocations` only returns host-level info for 
> blocks, so the subsequent scheduling locality level is never better than 
> 'NODE_LOCAL'. With a small change, the scheduling level can be improved to 
> 'PROCESS_LOCAL'
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming

2018-06-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526078#comment-16526078
 ] 

Apache Spark commented on SPARK-24678:
--

User 'sharkdtu' has created a pull request for this issue:
https://github.com/apache/spark/pull/21658

> We should use 'PROCESS_LOCAL' first for Spark-Streaming
> ---
>
> Key: SPARK-24678
> URL: https://issues.apache.org/jira/browse/SPARK-24678
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.3.1
>Reporter: sharkd tu
>Priority: Major
>
> Currently, `BlockRDD.getPreferredLocations` only returns host-level info for 
> blocks, so the subsequent scheduling locality level is never better than 
> 'NODE_LOCAL'. With a small change, the scheduling level can be improved to 
> 'PROCESS_LOCAL'
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming

2018-06-28 Thread sharkd tu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sharkd tu updated SPARK-24678:
--
Description: 
Currently, `BlockRDD.getPreferredLocations` only returns host-level info for 
blocks, so the subsequent scheduling locality level is never better than 
'NODE_LOCAL'. With a small change, the scheduling level can be improved to 
'PROCESS_LOCAL'

 

  was:
Currently, the meta-info of blocks received by spark-streaming receivers only 
contains host info, so the subsequent scheduling locality level is never better 
than 'NODE_LOCAL'. With a small change, the scheduling level can be improved to 
'PROCESS_LOCAL'

 


> We should use 'PROCESS_LOCAL' first for Spark-Streaming
> ---
>
> Key: SPARK-24678
> URL: https://issues.apache.org/jira/browse/SPARK-24678
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.3.1
>Reporter: sharkd tu
>Priority: Major
>
> Currently, `BlockRDD.getPreferredLocations` only returns host-level info for 
> blocks, so the subsequent scheduling locality level is never better than 
> 'NODE_LOCAL'. With a small change, the scheduling level can be improved to 
> 'PROCESS_LOCAL'
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24144) monotonically_increasing_id on streaming dataFrames

2018-06-28 Thread Richard Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526075#comment-16526075
 ] 

Richard Yu commented on SPARK-24144:


So do you propose sending the information regarding monotonically_increasing_id 
to checkpoint data storage, from which it could later be retrieved?

> monotonically_increasing_id on streaming dataFrames
> ---
>
> Key: SPARK-24144
> URL: https://issues.apache.org/jira/browse/SPARK-24144
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Hemant Bhanawat
>Priority: Major
>
> For our use case, we want to assign snapshot ids (incrementing counters) to 
> the incoming records. In case of failures, the same record should get the 
> same id after failure so that the downstream DB can handle the records in a 
> correct manner. 
> We were trying to do this by zipping the streaming rdds with that counter 
> using a modified version of ZippedWithIndexRDD. There are other ways to do 
> that but it turns out all ways are cumbersome and error prone in failure 
> scenarios.
> As suggested on the spark user dev list, one way to do this would be to 
> support monotonically_increasing_id on streaming dataFrames in Spark code 
> base. This would ensure that counters are incrementing for the records of the 
> stream. Also, since the counter can be checkpointed, it would work well in 
> case of failure scenarios. Last but not least, doing this in Spark would 
> be the most performance-efficient way.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24672) No job is running but there are active tasks

2018-06-28 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-24672.
--
Resolution: Invalid

> No job is running but there are active tasks
> 
>
> Key: SPARK-24672
> URL: https://issues.apache.org/jira/browse/SPARK-24672
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: hadoop 2.6.0
> spark 2.2.1
> CDH 5.8.3
> java 1.8.0
>  
> More info :
> image1.png in Attachments
>Reporter: Deegue
>Priority: Major
> Attachments: image1.png, image2.png, image3.png
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Issues:
>  
> There are active tasks while no job is running.
> These active tasks occupy the executors (resources), and I don't know why 
> they haven't been killed or stopped after their jobs failed.
>  
> More info :
> image2.png & image3.png in Attachments
>  
> I'd really appreciate it if anyone can help me...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24672) No job is running but there are active tasks

2018-06-28 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526069#comment-16526069
 ] 

Takeshi Yamamuro commented on SPARK-24672:
--

You should first ask in the spark-user mailing list. If you can find some 
concrete conditions to reproduce this, feel free to reopen or file a new jira. 
Thanks.

> No job is running but there are active tasks
> 
>
> Key: SPARK-24672
> URL: https://issues.apache.org/jira/browse/SPARK-24672
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: hadoop 2.6.0
> spark 2.2.1
> CDH 5.8.3
> java 1.8.0
>  
> More info :
> image1.png in Attachments
>Reporter: Deegue
>Priority: Major
> Attachments: image1.png, image2.png, image3.png
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Issues:
>  
> There are active tasks while no job is running.
> These active tasks occupy the executors (resources), and I don't know why 
> they haven't been killed or stopped after their jobs failed.
>  
> More info :
> image2.png & image3.png in Attachments
>  
> I'd really appreciate it if anyone can help me...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24662) Structured Streaming should support LIMIT

2018-06-28 Thread Richard Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526065#comment-16526065
 ] 

Richard Yu commented on SPARK-24662:


Just to be clear on the function of the limit operator, could you explain what 
it does?

> Structured Streaming should support LIMIT
> -
>
> Key: SPARK-24662
> URL: https://issues.apache.org/jira/browse/SPARK-24662
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mukul Murthy
>Priority: Major
>
> Make structured streams support the LIMIT operator. 
> This will undo SPARK-24525 as the limit operator would be a superior solution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming

2018-06-28 Thread sharkd tu (JIRA)
sharkd tu created SPARK-24678:
-

 Summary: We should use 'PROCESS_LOCAL' first for Spark-Streaming
 Key: SPARK-24678
 URL: https://issues.apache.org/jira/browse/SPARK-24678
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Affects Versions: 2.3.1
Reporter: sharkd tu


Currently, the meta-info of blocks received by spark-streaming receivers only 
contains host info, so the subsequent scheduling locality level is never better 
than 'NODE_LOCAL'. With a small change, the scheduling level can be improved to 
'PROCESS_LOCAL'
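
A hedged sketch of the idea (the helper name is illustrative and this is not the actual patch): derive executor-aware task locations from each block's BlockManagerId instead of keeping only the host, so the scheduler can reach 'PROCESS_LOCAL'.
{code:java}
// Note: ExecutorCacheTaskLocation is Spark-internal (private[spark]); shown here only to
// illustrate the executor-level location encoding the scheduler understands.
import org.apache.spark.scheduler.ExecutorCacheTaskLocation
import org.apache.spark.storage.BlockManagerId

def executorPreferredLocations(blockLocations: Seq[BlockManagerId]): Seq[String] =
  blockLocations.map(id => ExecutorCacheTaskLocation(id.host, id.executorId).toString)
{code}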

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24676) Project required data from parsed data when csvColumnPruning disabled

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24676:


Assignee: (was: Apache Spark)

> Project required data from parsed data when csvColumnPruning disabled
> -
>
> Key: SPARK-24676
> URL: https://issues.apache.org/jira/browse/SPARK-24676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> I hit the bug below when parsing CSV data:
> {code}
> ./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false
> scala> val dir = "/tmp/spark-csv/csv"
> scala> spark.range(10).selectExpr("id % 2 AS p", 
> "id").write.mode("overwrite").partitionBy("p").csv(dir)
> scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
> 18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
> java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
> be cast to java.lang.Integer
> at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24676) Project required data from parsed data when csvColumnPruning disabled

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24676:


Assignee: Apache Spark

> Project required data from parsed data when csvColumnPruning disabled
> -
>
> Key: SPARK-24676
> URL: https://issues.apache.org/jira/browse/SPARK-24676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> I hit the bug below when parsing CSV data:
> {code}
> ./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false
> scala> val dir = "/tmp/spark-csv/csv"
> scala> spark.range(10).selectExpr("id % 2 AS p", 
> "id").write.mode("overwrite").partitionBy("p").csv(dir)
> scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
> 18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
> java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
> be cast to java.lang.Integer
> at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24676) Project required data from parsed data when csvColumnPruning disabled

2018-06-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526057#comment-16526057
 ] 

Apache Spark commented on SPARK-24676:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/21657

> Project required data from parsed data when csvColumnPruning disabled
> -
>
> Key: SPARK-24676
> URL: https://issues.apache.org/jira/browse/SPARK-24676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> I hit the bug below when parsing CSV data:
> {code}
> ./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false
> scala> val dir = "/tmp/spark-csv/csv"
> scala> spark.range(10).selectExpr("id % 2 AS p", 
> "id").write.mode("overwrite").partitionBy("p").csv(dir)
> scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
> 18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
> java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
> be cast to java.lang.Integer
> at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24677) MedianHeap is empty when speculation is enabled, causing the SparkContext to stop

2018-06-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526051#comment-16526051
 ] 

Apache Spark commented on SPARK-24677:
--

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/21656
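
For context, a hedged sketch of the kind of guard that avoids the exception (the field names follow TaskSetManager and are illustrative; this is not necessarily what the PR does):
{code:java}
// Inside TaskSetManager.checkSpeculatableTasks: only consult the median once at least one
// task has completed successfully, so MedianHeap.median is never called on an empty heap.
if (!successfulTaskDurations.isEmpty) {
  val medianDuration = successfulTaskDurations.median
  val threshold = math.max(SPECULATION_MULTIPLIER * medianDuration, minTimeToSpeculation)
  // ... existing logic that marks tasks slower than `threshold` as speculatable ...
}
{code}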

> MedianHeap is empty when speculation is enabled, causing the SparkContext to 
> stop
> -
>
> Key: SPARK-24677
> URL: https://issues.apache.org/jira/browse/SPARK-24677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: dzcxzl
>Priority: Critical
>
> The change introduced in SPARK-23433 may cause the SparkContext to stop.
> {code:java}
> ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping 
> SparkContext
> java.util.NoSuchElementException: MedianHeap is empty.
> at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83)
> at 
> org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968)
> at 
> org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94)
> at 
> org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93)
> at 
> org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94)
> at 
> org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24677) MedianHeap is empty when speculation is enabled, causing the SparkContext to stop

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24677:


Assignee: (was: Apache Spark)

> MedianHeap is empty when speculation is enabled, causing the SparkContext to 
> stop
> -
>
> Key: SPARK-24677
> URL: https://issues.apache.org/jira/browse/SPARK-24677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: dzcxzl
>Priority: Critical
>
> The change introduced in SPARK-23433 may cause the SparkContext to stop.
> {code:java}
> ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping 
> SparkContext
> java.util.NoSuchElementException: MedianHeap is empty.
> at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83)
> at 
> org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968)
> at 
> org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94)
> at 
> org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93)
> at 
> org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94)
> at 
> org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24677) MedianHeap is empty when speculation is enabled, causing the SparkContext to stop

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24677:


Assignee: Apache Spark

> MedianHeap is empty when speculation is enabled, causing the SparkContext to 
> stop
> -
>
> Key: SPARK-24677
> URL: https://issues.apache.org/jira/browse/SPARK-24677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: dzcxzl
>Assignee: Apache Spark
>Priority: Critical
>
> The change introduced in SPARK-23433 may cause the SparkContext to stop.
> {code:java}
> ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping 
> SparkContext
> java.util.NoSuchElementException: MedianHeap is empty.
> at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83)
> at 
> org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968)
> at 
> org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94)
> at 
> org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93)
> at 
> org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94)
> at 
> org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24675) Rename table: validate existence of new location

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24675:


Assignee: (was: Apache Spark)

> Rename table: validate existence of new location
> 
>
> Key: SPARK-24675
> URL: https://issues.apache.org/jira/browse/SPARK-24675
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Priority: Major
>
> If a table is renamed to an existing location, the data won't show up.
>  
> scala> Seq("hello").toDF("a").write.format("parquet").saveAsTable("t")
> scala> sql("select * from t").show()
> +-----+
> |    a|
> +-----+
> |hello|
> +-----+
> scala> sql("alter table t rename to test")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from test").show()
> +---+
> |  a|
> +---+
> +---+
>  
> In Hive, if the new location exists, the rename fails even when the location 
> is empty.
> We should have the same validation in the catalog, to guard against unexpected bugs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24675) Rename table: validate existence of new location

2018-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24675:


Assignee: Apache Spark

> Rename table: validate existence of new location
> 
>
> Key: SPARK-24675
> URL: https://issues.apache.org/jira/browse/SPARK-24675
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> If a table is renamed to an existing location, the data won't show up.
>  
> scala> Seq("hello").toDF("a").write.format("parquet").saveAsTable("t")
> scala> sql("select * from t").show()
> +-----+
> |    a|
> +-----+
> |hello|
> +-----+
> scala> sql("alter table t rename to test")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from test").show()
> +---+
> |  a|
> +---+
> +---+
>  
> In Hive, if the new location exists, the rename fails even when the location 
> is empty.
> We should have the same validation in the catalog, to guard against unexpected bugs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24675) Rename table: validate existence of new location

2018-06-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526048#comment-16526048
 ] 

Apache Spark commented on SPARK-24675:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21655

> Rename table: validate existence of new location
> 
>
> Key: SPARK-24675
> URL: https://issues.apache.org/jira/browse/SPARK-24675
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Priority: Major
>
> If a table is renamed to an existing location, the data won't show up.
>  
> scala> Seq("hello").toDF("a").write.format("parquet").saveAsTable("t")
> scala> sql("select * from t").show()
> +-----+
> |    a|
> +-----+
> |hello|
> +-----+
> scala> sql("alter table t rename to test")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from test").show()
> +---+
> |  a|
> +---+
> +---+
>  
> In Hive, if the new location exists, the rename fails even when the location 
> is empty.
> We should have the same validation in the catalog, to guard against unexpected bugs.
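
For illustration, a sketch of the kind of pre-rename existence check the description asks for, written against the Hadoop FileSystem API (a hadoop-common dependency is assumed); the method name and the exception used are assumptions for the sketch, not Spark's actual catalog code or the change in the pull request.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object RenameLocationCheck {
  // Refuse the rename when the prospective new table location already exists,
  // mirroring the Hive behaviour described in the issue.
  def validateNewLocation(newLocation: String, hadoopConf: Configuration): Unit = {
    val path = new Path(newLocation)
    val fs = path.getFileSystem(hadoopConf)
    if (fs.exists(path)) {
      throw new IllegalStateException(
        s"Cannot rename table: target location $newLocation already exists")
    }
  }
}
{code}

Usage would be a call such as validateNewLocation("hdfs:///warehouse/test", hadoopConf) before the metastore rename is issued.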



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24677) MedianHeap is empty when speculation is enabled, causing the SparkContext to stop

2018-06-28 Thread dzcxzl (JIRA)
dzcxzl created SPARK-24677:
--

 Summary: MedianHeap is empty when speculation is enabled, causing 
the SparkContext to stop
 Key: SPARK-24677
 URL: https://issues.apache.org/jira/browse/SPARK-24677
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: dzcxzl


The change introduced in SPARK-23433 may cause the SparkContext to stop.
{code:java}
ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping 
SparkContext
java.util.NoSuchElementException: MedianHeap is empty.
at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83)
at 
org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968)
at 
org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94)
at 
org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93)
at 
org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94)
at 
org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93)
{code}
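
The stack trace points at TaskSetManager.checkSpeculatableTasks asking the heap for a median before any task in the task set has recorded a duration. Below is a minimal, self-contained sketch of the kind of emptiness guard that would avoid the NoSuchElementException; the object and field names are illustrative stand-ins, not Spark's actual internals, and this is not the proposed fix itself.

{code:scala}
// Illustrative sketch only: a simplified stand-in for the speculation check,
// showing a guard that skips the median lookup while no task has finished.
import scala.collection.mutable

object MedianGuardSketch {
  // Durations of successfully finished tasks (stand-in for successfulTaskDurations).
  private val successfulTaskDurations = mutable.ArrayBuffer.empty[Double]

  private def median(xs: Seq[Double]): Double = {
    val sorted = xs.sorted
    sorted(sorted.length / 2)
  }

  // Stand-in for checkSpeculatableTasks: only consult the median when at
  // least one task has completed; otherwise skip this speculation round.
  def shouldSpeculate(speculationMultiplier: Double, runtimeMs: Double): Boolean = {
    if (successfulTaskDurations.isEmpty) {
      false // nothing to compare against yet, and no exception
    } else {
      runtimeMs > speculationMultiplier * median(successfulTaskDurations)
    }
  }

  def main(args: Array[String]): Unit = {
    println(shouldSpeculate(1.5, 1000.0)) // false: no finished tasks, no crash
    successfulTaskDurations ++= Seq(100.0, 200.0, 300.0)
    println(shouldSpeculate(1.5, 1000.0)) // true: 1000 > 1.5 * 200
  }
}
{code}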



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24676) Project required data from parsed data when csvColumnPruning disabled

2018-06-28 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-24676:


 Summary: Project required data from parsed data when 
csvColumnPruning disabled
 Key: SPARK-24676
 URL: https://issues.apache.org/jira/browse/SPARK-24676
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Takeshi Yamamuro


I hit the bug below when parsing CSV data:

{code}
./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false
scala> val dir = "/tmp/spark-csv/csv"
scala> spark.range(10).selectExpr("id % 2 AS p", 
"id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
...
{code}
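
Judging from the title, the direction is that with spark.sql.csv.parser.columnPruning.enabled=false the parser produces rows for the full data schema, so the reader still needs to project out just the columns the query requested rather than handing the full row back as if it matched the required schema. A self-contained sketch of that projection idea in plain Scala follows; the names and row representation are stand-ins, not Spark's internal classes.

{code:scala}
object CsvProjectionSketch {
  // Stand-in for a fully parsed CSV row, in data-schema column order.
  val dataSchema: Seq[String] = Seq("id", "p")
  val parsedRow: Array[Any] = Array(4L, 0)

  // Project only the requested columns, by their position in the full data
  // schema, so the result lines up with the required schema.
  def project(row: Array[Any], dataSchema: Seq[String], required: Seq[String]): Array[Any] =
    required.map(name => row(dataSchema.indexOf(name))).toArray

  def main(args: Array[String]): Unit = {
    println(project(parsedRow, dataSchema, Seq("p")).mkString(",")) // prints: 0
  }
}
{code}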





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


