[jira] [Comment Edited] (SPARK-24261) Spark cannot read renamed managed Hive table

2018-05-13 Thread Suraj Nayak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473662#comment-16473662
 ] 

Suraj Nayak edited comment on SPARK-24261 at 5/14/18 6:50 AM:
--

Also, we need some pointers from the committers on whether there was a specific 
requirement or design reason for making the paths depend on the {{SERDEPROPERTIES}} 
path rather than on the storage descriptor location. That would help us understand 
the impact of this change.


was (Author: snayakm):
Also, need some pointers here from committers so as was there any specific 
requirement or design why the paths are dependent on {{SERDEPROPERTIS}} path 
rather than the Storage descriptor location. Thus helping us to understand the 
impact of this change.

> Spark cannot read renamed managed Hive table
> 
>
> Key: SPARK-24261
> URL: https://issues.apache.org/jira/browse/SPARK-24261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Priority: Major
> Attachments: some_db.some_new_table.ddl, 
> some_db.some_new_table_buggy_path.ddl, some_db.some_table.ddl
>
>
> When Spark creates a Hive table using df.write.saveAsTable, it creates a managed 
> table in Hive with SERDEPROPERTIES like 
> {{WITH SERDEPROPERTIES 
> ('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table') }}
> When an external user renames the Hive table via the Hive CLI or Hue, Hive 
> changes the table name and moves the data to the new location, but it never 
> updates the SERDEPROPERTIES path shown above. 
> *Steps to Reproduce:*
> 1. Save table using spark:
>  {{spark.sql("select * from 
> some_db.some_table").write.saveAsTable("some_db.some_new_table")}}
> 2. In Hive CLI or Hue, run 
> {{alter table some_db.some_new_table rename to 
> some_db.some_new_table_buggy_path}}
> 3. Try to read the buggy table *some_db.some_new_table_buggy_path* in Spark: 
> {{spark.sql("select * from some_db.some_new_table_buggy_path limit 
> 10").collect}}
> Spark prints the following warnings and returns an empty result (Spark fails to 
> read the table while Hive can still read it):
> {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
> stale CacheEntry; failed to fetch item info for: 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from 
> cache}}
>  {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
> stale CacheEntry; failed to fetch item info for: 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing 
> from cache}}
>  {{18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was 
> it deleted very recently?}}
>  {{res2: Array[org.apache.spark.sql.Row] = Array()}}
> The DDLs for each of the tables are attached. 
> This creates an inconsistency: end users will spend endless time hunting for the 
> bug when data exists in both locations, because Spark reads from one location 
> while the Hive process writes new data to the other. 
> I went through similar JIRAs, but those address different issues: SPARK-15635 
> and SPARK-16570 deal with ALTER TABLE issued from Spark itself, whereas in this 
> JIRA an external process renames the table.
>  
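
A possible workaround sketch (an assumption, not a verified fix): after the 
external rename, re-align the stale {{path}} serde property with the table's 
storage descriptor location and refresh Spark's cached metadata. Table and bucket 
names below are the placeholders from this report, and the target path assumes 
Hive moved the data to the conventional warehouse directory for the new table name.

{code:scala}
// Hypothetical workaround sketch for the stale 'path' serde property.
// Names and the target location are placeholders/assumptions from this report.
spark.sql("""
  ALTER TABLE some_db.some_new_table_buggy_path
  SET SERDEPROPERTIES (
    'path'='gs://some-gs-bucket/warehouse/hive/some.db/some_new_table_buggy_path')
""")
spark.sql("REFRESH TABLE some_db.some_new_table_buggy_path")
spark.sql("SELECT * FROM some_db.some_new_table_buggy_path LIMIT 10").show()
{code}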



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-05-13 Thread Izek Greenfield (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473827#comment-16473827
 ] 

Izek Greenfield commented on SPARK-23904:
-

[~RBerenguel] Any update on that? 

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I created a question on 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark builds the text representation of the query plan in every case, even when I 
> don't need it.
> That creates many garbage objects and causes unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  
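
A minimal sketch (not the reporter's gist; it assumes a local SparkSession named 
{{spark}}) of how a long chain of transformations makes the textual plan, and 
hence the strings Spark keeps for the UI and event log, grow regardless of data 
size:

{code:scala}
import org.apache.spark.sql.functions.col

// Each withColumn enlarges the logical plan; the data itself stays tiny.
var df = spark.range(10).toDF("id")
for (i <- 1 to 200) {
  df = df.withColumn(s"c$i", col("id") + i)
}

// The string representation built for the plan can become very large.
println(df.queryExecution.toString.length)
{code}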



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24211) Flaky test: StreamingOuterJoinSuite

2018-05-13 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24211:
--
Description: 
*windowed left outer join*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/

*windowed right outer join*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/334/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/

*left outer join with non-key condition violated*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/337/

  was:
*windowed left outer join*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/

*windowed right outer join*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/334/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/


> Flaky test: StreamingOuterJoinSuite
> ---
>
> Key: SPARK-24211
> URL: https://issues.apache.org/jira/browse/SPARK-24211
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> *windowed left outer join*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/
> *windowed right outer join*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/334/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/
> *left outer join with non-key condition violated*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/337/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24220) java.lang.NullPointerException at org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:83)

2018-05-13 Thread joy-m (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473800#comment-16473800
 ] 

joy-m commented on SPARK-24220:
---

[~kiszk] I used YARN cluster mode to run my application. When I used the piped 
stream, the issue arose. After I switched to a FileInputStream, the issue no 
longer appeared. I do not know why.
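
A minimal sketch of the workaround described above, assuming the change that 
avoided the failure was feeding {{CopyManager.copyIn}} from a temporary file 
written on the task thread instead of from a {{PipedInputStream}} filled by a 
second thread. {{resDf}}, {{tableName}} and {{adminUrl}} are the placeholders from 
the snippet quoted below; this is not the reporter's actual fix.

{code:scala}
import java.io.{File, FileInputStream, FileOutputStream, OutputStreamWriter}
import java.nio.charset.StandardCharsets
import java.sql.DriverManager

import org.postgresql.copy.CopyManager
import org.postgresql.core.BaseConnection

resDf.coalesce(15).foreachPartition { rows =>
  // Spill the partition to a local temp file on the task thread ...
  val tmp = File.createTempFile("copy-partition-", ".txt")
  val writer = new OutputStreamWriter(new FileOutputStream(tmp), StandardCharsets.UTF_8)
  try {
    rows.foreach(row => writer.write(row.mkString("\001") + "\r\n"))
  } finally {
    writer.close()
  }
  // ... then stream the file into COPY without a second producer thread.
  val con = DriverManager.getConnection(adminUrl)
  try {
    val copyCmd = s"COPY ${tableName} FROM STDIN with DELIMITER as '\001' NULL as 'null string'"
    val in = new FileInputStream(tmp)
    try {
      new CopyManager(con.asInstanceOf[BaseConnection]).copyIn(copyCmd, in)
    } finally {
      in.close()
    }
  } finally {
    con.close()
    tmp.delete()
  }
}
{code}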

> java.lang.NullPointerException at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:83)
> 
>
> Key: SPARK-24220
> URL: https://issues.apache.org/jira/browse/SPARK-24220
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: joy-m
>Priority: Major
>
> def getInputStream(rows: Iterator[Row]): PipedInputStream = {
>   printMem("before gen string")
>   val pipedOutputStream = new PipedOutputStream()
>   // Feed the rows into the pipe from a separate thread.
>   (new Thread() {
>     override def run() {
>       if (rows == null) {
>         logError("rows is null==>")
>       } else {
>         println(s"record-start-${rows.length}")
>         try {
>           while (rows.hasNext) {
>             val row = rows.next()
>             println(row)
>             val str = row.mkString("\001") + "\r\n"
>             println(str)
>             pipedOutputStream.write(str.getBytes(StandardCharsets.UTF_8))
>           }
>           println("record-end-")
>           pipedOutputStream.close()
>         } catch {
>           case ex: Exception =>
>             ex.printStackTrace()
>         }
>       }
>     }
>   }).start()
>   println("pipedInPutStream--")
>   val pipedInPutStream = new PipedInputStream()
>   pipedInPutStream.connect(pipedOutputStream)
>   println("pipedInPutStream--- conn---")
>   printMem("after gen string")
>   pipedInPutStream
> }
> 
> // Stream each partition into PostgreSQL COPY via the piped stream above.
> resDf.coalesce(15).foreachPartition(rows => {
>   if (rows == null) {
>     logError("rows is null=>")
>   } else {
>     val copyCmd = s"COPY ${tableName} FROM STDIN with DELIMITER as '\001' NULL as 'null string'"
>     var con: Connection = null
>     try {
>       con = DriverManager.getConnection(adminUrl)
>       val copyManager = new CopyManager(con.asInstanceOf[BaseConnection])
>       val start = System.currentTimeMillis()
>       var count: Long = 0
>       var copyCount: Long = 0
>       println("before copyManager=>")
>       copyCount += copyManager.copyIn(copyCmd, getInputStream(rows))
>       println("after copyManager=>")
>       val finish = System.currentTimeMillis()
>       println("copyCount:" + copyCount + " count:" + count +
>         " time(s):" + (finish - start) / 1000)
>       con.close()
>     } catch {
>       case ex: Exception =>
>         ex.printStackTrace()
>         println(s"copyIn error!${ex.toString}")
>     } finally {
>       try {
>         if (con != null) {
>           con.close()
>         }
>       } catch {
>         case ex: SQLException =>
>           ex.printStackTrace()
>           println(s"copyIn error!${ex.toString}")
>       }
>     }
>   }
> })
>  
> 18/05/09 13:31:30 ERROR util.SparkUncaughtExceptionHandler: Uncaught 
> exception in thread Thread[Thread-4,5,main]
> java.lang.NullPointerException
>  at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:83)
>  at org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:87)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.init(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:392)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:389)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
>  at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:10

[jira] [Comment Edited] (SPARK-24130) Data Source V2: Join Push Down

2018-05-13 Thread Jia Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473765#comment-16473765
 ] 

Jia Li edited comment on SPARK-24130 at 5/14/18 4:48 AM:
-

Thank you for letting me know the permission issue, [~rdblue]. I have fixed it. 
Please feel free to let me know of any further issue or comment. 


was (Author: jliwork):
Thank you for letting me know the permission issue, Ryan Blue. I have fixed it. 
Please feel free to let me know of any further issue or comment. 

> Data Source V2: Join Push Down
> --
>
> Key: SPARK-24130
> URL: https://issues.apache.org/jira/browse/SPARK-24130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jia Li
>Priority: Major
>
> Spark applications often directly query external data sources such as 
> relational databases or files. Spark provides Data Sources APIs for 
> accessing structured data through Spark SQL. The Data Sources APIs in both V1 
> and V2 support optimizations such as filter push down and column pruning, which 
> are a subset of the functionality that can be pushed down to some data sources. 
> We’re proposing to extend Data Sources APIs with join push down (JPD). Join 
> push down significantly improves query performance by reducing the amount of 
> data transfer and exploiting the capabilities of the data sources such as 
> index access.
> Join push down design document is available 
> [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing].
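
To make the intent concrete, a hypothetical illustration (connection details and 
table names are made up; this is not taken from the design document): today Spark 
scans the two JDBC tables separately and performs the join itself, whereas with 
join push down a capable source could be asked to evaluate the join and return 
only its result.

{code:scala}
// Today: two separate scans against the database, join executed in Spark.
val orders = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://db:5432/shop")
  .option("dbtable", "orders")
  .load()
val customers = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://db:5432/shop")
  .option("dbtable", "customers")
  .load()
val joined = orders.join(customers, "customer_id")

// With join push down, the source could instead receive the equivalent of
//   SELECT ... FROM orders o JOIN customers c ON o.customer_id = c.customer_id
// and only the join result would be transferred to Spark.
{code}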



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24130) Data Source V2: Join Push Down

2018-05-13 Thread Jia Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473765#comment-16473765
 ] 

Jia Li commented on SPARK-24130:


Thank you for letting me know the permission issue, Ryan Blue. I have fixed it. 
Please feel free to let me know of any further issue or comment. 

> Data Source V2: Join Push Down
> --
>
> Key: SPARK-24130
> URL: https://issues.apache.org/jira/browse/SPARK-24130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jia Li
>Priority: Major
>
> Spark applications often directly query external data sources such as 
> relational databases or files. Spark provides Data Sources APIs for 
> accessing structured data through Spark SQL. The Data Sources APIs in both V1 
> and V2 support optimizations such as filter push down and column pruning, which 
> are a subset of the functionality that can be pushed down to some data sources. 
> We’re proposing to extend Data Sources APIs with join push down (JPD). Join 
> push down significantly improves query performance by reducing the amount of 
> data transfer and exploiting the capabilities of the data sources such as 
> index access.
> Join push down design document is available 
> [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24228) Fix the lint error

2018-05-13 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24228.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Fixed in https://github.com/apache/spark/pull/21301

> Fix the lint error
> --
>
> Key: SPARK-24228
> URL: https://issues.apache.org/jira/browse/SPARK-24228
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Minor
> Fix For: 2.4.0
>
>
> [ERROR] 
> src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[21,8]
>  (imports) UnusedImports: Unused import - java.io.ByteArrayInputStream.
> [ERROR] 
> src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java:[29,8]
>  (imports) UnusedImports: Unused import - org.apache.spark.unsafe.Platform.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473719#comment-16473719
 ] 

Apache Spark commented on SPARK-24232:
--

User 'skonto' has created a pull request for this issue:
https://github.com/apache/spark/pull/21317

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to Kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secrets without leaking them in 
> the code, and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates, etc. while 
> talking to other services (JDBC passwords, storage keys, etc.).
> So, at deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified, 
> which will make [EnvName].[key] available as an environment variable, and in 
> the code it is always referred to as the environment variable [key].
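
A hedged sketch of how this might look to a user. The property is supplied at 
submission time; the names below are made up, and the "secret-name:key" value 
format is an assumption based on the pull request linked to this issue.

{code:scala}
// Supplied at submission time (assumed syntax), e.g.:
//   --conf spark.kubernetes.driver.secretKeyRef.DB_PASSWORD=db-creds:password
//
// Inside the driver, the secret is then read like any other environment variable,
// without ever appearing in the application code or image.
val dbPassword: Option[String] = sys.env.get("DB_PASSWORD")
dbPassword.foreach(pw => println(s"received a password of length ${pw.length}"))
{code}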



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24232:


Assignee: (was: Apache Spark)

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to Kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secrets without leaking them in 
> the code, and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates, etc. while 
> talking to other services (JDBC passwords, storage keys, etc.).
> So, at deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified, 
> which will make [EnvName].[key] available as an environment variable, and in 
> the code it is always referred to as the environment variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24232:


Assignee: Apache Spark

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Assignee: Apache Spark
>Priority: Major
>
> Allow referring to Kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secrets without leaking them in 
> the code, and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates, etc. while 
> talking to other services (JDBC passwords, storage keys, etc.).
> So, at deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified, 
> which will make [EnvName].[key] available as an environment variable, and in 
> the code it is always referred to as the environment variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24266) Spark client terminates while driver is still running

2018-05-13 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473717#comment-16473717
 ] 

Chun Chen commented on SPARK-24266:
---

cc [~anirudh]

> Spark client terminates while driver is still running
> -
>
> Key: SPARK-24266
> URL: https://issues.apache.org/jira/browse/SPARK-24266
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Chun Chen
>Priority: Major
>
> {code}
> Warning: Ignoring non-spark config property: Default=system properties 
> included when running spark-submit.
> 18/05/11 14:50:12 WARN Config: Error reading service account token from: 
> [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
> 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
> Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
> 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: N/A
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: 2018-05-11T06:50:17Z
>container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>phase: Pending
>status: [ContainerStatus(containerID=null, 
> image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
> finish...
> 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>sta

[jira] [Updated] (SPARK-24266) Spark client terminates while driver is still running

2018-05-13 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated SPARK-24266:
--
Description: 
{code}
Warning: Ignoring non-spark config property: Default=system properties included 
when running spark-submit.
18/05/11 14:50:12 WARN Config: Error reading service account token from: 
[/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
Mounting Hadoop specific files
18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: N/A
 start time: N/A
 container images: N/A
 phase: Pending
 status: []
18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: tbds-100-98-45-69
 start time: N/A
 container images: N/A
 phase: Pending
 status: []
18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: tbds-100-98-45-69
 start time: 2018-05-11T06:50:17Z
 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
 phase: Pending
 status: [ContainerStatus(containerID=null, 
image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
lastState=ContainerState(running=null, terminated=null, waiting=null, 
additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
restartCount=0, state=ContainerState(running=null, terminated=null, 
waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
additionalProperties={}), additionalProperties={}), additionalProperties={})]
18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
finish...
18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: tbds-100-98-45-69
 start time: 2018-05-11T06:50:17Z
 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
 phase: Pending
 status: [ContainerStatus(containerID=null, 
image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
lastState=ContainerState(running=null, terminated=null, waiting=null, 
additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
restartCount=0, state=ContainerState(running=null, terminated=null, 
waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
add

[jira] [Updated] (SPARK-24266) Spark client terminates while driver is still running

2018-05-13 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated SPARK-24266:
--
Description: 

{code}
Warning: Ignoring non-spark config property: Default=system properties included 
when running spark-submit.
18/05/11 14:50:12 WARN Config: Error reading service account token from: 
[/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
Mounting Hadoop specific files
18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: N/A
 start time: N/A
 container images: N/A
 phase: Pending
 status: []
18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: tbds-100-98-45-69
 start time: N/A
 container images: N/A
 phase: Pending
 status: []
18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: tbds-100-98-45-69
 start time: 2018-05-11T06:50:17Z
 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
 phase: Pending
 status: [ContainerStatus(containerID=null, 
image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
lastState=ContainerState(running=null, terminated=null, waiting=null, 
additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
restartCount=0, state=ContainerState(running=null, terminated=null, 
waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
additionalProperties={}), additionalProperties={}), additionalProperties={})]
18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
finish...
18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: tbds-100-98-45-69
 start time: 2018-05-11T06:50:17Z
 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
 phase: Pending
 status: [ContainerStatus(containerID=null, 
image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
lastState=ContainerState(running=null, terminated=null, waiting=null, 
additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
restartCount=0, state=ContainerState(running=null, terminated=null, 
waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
ad

[jira] [Updated] (SPARK-24266) Spark client terminates while driver is still running

2018-05-13 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated SPARK-24266:
--
Description: 
{code}
Warning: Ignoring non-spark config property: Default=system properties included 
when running spark-submit.
18/05/11 14:50:12 WARN Config: Error reading service account token from: 
[/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
Mounting Hadoop specific files
18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: N/A
 start time: N/A
 container images: N/A
 phase: Pending
 status: []
18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: tbds-100-98-45-69
 start time: N/A
 container images: N/A
 phase: Pending
 status: []
18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: tbds-100-98-45-69
 start time: 2018-05-11T06:50:17Z
 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
 phase: Pending
 status: [ContainerStatus(containerID=null, 
image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
lastState=ContainerState(running=null, terminated=null, waiting=null, 
additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
restartCount=0, state=ContainerState(running=null, terminated=null, 
waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
additionalProperties={}), additionalProperties={}), additionalProperties={})]
18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
finish...
18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: tbds-100-98-45-69
 start time: 2018-05-11T06:50:17Z
 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
 phase: Pending
 status: [ContainerStatus(containerID=null, 
image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
lastState=ContainerState(running=null, terminated=null, waiting=null, 
additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
restartCount=0, state=ContainerState(running=null, terminated=null, 
waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
add

[jira] [Updated] (SPARK-24266) Spark client terminates while driver is still running

2018-05-13 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated SPARK-24266:
--
Description: 



{code}
Warning: Ignoring non-spark config property: Default=system properties included 
when running spark-submit.
18/05/11 14:50:12 WARN Config: Error reading service account token from: 
[/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
Mounting Hadoop specific files
18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: N/A
 start time: N/A
 container images: N/A
 phase: Pending
 status: []
18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: tbds-100-98-45-69
 start time: N/A
 container images: N/A
 phase: Pending
 status: []
18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: tbds-100-98-45-69
 start time: 2018-05-11T06:50:17Z
 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
 phase: Pending
 status: [ContainerStatus(containerID=null, 
image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
lastState=ContainerState(running=null, terminated=null, waiting=null, 
additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
restartCount=0, state=ContainerState(running=null, terminated=null, 
waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
additionalProperties={}), additionalProperties={}), additionalProperties={})]
18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
finish...
18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: tbds-100-98-45-69
 start time: 2018-05-11T06:50:17Z
 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
 phase: Pending
 status: [ContainerStatus(containerID=null, 
image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
lastState=ContainerState(running=null, terminated=null, waiting=null, 
additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
restartCount=0, state=ContainerState(running=null, terminated=null, 
waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 

[jira] [Created] (SPARK-24266) Spark client terminates while driver is still running

2018-05-13 Thread Chun Chen (JIRA)
Chun Chen created SPARK-24266:
-

 Summary: Spark client terminates while driver is still running
 Key: SPARK-24266
 URL: https://issues.apache.org/jira/browse/SPARK-24266
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Chun Chen


```
Warning: Ignoring non-spark config property: Default=system properties included 
when running spark-submit.
18/05/11 14:50:12 WARN Config: Error reading service account token from: 
[/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
Mounting Hadoop specific files
18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: N/A
 start time: N/A
 container images: N/A
 phase: Pending
 status: []
18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: tbds-100-98-45-69
 start time: N/A
 container images: N/A
 phase: Pending
 status: []
18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: tbds-100-98-45-69
 start time: 2018-05-11T06:50:17Z
 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
 phase: Pending
 status: [ContainerStatus(containerID=null, 
image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
lastState=ContainerState(running=null, terminated=null, waiting=null, 
additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
restartCount=0, state=ContainerState(running=null, terminated=null, 
waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
additionalProperties={}), additionalProperties={}), additionalProperties={})]
18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
finish...
18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
 pod name: spark-64-293-980-1526021412180-driver
 namespace: tione-603074457
 labels: network -> FLOATINGIP, spark-app-selector -> 
spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
 creation time: 2018-05-11T06:50:17Z
 service account name: default
 volumes: spark-local-dir-0-spark-local, spark-init-properties, 
download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
default-token-xvjt9
 node name: tbds-100-98-45-69
 start time: 2018-05-11T06:50:17Z
 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
 phase: Pending
 status: [ContainerStatus(containerID=null, 
image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
lastState=ContainerState(running=null, terminated=null, waiting=null, 
additionalProperties={}), 

[jira] [Commented] (SPARK-24262) Fix typo in UDF error message

2018-05-13 Thread Holden Karau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473692#comment-16473692
 ] 

Holden Karau commented on SPARK-24262:
--

Just a JIRA issue kept me from closing it; I was going to try again when I land.



> Fix typo in UDF error message
> -
>
> Key: SPARK-24262
> URL: https://issues.apache.org/jira/browse/SPARK-24262
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: holdenk
>Assignee: Kelley Robinson
>Priority: Trivial
> Fix For: 2.3.1, 2.4.0
>
>
> Fix the spelling of functon to function



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2018-05-13 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-17916:


Assignee: Maxim Gekk

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> When a user configures {{nullValue}} in the CSV data source, then in addition to 
> those values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.
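
A self-contained sketch of the reported behaviour (the temp-file path and the 
column names are illustrative):

{code:scala}
import java.nio.charset.StandardCharsets
import java.nio.file.Files

val csv = "col1,col2\n1,\"-\"\n2,\"\"\n"
val dir = Files.createTempDirectory("csv-null")
val path = dir.resolve("data.csv")
Files.write(path, csv.getBytes(StandardCharsets.UTF_8))

val df = spark.read
  .option("header", "true")
  .option("nullValue", "-")
  .csv(path.toString)

// Before the fix, col2 is null in both rows: the "-" row as configured, but also
// the empty-string row, which the nullValue option did not ask for.
df.show()
{code}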



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2018-05-13 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-17916.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21273
[https://github.com/apache/spark/pull/21273]

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> When a user configures {{nullValue}} in the CSV data source, then in addition to 
> those values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24260) Support for multi-statement SQL in SparkSession.sql API

2018-05-13 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-24260:
-
Component/s: (was: Spark Core)
 SQL

> Support for multi-statement SQL in SparkSession.sql API
> ---
>
> Key: SPARK-24260
> URL: https://issues.apache.org/jira/browse/SPARK-24260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ravindra Nath Kakarla
>Priority: Minor
>
> The sparkSession.sql API only supports executing a single SQL statement per 
> call; a multi-statement SQL script cannot be executed in a single call. For 
> example,
> {code:java}
> SparkSession sparkSession = SparkSession.builder().appName("MultiStatementSQL")
>     .master("local").config("", "").getOrCreate()
> sparkSession.sql("DROP TABLE IF EXISTS count_employees; CACHE TABLE 
> employees; CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as cnt 
> FROM employees; SELECT * FROM count_employees") 
> {code}
> The above code fails with the error: 
> {code:java}
> org.apache.spark.sql.catalyst.parser.ParseException: mismatched input ';' 
> expecting {code}
> The workaround is to call the .sql API multiple times, in the right 
> order.
> {code:java}
> sparkSession.sql("DROP TABLE IF EXISTS count_employees")
> sparkSession.sql("CACHE TABLE employees")
> sparkSession.sql("CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as 
> cnt FROM employees;")
> sparkSession.sql("SELECT * FROM count_employees")
> {code}
> If these SQL statements come from a string or a file, users have to implement 
> their own parsing to execute them. For example,
> {code:java}
> val sqlFromFile = """DROP TABLE IF EXISTS count_employees;
>  |CACHE TABLE employees;
>  |CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as cnt FROM 
> employees; SELECT * FROM count_employees""".stripMargin{code}
> {code:java}
> sqlFromFile.split(";").foreach(line => sparkSession.sql(line))
> {code}
> This naive parser fails for many edge cases (like a ";" inside a string literal). 
> Even if users reuse the grammar Spark uses and implement their own parsing, it 
> can go out of sync with the way Spark parses the statements.
> Can support for multiple SQL statements be built into SparkSession.sql API 
> itself?
>  
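
A minimal helper along the lines sketched above, with the same caveat the reporter 
raises (a ";" inside a string literal breaks it, because it does not use Spark's 
own SQL grammar); {{spark}} is an existing SparkSession:

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Naive multi-statement runner: splits on ';', executes each statement in order
// and returns the result of the last one.
def runStatements(spark: SparkSession, script: String): Option[DataFrame] = {
  script.split(";")
    .map(_.trim)
    .filter(_.nonEmpty)
    .foldLeft(Option.empty[DataFrame]) { (_, stmt) => Some(spark.sql(stmt)) }
}

// Usage with the script from the example above:
// runStatements(spark, sqlFromFile).foreach(_.show())
{code}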



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24262) Fix typo in UDF error message

2018-05-13 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473687#comment-16473687
 ] 

Takeshi Yamamuro commented on SPARK-24262:
--

cc: [~holdenk] Probably, you forgot to close this?

> Fix typo in UDF error message
> -
>
> Key: SPARK-24262
> URL: https://issues.apache.org/jira/browse/SPARK-24262
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: holdenk
>Assignee: Kelley Robinson
>Priority: Trivial
> Fix For: 2.3.1, 2.4.0
>
>
> Fix the spelling of functon to function



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24186) add array reverse and concat

2018-05-13 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24186.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21307
[https://github.com/apache/spark/pull/21307]

> add array reverse and concat 
> -
>
> Key: SPARK-24186
> URL: https://issues.apache.org/jira/browse/SPARK-24186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 2.4.0
>
>
> Add R versions of https://issues.apache.org/jira/browse/SPARK-23736 and 
> https://issues.apache.org/jira/browse/SPARK-23926
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24186) add array reverse and concat

2018-05-13 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24186:


Assignee: Huaxin Gao

> add array reverse and concat 
> -
>
> Key: SPARK-24186
> URL: https://issues.apache.org/jira/browse/SPARK-24186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Add R versions of https://issues.apache.org/jira/browse/SPARK-23736 and 
> https://issues.apache.org/jira/browse/SPARK-23926
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24261) Spark cannot read renamed managed Hive table

2018-05-13 Thread Suraj Nayak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473662#comment-16473662
 ] 

Suraj Nayak commented on SPARK-24261:
-

Also, we need some pointers here from committers on whether there was any
specific requirement or design reason for making the paths depend on the
{{SERDEPROPERTIES}} path rather than the storage descriptor location. That
would help us understand the impact of this change.
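
One way to see the mismatch directly is to compare the catalog location with
the storage properties that saveAsTable wrote (a hedged diagnostic sketch using
the table names from this report; the exact output layout varies by version):
{code:scala}
// Per the report: the Location row reflects the directory Hive moved the data
// to, while the Storage/Serde Properties row still carries the old 'path'
// written by saveAsTable.
spark.sql("DESCRIBE FORMATTED some_db.some_new_table_buggy_path")
  .show(100, truncate = false)
{code}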

> Spark cannot read renamed managed Hive table
> 
>
> Key: SPARK-24261
> URL: https://issues.apache.org/jira/browse/SPARK-24261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Priority: Major
> Attachments: some_db.some_new_table.ddl, 
> some_db.some_new_table_buggy_path.ddl, some_db.some_table.ddl
>
>
> When spark creates hive table using df.write.saveAsTable, it creates managed 
> table in hive with SERDEPROPERTIES like 
> {{WITH SERDEPROPERTIES 
> ('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table') }}
> When any external user changes hive table name via Hive CLI or Hue, Hive 
> makes sure the table name is changed and also the path is changed to new 
> location. But it never updates the serdeproperties mentioned above. 
> *Steps to Reproduce:*
> 1. Save table using spark:
>  {{spark.sql("select * from 
> some_db.some_table").write.saveAsTable("some_db.some_new_table")}}
> 2. In Hive CLI or Hue, run 
> {{alter table some_db.some_new_table rename to 
> some_db.some_new_table_buggy_path}}
> 3. Try to ready the buggy table *some_db.some_new_table_buggy_path* in spark 
> {{spark.sql("select * from some_db.some_new_table_buggy_path limit 
> 10").collect}}
> Spark throws following warning (Spark fails to read while hive can read this 
> table):
> {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
> stale CacheEntry; failed to fetch item info for: 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from 
> cache}}
>  {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
> stale CacheEntry; failed to fetch item info for: 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing 
> from cache}}
>  {{18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was 
> it deleted very recently?}}
>  {{res2: Array[org.apache.spark.sql.Row] = Array()}}
> The DDLs for each of the tables are attached. 
> This will create inconsistency and endusers will spend endless time in 
> finding bug if data exists in both location, but spark reads it from 
> different location while hive process writes the new data in new location. 
> I went through similar JIRAs, but those address different issues.
> SPARK-15635 and SPARK-16570 address alter table in spark, unlike this jira, 
> while other external process renames the table.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24265) lintr checks not failing PR build

2018-05-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473651#comment-16473651
 ] 

Felix Cheung commented on SPARK-24265:
--

example: 
https://github.com/apache/spark/pull/21315/files#diff-5277c0f5b53da38579f8c0d5c63fba3eR66

> lintr checks not failing PR build
> -
>
> Key: SPARK-24265
> URL: https://issues.apache.org/jira/browse/SPARK-24265
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Felix Cheung
>Priority: Major
>
> a few lintr violations went through recently, need to check why they are not 
> flagged by Jenkins build



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20538) Dataset.reduce operator should use withNewExecutionId (as foreach or foreachPartition)

2018-05-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20538:


Assignee: (was: Apache Spark)

> Dataset.reduce operator should use withNewExecutionId (as foreach or 
> foreachPartition)
> --
>
> Key: SPARK-20538
> URL: https://issues.apache.org/jira/browse/SPARK-20538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> {{Dataset.reduce}} is not tracked using {{executionId}} so it's not displayed 
> in SQL tab (like {{foreach}} or {{foreachPartition}}).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20538) Dataset.reduce operator should use withNewExecutionId (as foreach or foreachPartition)

2018-05-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20538:


Assignee: Apache Spark

> Dataset.reduce operator should use withNewExecutionId (as foreach or 
> foreachPartition)
> --
>
> Key: SPARK-20538
> URL: https://issues.apache.org/jira/browse/SPARK-20538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Trivial
>
> {{Dataset.reduce}} is not tracked using {{executionId}} so it's not displayed 
> in SQL tab (like {{foreach}} or {{foreachPartition}}).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20538) Dataset.reduce operator should use withNewExecutionId (as foreach or foreachPartition)

2018-05-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473650#comment-16473650
 ] 

Apache Spark commented on SPARK-20538:
--

User 'sohama4' has created a pull request for this issue:
https://github.com/apache/spark/pull/21316

> Dataset.reduce operator should use withNewExecutionId (as foreach or 
> foreachPartition)
> --
>
> Key: SPARK-20538
> URL: https://issues.apache.org/jira/browse/SPARK-20538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> {{Dataset.reduce}} is not tracked using {{executionId}} so it's not displayed 
> in SQL tab (like {{foreach}} or {{foreachPartition}}).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23780:
-
Affects Version/s: 2.3.0

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Ivan Dzikovsky
>Priority: Major
>  Labels: regression
>
> I've tried to use googleVis library with Spark 2.2.1, and faced with problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Than I got following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But expected result is to get some HTML code output, as it was with Spark 
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23780:
-
Labels: regression  (was: regresion)

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Ivan Dzikovsky
>Priority: Major
>  Labels: regression
>
> I've tried to use googleVis library with Spark 2.2.1, and faced with problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Than I got following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But expected result is to get some HTML code output, as it was with Spark 
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23780:
-
Labels: regresion  (was: )

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Ivan Dzikovsky
>Priority: Major
>  Labels: regression
>
> I've tried to use googleVis library with Spark 2.2.1, and faced with problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Than I got following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But expected result is to get some HTML code output, as it was with Spark 
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23780:


Assignee: Apache Spark

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Assignee: Apache Spark
>Priority: Major
>
> I've tried to use googleVis library with Spark 2.2.1, and faced with problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Than I got following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But expected result is to get some HTML code output, as it was with Spark 
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24261) Spark cannot read renamed managed Hive table

2018-05-13 Thread Chitral Verma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473640#comment-16473640
 ] 

Chitral Verma commented on SPARK-24261:
---

I can see that in {{DataSource.getOrInferFileFormatSchema(...)}} there is an
explicit use of the {{path}} config, which seems to come from the SerDe props
that were added when Spark created the table. Need to check why it doesn't take
the path from the location clause.
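
If the rename has already happened, one possible workaround (a hedged sketch;
the new path below is illustrative and taken from the locations in this report)
is to repoint the stale SerDe property at the directory Hive moved the data to,
then refresh the cached relation:
{code:scala}
// Depending on the Spark version, this ALTER may be rejected for tables created
// through the datasource API; in that case issue the same statement from the
// Hive CLI and only run the refresh here.
spark.sql("""
  ALTER TABLE some_db.some_new_table_buggy_path SET SERDEPROPERTIES (
    'path'='gs://some-gs-bucket/warehouse/hive/some.db/some_new_table_buggy_path'
  )
""")
spark.catalog.refreshTable("some_db.some_new_table_buggy_path")
{code}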

> Spark cannot read renamed managed Hive table
> 
>
> Key: SPARK-24261
> URL: https://issues.apache.org/jira/browse/SPARK-24261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Priority: Major
> Attachments: some_db.some_new_table.ddl, 
> some_db.some_new_table_buggy_path.ddl, some_db.some_table.ddl
>
>
> When spark creates hive table using df.write.saveAsTable, it creates managed 
> table in hive with SERDEPROPERTIES like 
> {{WITH SERDEPROPERTIES 
> ('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table') }}
> When any external user changes hive table name via Hive CLI or Hue, Hive 
> makes sure the table name is changed and also the path is changed to new 
> location. But it never updates the serdeproperties mentioned above. 
> *Steps to Reproduce:*
> 1. Save table using spark:
>  {{spark.sql("select * from 
> some_db.some_table").write.saveAsTable("some_db.some_new_table")}}
> 2. In Hive CLI or Hue, run 
> {{alter table some_db.some_new_table rename to 
> some_db.some_new_table_buggy_path}}
> 3. Try to ready the buggy table *some_db.some_new_table_buggy_path* in spark 
> {{spark.sql("select * from some_db.some_new_table_buggy_path limit 
> 10").collect}}
> Spark throws following warning (Spark fails to read while hive can read this 
> table):
> {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
> stale CacheEntry; failed to fetch item info for: 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from 
> cache}}
>  {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
> stale CacheEntry; failed to fetch item info for: 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing 
> from cache}}
>  {{18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was 
> it deleted very recently?}}
>  {{res2: Array[org.apache.spark.sql.Row] = Array()}}
> The DDLs for each of the tables are attached. 
> This will create inconsistency and endusers will spend endless time in 
> finding bug if data exists in both location, but spark reads it from 
> different location while hive process writes the new data in new location. 
> I went through similar JIRAs, but those address different issues.
> SPARK-15635 and SPARK-16570 address alter table in spark, unlike this jira, 
> while other external process renames the table.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23780:


Assignee: (was: Apache Spark)

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use googleVis library with Spark 2.2.1, and faced with problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Than I got following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But expected result is to get some HTML code output, as it was with Spark 
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473641#comment-16473641
 ] 

Apache Spark commented on SPARK-23780:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/21315

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use googleVis library with Spark 2.2.1, and faced with problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Than I got following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But expected result is to get some HTML code output, as it was with Spark 
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24265) lintr checks not failing PR build

2018-05-13 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-24265:


 Summary: lintr checks not failing PR build
 Key: SPARK-24265
 URL: https://issues.apache.org/jira/browse/SPARK-24265
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.0, 2.3.1
Reporter: Felix Cheung


A few lintr violations went through recently; need to check why they are not
flagged by the Jenkins build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24263) SparkR java check breaks on openjdk

2018-05-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24263:
-
Description: 
testing with openjdk, noticed that it breaks because the version string is 
different

 

instead of

"java version"

it has

"openjdk version \"1.8.0_91\""

  was:
testing with openjdk, noticed that it breaks because the open string is 
different

 

instead of "java version" it has

"openjdk version \"1.8.0_91\""


> SparkR java check breaks on openjdk
> ---
>
> Key: SPARK-24263
> URL: https://issues.apache.org/jira/browse/SPARK-24263
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Felix Cheung
>Priority: Blocker
>
> testing with openjdk, noticed that it breaks because the version string is 
> different
>  
> instead of
> "java version"
> it has
> "openjdk version \"1.8.0_91\""



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24263) SparkR java check breaks on openjdk

2018-05-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24263:
-
Description: 
testing with openjdk, noticed that it breaks because the version string is 
different

 

instead of

"java version \"1.8.0\"""

it has

"openjdk version \"1.8.0_91\""

  was:
testing with openjdk, noticed that it breaks because the version string is 
different

 

instead of

"java version"

it has

"openjdk version \"1.8.0_91\""


> SparkR java check breaks on openjdk
> ---
>
> Key: SPARK-24263
> URL: https://issues.apache.org/jira/browse/SPARK-24263
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Felix Cheung
>Priority: Blocker
>
> testing with openjdk, noticed that it breaks because the version string is 
> different
>  
> instead of
> "java version \"1.8.0\"""
> it has
> "openjdk version \"1.8.0_91\""



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24264) [Structured Streaming] Remove 'mergeSchema' option from Parquet source configuration

2018-05-13 Thread Gerard Maas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerard Maas updated SPARK-24264:

Issue Type: Bug  (was: Improvement)

> [Structured Streaming] Remove 'mergeSchema' option from Parquet source 
> configuration
> 
>
> Key: SPARK-24264
> URL: https://issues.apache.org/jira/browse/SPARK-24264
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Gerard Maas
>Priority: Major
>  Labels: features, usability
>
> Looking into the Parquet format support for the File source in Structured 
> Streaming, the docs mention the use of the option 'mergeSchema' to merge the 
> schemas of the part files found.[1]
>  
> There seems to be no practical use of that configuration in a streaming 
> context.
>  
> In its batch counterpart, `mergeSchemas` would infer the schema superset of 
> the part-files found. 
>  
>  When using the File source + parquet format in streaming mode, we must 
> provide a schema to the readStream.schema(...) builder and that schema is 
> fixed for the duration of the stream.
>  
> My current understanding is that:
>  
> - Files containing a subset of the fields declared in the schema will render 
> null values for the non-existing fields.
> - For files containing a superset of the fields, the additional data fields 
> will be lost. 
> - Files not matching the schema set on the streaming source will render all 
> fields null for each record in the file.
>  
> It looks like 'mergeSchema' has no practical effect, although enabling it 
> might lead to additional processing to actually merge the Parquet schema of 
> the input files. 
>  
> I inquired on the dev+user mailing lists about any other behavior but I got 
> no responses.
>  
> From the user perspective, they may think that this option would help their 
> job cope with schema evolution at runtime, but that is also not the case. 
>  
> Looks like removing this option and leaving the value always set to false is 
> the reasonable thing to do.  
>  
> [1] 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala#L376]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24263) SparkR java check breaks on openjdk

2018-05-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473632#comment-16473632
 ] 

Apache Spark commented on SPARK-24263:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/21314

> SparkR java check breaks on openjdk
> ---
>
> Key: SPARK-24263
> URL: https://issues.apache.org/jira/browse/SPARK-24263
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Felix Cheung
>Priority: Major
>
> testing with openjdk, noticed that it breaks because the open string is 
> different
>  
> instead of "java version" it has
> "openjdk version \"1.8.0_91\""



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24263) SparkR java check breaks on openjdk

2018-05-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24263:


Assignee: (was: Apache Spark)

> SparkR java check breaks on openjdk
> ---
>
> Key: SPARK-24263
> URL: https://issues.apache.org/jira/browse/SPARK-24263
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Felix Cheung
>Priority: Major
>
> testing with openjdk, noticed that it breaks because the open string is 
> different
>  
> instead of "java version" it has
> "openjdk version \"1.8.0_91\""



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24263) SparkR java check breaks on openjdk

2018-05-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24263:
-
Priority: Blocker  (was: Major)

> SparkR java check breaks on openjdk
> ---
>
> Key: SPARK-24263
> URL: https://issues.apache.org/jira/browse/SPARK-24263
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Felix Cheung
>Priority: Blocker
>
> testing with openjdk, noticed that it breaks because the open string is 
> different
>  
> instead of "java version" it has
> "openjdk version \"1.8.0_91\""



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24263) SparkR java check breaks on openjdk

2018-05-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24263:
-
Affects Version/s: (was: 2.3.0)
   2.3.1

> SparkR java check breaks on openjdk
> ---
>
> Key: SPARK-24263
> URL: https://issues.apache.org/jira/browse/SPARK-24263
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Felix Cheung
>Priority: Blocker
>
> testing with openjdk, noticed that it breaks because the open string is 
> different
>  
> instead of "java version" it has
> "openjdk version \"1.8.0_91\""



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24263) SparkR java check breaks on openjdk

2018-05-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24263:


Assignee: Apache Spark

> SparkR java check breaks on openjdk
> ---
>
> Key: SPARK-24263
> URL: https://issues.apache.org/jira/browse/SPARK-24263
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Major
>
> testing with openjdk, noticed that it breaks because the open string is 
> different
>  
> instead of "java version" it has
> "openjdk version \"1.8.0_91\""



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24264) [Structured Streaming] Remove 'mergeSchema' option from Parquet source configuration

2018-05-13 Thread Gerard Maas (JIRA)
Gerard Maas created SPARK-24264:
---

 Summary: [Structured Streaming] Remove 'mergeSchema' option from 
Parquet source configuration
 Key: SPARK-24264
 URL: https://issues.apache.org/jira/browse/SPARK-24264
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Gerard Maas


Looking into the Parquet format support for the File source in Structured 
Streaming, the docs mention the use of the option 'mergeSchema' to merge the 
schemas of the part files found.[1]
 
There seems to be no practical use of that configuration in a streaming context.
 
In its batch counterpart, `mergeSchema` would infer the superset schema of the
part files found.
 
 When using the File source + parquet format in streaming mode, we must provide 
a schema to the readStream.schema(...) builder and that schema is fixed for the 
duration of the stream.
 
My current understanding is that:
 
- Files containing a subset of the fields declared in the schema will render 
null values for the non-existing fields.
- For files containing a superset of the fields, the additional data fields 
will be lost. 
- Files not matching the schema set on the streaming source will render all 
fields null for each record in the file.
 
It looks like 'mergeSchema' has no practical effect, although enabling it might 
lead to additional processing to actually merge the Parquet schema of the input 
files. 
 
I inquired on the dev+user mailing lists about any other behavior but I got no 
responses.
 
From the user perspective, they may think that this option would help their
job cope with schema evolution at runtime, but that is also not the case.
 
Looks like removing this option and leaving the value always set to false is 
the reasonable thing to do.  
 
[1] 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala#L376]
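
For reference, a hedged sketch of the point above (the path and columns are
made up; {{spark}} is the usual session): with the file source in streaming
mode the schema is supplied up front and stays fixed, so 'mergeSchema' has
nothing left to merge.
{code:scala}
import org.apache.spark.sql.types._

// The schema is mandatory for the streaming file source and is fixed for the
// lifetime of the query.
val fixedSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("value", StringType)))

val parquetStream = spark.readStream
  .schema(fixedSchema)
  .option("mergeSchema", "true")   // accepted, but has no practical effect here
  .parquet("/data/incoming")
{code}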



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24263) SparkR java check breaks on openjdk

2018-05-13 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-24263:


 Summary: SparkR java check breaks on openjdk
 Key: SPARK-24263
 URL: https://issues.apache.org/jira/browse/SPARK-24263
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Felix Cheung


Testing with openjdk, noticed that it breaks because the version string is
different.

 

instead of "java version" it has

"openjdk version \"1.8.0_91\""



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24262) Fix typo in UDF error message

2018-05-13 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-24262:
---

Assignee: Kelley Robinson

> Fix typo in UDF error message
> -
>
> Key: SPARK-24262
> URL: https://issues.apache.org/jira/browse/SPARK-24262
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: holdenk
>Assignee: Kelley Robinson
>Priority: Trivial
> Fix For: 2.3.1, 2.4.0
>
>
> Fix the spelling of functon to function



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24262) Fix typo in UDF error message

2018-05-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24262:


Assignee: Apache Spark

> Fix typo in UDF error message
> -
>
> Key: SPARK-24262
> URL: https://issues.apache.org/jira/browse/SPARK-24262
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Trivial
> Fix For: 2.3.1, 2.4.0
>
>
> Fix the spelling of functon to function



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24262) Fix typo in UDF error message

2018-05-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24262:


Assignee: (was: Apache Spark)

> Fix typo in UDF error message
> -
>
> Key: SPARK-24262
> URL: https://issues.apache.org/jira/browse/SPARK-24262
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Trivial
> Fix For: 2.3.1, 2.4.0
>
>
> Fix the spelling of functon to function



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24262) Fix typo in UDF error message

2018-05-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473623#comment-16473623
 ] 

Apache Spark commented on SPARK-24262:
--

User 'robinske' has created a pull request for this issue:
https://github.com/apache/spark/pull/21304

> Fix typo in UDF error message
> -
>
> Key: SPARK-24262
> URL: https://issues.apache.org/jira/browse/SPARK-24262
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Trivial
> Fix For: 2.3.1, 2.4.0
>
>
> Fix the spelling of functon to function



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24262) Fix typo in UDF error message

2018-05-13 Thread holdenk (JIRA)
holdenk created SPARK-24262:
---

 Summary: Fix typo in UDF error message
 Key: SPARK-24262
 URL: https://issues.apache.org/jira/browse/SPARK-24262
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.3.0
Reporter: holdenk
 Fix For: 2.3.1, 2.4.0


Fix the spelling of functon to function



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20568) Delete files after processing in structured streaming

2018-05-13 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20568:
--
Affects Version/s: 2.2.1

> Delete files after processing in structured streaming
> -
>
> Key: SPARK-20568
> URL: https://issues.apache.org/jira/browse/SPARK-20568
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Saul Shanabrook
>Priority: Major
>
> It would be great to be able to delete files after processing them with 
> structured streaming.
> For example, I am reading in a bunch of JSON files and converting them into 
> Parquet. If the JSON files are not deleted after they are processed, it 
> quickly fills up my hard drive. I originally [posted this on Stack 
> Overflow|http://stackoverflow.com/q/43671757/907060] and was recommended to 
> make a feature request for it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24261) Spark cannot read renamed managed Hive table

2018-05-13 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated SPARK-24261:

Description: 
When Spark creates a Hive table using df.write.saveAsTable, it creates a
managed table in Hive with SERDEPROPERTIES like

{{WITH SERDEPROPERTIES 
('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table') }}

When an external user changes the Hive table name via the Hive CLI or Hue, Hive
renames the table and changes the path to the new location, but it never
updates the serde property mentioned above.

*Steps to Reproduce:*

1. Save table using spark:
 {{spark.sql("select * from 
some_db.some_table").write.saveAsTable("some_db.some_new_table")}}

2. In Hive CLI or Hue, run 
{{alter table some_db.some_new_table rename to 
some_db.some_new_table_buggy_path}}

3. Try to read the buggy table *some_db.some_new_table_buggy_path* in Spark:
{{spark.sql("select * from some_db.some_new_table_buggy_path limit 
10").collect}}

Spark throws the following warnings (Spark fails to read the table, while Hive
can still read it):

{{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
stale CacheEntry; failed to fetch item info for: 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from 
cache}}
 {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
stale CacheEntry; failed to fetch item info for: 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing 
from cache}}
 {{18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was it 
deleted very recently?}}
 {{res2: Array[org.apache.spark.sql.Row] = Array()}}

The DDLs for each of the tables are attached. 

This creates an inconsistency, and end users will spend endless time hunting
for the bug when data exists in both locations but Spark reads from a different
location than the one where the Hive process writes new data.

I went through similar JIRAs, but those address different issues.

SPARK-15635 and SPARK-16570 address ALTER TABLE issued from within Spark,
unlike this JIRA, where an external process renames the table.

 

  was:
When spark creates hive table using df.write.saveAsTable, it creates managed 
table in hive with SERDEPROPERTIES like 

{{WITH SERDEPROPERTIES (}}
{{'path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table') }}

When any external user changes hive table name via Hive CLI or Hue, Hive makes 
sure the table name is changed and also the path is changed to new location. 
But it never updates the serdeproperties mentioned above. 

 

*Steps to Reproduce:*

1. Save table using spark
 {{spark.sql("select * from 
some_db.some_table").write.saveAsTable("some_db.some_new_table")}}

2. In Hive CLI or Hue, run {{alter table some_db.some_new_table rename to 
some_db.some_new_table_buggy_path}}

3. Try to ready the buggy table *some_db.some_new_table_buggy_path* in spark 
{{spark.sql("select * from some_db.some_new_table_buggy_path limit 
10").collect}}

 

Spark throws following warning (Spark fails to read while hive can read this 
table):

{{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
stale CacheEntry; failed to fetch item info for: 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from 
cache}}
 {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
stale CacheEntry; failed to fetch item info for: 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing 
from cache}}
 {{18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was it 
deleted very recently?}}
 {{res2: Array[org.apache.spark.sql.Row] = Array()}}

 

The DDLs for each of the tables are attached. 

 

This will create inconsistency and endusers will spend endless time in finding 
bug if data exists in both location, but spark reads it from different location 
while hive process writes the new data in new location. 

 

I went through similar JIRAs, but those address different issues.

SPARK-15635 and SPARK-16570 address alter table in spark, unlike this jira, 
while other external process renames the table.

 


> Spark cannot read renamed managed Hive table
> 
>
> Key: SPARK-24261
> URL: https://issues.apache.org/jira/browse/SPARK-24261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Priority: Major
> Attachments: some_db.some_new_table.ddl, 
> some_db.some_new_table_buggy_path.ddl, some_db.some_table.ddl
>
>
> When spark creates hive table using df.write.saveAsTable, it creates managed 
> table in hive with SERDEPROPERTIES l

[jira] [Updated] (SPARK-24261) Spark cannot read renamed managed Hive table

2018-05-13 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated SPARK-24261:

Attachment: some_db.some_new_table_buggy_path.ddl

> Spark cannot read renamed managed Hive table
> 
>
> Key: SPARK-24261
> URL: https://issues.apache.org/jira/browse/SPARK-24261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Priority: Major
> Attachments: some_db.some_new_table.ddl, 
> some_db.some_new_table_buggy_path.ddl, some_db.some_table.ddl
>
>
> When spark creates hive table using df.write.saveAsTable, it creates managed 
> table in hive with SERDEPROPERTIES like 
> {{WITH SERDEPROPERTIES (}}
> {{'path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table') }}
> When any external user changes hive table name via Hive CLI or Hue, Hive 
> makes sure the table name is changed and also the path is changed to new 
> location. But it never updates the serdeproperties mentioned above. 
>  
> *Steps to Reproduce:*
> 1. Save table using spark
>  {{spark.sql("select * from 
> some_db.some_table").write.saveAsTable("some_db.some_new_table")}}
> 2. In Hive CLI or Hue, run {{alter table some_db.some_new_table rename to 
> some_db.some_new_table_buggy_path}}
> 3. Try to ready the buggy table *some_db.some_new_table_buggy_path* in spark 
> {{spark.sql("select * from some_db.some_new_table_buggy_path limit 
> 10").collect}}
>  
> Spark throws following warning (Spark fails to read while hive can read this 
> table):
> {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
> stale CacheEntry; failed to fetch item info for: 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from 
> cache}}
>  {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
> stale CacheEntry; failed to fetch item info for: 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing 
> from cache}}
>  {{18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was 
> it deleted very recently?}}
>  {{res2: Array[org.apache.spark.sql.Row] = Array()}}
>  
> The DDLs for each of the tables are attached. 
>  
> This will create inconsistency and endusers will spend endless time in 
> finding bug if data exists in both location, but spark reads it from 
> different location while hive process writes the new data in new location. 
>  
> I went through similar JIRAs, but those address different issues.
> SPARK-15635 and SPARK-16570 address alter table in spark, unlike this jira, 
> while other external process renames the table.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24261) Spark cannot read renamed managed Hive table

2018-05-13 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated SPARK-24261:

Attachment: some_db.some_table.ddl
some_db.some_new_table.ddl
some_db.some_new_table_buggy_path.ddl

> Spark cannot read renamed managed Hive table
> 
>
> Key: SPARK-24261
> URL: https://issues.apache.org/jira/browse/SPARK-24261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Priority: Major
> Attachments: some_db.some_new_table.ddl, 
> some_db.some_new_table_buggy_path.ddl, some_db.some_table.ddl
>
>
> When spark creates hive table using df.write.saveAsTable, it creates managed 
> table in hive with SERDEPROPERTIES like 
> {{WITH SERDEPROPERTIES (}}
> {{'path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table') }}
> When any external user changes hive table name via Hive CLI or Hue, Hive 
> makes sure the table name is changed and also the path is changed to new 
> location. But it never updates the serdeproperties mentioned above. 
>  
> *Steps to Reproduce:*
> 1. Save table using spark
>  {{spark.sql("select * from 
> some_db.some_table").write.saveAsTable("some_db.some_new_table")}}
> 2. In Hive CLI or Hue, run {{alter table some_db.some_new_table rename to 
> some_db.some_new_table_buggy_path}}
> 3. Try to ready the buggy table *some_db.some_new_table_buggy_path* in spark 
> {{spark.sql("select * from some_db.some_new_table_buggy_path limit 
> 10").collect}}
>  
> Spark throws following warning (Spark fails to read while hive can read this 
> table):
> {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
> stale CacheEntry; failed to fetch item info for: 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from 
> cache}}
>  {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
> stale CacheEntry; failed to fetch item info for: 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing 
> from cache}}
>  {{18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was 
> it deleted very recently?}}
>  {{res2: Array[org.apache.spark.sql.Row] = Array()}}
>  
> The DDLs for each of the tables are attached. 
>  
> This will create inconsistency and endusers will spend endless time in 
> finding bug if data exists in both location, but spark reads it from 
> different location while hive process writes the new data in new location. 
>  
> I went through similar JIRAs, but those address different issues.
> SPARK-15635 and SPARK-16570 address alter table in spark, unlike this jira, 
> while other external process renames the table.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24261) Spark cannot read renamed managed Hive table

2018-05-13 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated SPARK-24261:

Attachment: (was: some_db.some_new_table_buggy_path.ddl)

> Spark cannot read renamed managed Hive table
> 
>
> Key: SPARK-24261
> URL: https://issues.apache.org/jira/browse/SPARK-24261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Priority: Major
> Attachments: some_db.some_new_table.ddl, 
> some_db.some_new_table_buggy_path.ddl, some_db.some_table.ddl
>
>
> When spark creates hive table using df.write.saveAsTable, it creates managed 
> table in hive with SERDEPROPERTIES like 
> {{WITH SERDEPROPERTIES (}}
> {{'path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table') }}
> When any external user changes hive table name via Hive CLI or Hue, Hive 
> makes sure the table name is changed and also the path is changed to new 
> location. But it never updates the serdeproperties mentioned above. 
>  
> *Steps to Reproduce:*
> 1. Save table using spark
>  {{spark.sql("select * from 
> some_db.some_table").write.saveAsTable("some_db.some_new_table")}}
> 2. In Hive CLI or Hue, run {{alter table some_db.some_new_table rename to 
> some_db.some_new_table_buggy_path}}
> 3. Try to ready the buggy table *some_db.some_new_table_buggy_path* in spark 
> {{spark.sql("select * from some_db.some_new_table_buggy_path limit 
> 10").collect}}
>  
> Spark throws following warning (Spark fails to read while hive can read this 
> table):
> {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
> stale CacheEntry; failed to fetch item info for: 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from 
> cache}}
>  {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
> stale CacheEntry; failed to fetch item info for: 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing 
> from cache}}
>  {{18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory 
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was 
> it deleted very recently?}}
>  {{res2: Array[org.apache.spark.sql.Row] = Array()}}
>  
> The DDLs for each of the tables are attached. 
>  
> This will create inconsistency and endusers will spend endless time in 
> finding bug if data exists in both location, but spark reads it from 
> different location while hive process writes the new data in new location. 
>  
> I went through similar JIRAs, but those address different issues.
> SPARK-15635 and SPARK-16570 address alter table in spark, unlike this jira, 
> while other external process renames the table.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24261) Spark cannot read renamed managed Hive table

2018-05-13 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated SPARK-24261:

Description: 
When spark creates hive table using df.write.saveAsTable, it creates managed 
table in hive with SERDEPROPERTIES like 

{{WITH SERDEPROPERTIES (}}
{{'path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table') }}

When any external user changes hive table name via Hive CLI or Hue, Hive makes 
sure the table name is changed and also the path is changed to new location. 
But it never updates the serdeproperties mentioned above. 

 

*Steps to Reproduce:*

1. Save table using spark
 {{spark.sql("select * from 
some_db.some_table").write.saveAsTable("some_db.some_new_table")}}

2. In Hive CLI or Hue, run {{alter table some_db.some_new_table rename to 
some_db.some_new_table_buggy_path}}

3. Try to ready the buggy table *some_db.some_new_table_buggy_path* in spark 
{{spark.sql("select * from some_db.some_new_table_buggy_path limit 
10").collect}}

 

Spark throws following warning (Spark fails to read while hive can read this 
table):

{{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
stale CacheEntry; failed to fetch item info for: 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from 
cache}}
 {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
stale CacheEntry; failed to fetch item info for: 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing 
from cache}}
 {{18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was it 
deleted very recently?}}
 {{res2: Array[org.apache.spark.sql.Row] = Array()}}

 

The DDLs for each of the tables are attached. 

 

This creates an inconsistency, and end users will spend endless time hunting 
for the bug when data exists in both locations: Spark reads from one location 
while the Hive process writes new data to the other. 

 

I went through similar JIRAs, but they address different issues.

SPARK-15635 and SPARK-16570 cover ALTER TABLE issued from Spark itself, whereas 
in this JIRA an external process renames the table.
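
A possible interim workaround (an untested sketch, not a fix for the underlying 
lookup) would be to re-point the stale {{path}} serde property at the renamed 
table's new location and then refresh Spark's cached metadata. The location 
used below is only a guess at where the rename moved the data; check 
{{DESCRIBE FORMATTED}} for the real value.

{code:scala}
// Untested sketch: repoint the stale 'path' serde property to the renamed
// table's new location, then refresh Spark's cached metadata. Whether this is
// enough depends on where Spark actually resolves the path from, which is
// exactly what this issue is about.
spark.sql("""
  ALTER TABLE some_db.some_new_table_buggy_path
  SET SERDEPROPERTIES (
    'path'='gs://some-gs-bucket/warehouse/hive/some.db/some_new_table_buggy_path'
  )
""")
spark.catalog.refreshTable("some_db.some_new_table_buggy_path")
{code}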

 

  was:
When Spark creates a Hive table via df.write.saveAsTable, it creates a managed 
table in Hive with SERDEPROPERTIES like 

{{WITH SERDEPROPERTIES (}}
{{'path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table')}}

When an external user renames the Hive table via the Hive CLI or Hue, Hive 
changes the table name and the storage location, but it never updates the 
serdeproperties shown above. 

 

Steps to Reproduce:

1. Save a table using Spark:
 {{spark.sql("select * from 
some_db.some_table").write.saveAsTable("some_db.some_new_table")}}

2. In the Hive CLI or Hue, run {{alter table some_db.some_new_table rename to 
some_db.some_new_table_buggy_path}}

3. Try to read the buggy table *some_db.some_new_table_buggy_path* in Spark:
{{spark.sql("select * from some_db.some_new_table_buggy_path limit 
10").collect}}

 

Spark throws the following warnings (Spark fails to read the table while Hive 
can still read it):

{{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
stale CacheEntry; failed to fetch item info for: 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from 
cache}}
 {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
stale CacheEntry; failed to fetch item info for: 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing 
from cache}}
 {{18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was it 
deleted very recently?}}
 {{res2: Array[org.apache.spark.sql.Row] = Array()}}

 

The DDLs for each of the tables are attached. 

 

This creates an inconsistency, and end users will spend endless time hunting 
for the bug when data exists in both locations: Spark reads from one location 
while the Hive process writes new data to the other. 

 

I went through similar JIRAs, but they address different issues.

[SPARK-15635|https://issues.apache.org/jira/browse/SPARK-15635] and 
[SPARK-16570|https://issues.apache.org/jira/browse/SPARK-16570] cover ALTER 
TABLE issued from Spark itself, whereas in this JIRA an external process 
renames the table.

 


> Spark cannot read renamed managed Hive table
> 
>
> Key: SPARK-24261
> URL: https://issues.apache.org/jira/browse/SPARK-24261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Priority: Major
>
> When spark creates hive table using df.write.saveAsTable, it creates managed 
> table in hive with SERDE

[jira] [Updated] (SPARK-24261) Spark cannot read renamed managed Hive table

2018-05-13 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated SPARK-24261:

Description: 
When Spark creates a Hive table via df.write.saveAsTable, it creates a managed 
table in Hive with SERDEPROPERTIES like 

{{WITH SERDEPROPERTIES (}}
{{'path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table')}}

When an external user renames the Hive table via the Hive CLI or Hue, Hive 
changes the table name and the storage location, but it never updates the 
serdeproperties shown above. 

 

Steps to Reproduce:

1. Save a table using Spark:
 {{spark.sql("select * from 
some_db.some_table").write.saveAsTable("some_db.some_new_table")}}

2. In the Hive CLI or Hue, run {{alter table some_db.some_new_table rename to 
some_db.some_new_table_buggy_path}}

3. Try to read the buggy table *some_db.some_new_table_buggy_path* in Spark:
{{spark.sql("select * from some_db.some_new_table_buggy_path limit 
10").collect}}

 

Spark throws the following warnings (Spark fails to read the table while Hive 
can still read it):

{{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
stale CacheEntry; failed to fetch item info for: 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from 
cache}}
 {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
stale CacheEntry; failed to fetch item info for: 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing 
from cache}}
 {{18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was it 
deleted very recently?}}
 {{res2: Array[org.apache.spark.sql.Row] = Array()}}

 

The DDLs for each of the tables are attached. 

 

This creates an inconsistency, and end users will spend endless time hunting 
for the bug when data exists in both locations: Spark reads from one location 
while the Hive process writes new data to the other. 

 

I went through similar JIRAs, but they address different issues.

[SPARK-15635|https://issues.apache.org/jira/browse/SPARK-15635] and 
[SPARK-16570|https://issues.apache.org/jira/browse/SPARK-16570] cover ALTER 
TABLE issued from Spark itself, whereas in this JIRA an external process 
renames the table.

 

  was:
When Spark creates a Hive table via df.write.saveAsTable, it creates a managed 
table in Hive with SERDEPROPERTIES like 

{{WITH SERDEPROPERTIES (}}
{{'path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table')}}

When an external user renames the Hive table via the Hive CLI or Hue, Hive 
changes the table name and the storage location, but it never updates the 
serdeproperties shown above. 

 

Steps to Reproduce:

1. Save a table using Spark:
 {{spark.sql("select * from 
some_db.some_table").write.saveAsTable("some_db.some_new_table")}}

2. In the Hive CLI or Hue, run {{alter table some_db.some_new_table rename to 
some_db.some_new_table_buggy_path}}

3. Try to read the buggy table *some_db.some_new_table_buggy_path* in Spark:
{{spark.sql("select * from some_db.some_new_table_buggy_path limit 
10").collect}}

 

Spark throws the following warnings (Spark fails to read the table while Hive 
can still read it):


{{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
stale CacheEntry; failed to fetch item info for: 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from 
cache}}
{{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
stale CacheEntry; failed to fetch item info for: 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing 
from cache}}
{{18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was it 
deleted very recently?}}
{{res2: Array[org.apache.spark.sql.Row] = Array()}}

 

The DDLs for each of the tables are attached. 

 

This creates an inconsistency, and end users will spend endless time hunting 
for the bug when data exists in both locations: Spark reads from one location 
while the Hive process writes new data to the other. 

 


> Spark cannot read renamed managed Hive table
> 
>
> Key: SPARK-24261
> URL: https://issues.apache.org/jira/browse/SPARK-24261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Priority: Major
>
> When spark creates hive table using df.write.saveAsTable, it creates managed 
> table in hive with SERDEPROPERTIES like 
> {{WITH SERDEPROPERTIES (}}
>  {{'path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table') }}
> When any external user changes hive table name via Hive CLI or Hue, H

[jira] [Created] (SPARK-24261) Spark cannot read renamed managed Hive table

2018-05-13 Thread Suraj Nayak (JIRA)
Suraj Nayak created SPARK-24261:
---

 Summary: Spark cannot read renamed managed Hive table
 Key: SPARK-24261
 URL: https://issues.apache.org/jira/browse/SPARK-24261
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Suraj Nayak


When Spark creates a Hive table via df.write.saveAsTable, it creates a managed 
table in Hive with SERDEPROPERTIES like 

{{WITH SERDEPROPERTIES (}}
{{'path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table')}}

When an external user renames the Hive table via the Hive CLI or Hue, Hive 
changes the table name and the storage location, but it never updates the 
serdeproperties shown above. 

 

Steps to Reproduce:

1. Save a table using Spark:
 {{spark.sql("select * from 
some_db.some_table").write.saveAsTable("some_db.some_new_table")}}

2. In the Hive CLI or Hue, run {{alter table some_db.some_new_table rename to 
some_db.some_new_table_buggy_path}}

3. Try to read the buggy table *some_db.some_new_table_buggy_path* in Spark:
{{spark.sql("select * from some_db.some_new_table_buggy_path limit 
10").collect}}

 

Spark throws the following warnings (Spark fails to read the table while Hive 
can still read it):


{{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
stale CacheEntry; failed to fetch item info for: 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from 
cache}}
{{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible 
stale CacheEntry; failed to fetch item info for: 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing 
from cache}}
{{18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory 
gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was it 
deleted very recently?}}
{{res2: Array[org.apache.spark.sql.Row] = Array()}}

 

The DDLs for each of the tables are attached. 

 

This creates an inconsistency, and end users will spend endless time hunting 
for the bug when data exists in both locations: Spark reads from one location 
while the Hive process writes new data to the other. 

 






[jira] [Commented] (SPARK-24130) Data Source V2: Join Push Down

2018-05-13 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473594#comment-16473594
 ] 

Ryan Blue commented on SPARK-24130:
---

[~jliwork] could you please open up access to allow open commenting on the 
design doc? I'd like to add some questions and comments while I review it. 
Thank you!

> Data Source V2: Join Push Down
> --
>
> Key: SPARK-24130
> URL: https://issues.apache.org/jira/browse/SPARK-24130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jia Li
>Priority: Major
>
> Spark applications often directly query external data sources such as 
> relational databases or files. Spark provides Data Sources APIs for 
> accessing structured data through Spark SQL. The Data Sources APIs in both V1 
> and V2 support optimizations such as filter push down and column pruning, 
> which are a subset of the functionality that can be pushed down to some data 
> sources. 
> We’re proposing to extend Data Sources APIs with join push down (JPD). Join 
> push down significantly improves query performance by reducing the amount of 
> data transfer and exploiting the capabilities of the data sources such as 
> index access.
> Join push down design document is available 
> [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing].
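
For readers new to the proposal, the sketch below shows the kind of thing join 
push down would automate and what users do by hand today with the JDBC source: 
pushing the join into the database by passing a subquery as {{dbtable}}. The 
URL, table and column names are made up for illustration only.

{code:scala}
// Manual join push down with today's JDBC source: hand the database a subquery
// so the join runs remotely and only the joined rows are transferred to Spark.
// Connection details and schema below are illustrative, not real.
val joined = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/sales")
  .option("dbtable",
    """(SELECT o.id, o.amount, c.name
      |   FROM orders o JOIN customers c ON o.customer_id = c.id) AS t""".stripMargin)
  .load()
{code}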






[jira] [Commented] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release

2018-05-13 Thread Ray Donnelly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473595#comment-16473595
 ] 

Ray Donnelly commented on SPARK-24229:
--

Ok I'll need to figure out how to test it then. Thanks for the help.

> Upgrade to the latest Apache Thrift 0.10.0 release
> --
>
> Key: SPARK-24229
> URL: https://issues.apache.org/jira/browse/SPARK-24229
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0
>Reporter: Ray Donnelly
>Priority: Critical
>
> According to [https://www.cvedetails.com/cve/CVE-2016-5397/]
>  
> .. there are critical vulnerabilities in libthrift 0.9.3 currently vendored 
> in Apache Spark (and then, for us, into PySpark).
>  
> Can anyone help to assess the seriousness of this and what should be done 
> about it?
>  






[jira] [Commented] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release

2018-05-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473590#comment-16473590
 ] 

Marcelo Vanzin commented on SPARK-24229:


The best approach if you really want this fixed is to make the change, test to 
make sure that it works, and open a PR.

> Upgrade to the latest Apache Thrift 0.10.0 release
> --
>
> Key: SPARK-24229
> URL: https://issues.apache.org/jira/browse/SPARK-24229
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0
>Reporter: Ray Donnelly
>Priority: Critical
>
> According to [https://www.cvedetails.com/cve/CVE-2016-5397/]
>  
> .. there are critical vulnerabilities in libthrift 0.9.3 currently vendored 
> in Apache Spark (and then, for us, into PySpark).
>  
> Can anyone help to assess the seriousness of this and what should be done 
> about it?
>  






[jira] [Updated] (SPARK-24260) Support for multi-statement SQL in SparkSession.sql API

2018-05-13 Thread Ravindra Nath Kakarla (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravindra Nath Kakarla updated SPARK-24260:
--
Description: 
sparkSession.sql API only supports a single SQL statement to be executed for a 
call. A multi-statement SQL cannot be executed in a single call. For example,
{code:java}
SparkSession sparkSession = SparkSession.builder().appName("MultiStatementSQL") 
                                         .master("local").config("", 
"").getOrCreate()
sparkSession.sql("DROP TABLE IF EXISTS count_employees; CACHE TABLE employees; 
CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as cnt FROM employees; 
SELECT * FROM count_employees") 

{code}
The above code fails with the error: 
{code:java}
org.apache.spark.sql.catalyst.parser.ParseException: mismatched input ';' 
expecting {code}
The solution to this problem is to call the .sql API multiple times in a 
specific order.
{code:java}
sparkSession.sql("DROP TABLE IF EXISTS count_employees")
sparkSession.sql("CACHE TABLE employees")
sparkSession.sql("CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as 
cnt FROM employees;")
sparkSession.sql("SELECT * FROM count_employees")

{code}
If these SQL statements come from a string or a file, users have to implement 
their own parser to execute them, for example:
{code:java}
val sqlFromFile = """DROP TABLE IF EXISTS count_employees;
 |CACHE TABLE employees;
 |CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as cnt FROM 
employees; SELECT * FROM count_employees""".stripMargin{code}
{code:java}
sqlFromFile.split(";")
  .foreach(line => sparkSession.sql(line))

{code}
This naive parser can fail for many edge cases (like ";" inside a string). Even 
if users use the same grammar used by Spark and implement their own parsing, it 
can go out of sync with the way Spark parses the statements.
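
For illustration only, a quote-aware splitter is not much harder than the split 
above, but it still shows why this belongs in Spark's parser rather than in 
every application: the untested sketch below handles ";" inside single- or 
double-quoted literals and nothing else (no comments, no escaped quotes).

{code:scala}
// Rough sketch of a quote-aware splitter: it skips ';' inside quoted literals
// but still ignores SQL comments, escaped quotes, etc.
def splitStatements(script: String): Seq[String] = {
  val statements = scala.collection.mutable.ArrayBuffer.empty[String]
  val current = new StringBuilder
  var quote: Option[Char] = None
  for (c <- script) {
    quote match {
      case Some(q) =>
        current.append(c)
        if (c == q) quote = None                    // closing quote
      case None if c == '\'' || c == '"' =>
        quote = Some(c); current.append(c)          // opening quote
      case None if c == ';' =>
        statements += current.toString; current.clear()
      case None =>
        current.append(c)
    }
  }
  statements += current.toString
  statements.map(_.trim).filter(_.nonEmpty)
}

splitStatements(sqlFromFile).foreach(stmt => sparkSession.sql(stmt))
{code}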

Can support for multiple SQL statements be built into SparkSession.sql API 
itself?

 

  was:
sparkSession.sql API only supports a single SQL statement to be executed for a 
call. A multi-statement SQL cannot be executed in a single call. For example,
{code:java}
SparkSession sparkSession = SparkSession.builder().appName("MultiStatementSQL") 
                                         .master("local").config("", 
"").getOrCreate()
sparkSession.sql("DROP TABLE IF EXISTS count_employees; CACHE TABLE employees; 
CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as cnt FROM employees; 
SELECT * FROM count_employees") 

{code}
Above code fails with the error, 
{code:java}
org.apache.spark.sql.catalyst.parser.ParseException: mismatched input ';' 
expecting {code}
Solution to this problem is to use the .sql API multiple times in a specific 
order.
{code:java}
sparkSession.sql("DROP TABLE IF EXISTS count_employees")
sparkSession.sql("CACHE TABLE employees")
sparkSession.sql("CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as 
cnt FROM employees;")
sparkSession.sql("SELECT * FROM count_employees")

{code}
If these SQL statements come from a string / file, users have to implement 
their own parsers to execute this. Like,
{code:java}
val sqlFromFile = """DROP TABLE IF EXISTS count_employees;
 |CACHE TABLE employees;
 |CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as cnt FROM 
employees; SELECT * FROM count_employees""".stripMargin{code}
One has to implement custom parsing to execute this, 
{code:java}
sqlFromFile.split(";")
  .foreach(line => sparkSession.sql(line))

{code}
 This naive parser can fail for many edge cases. Even if users use the same 
grammar used by Spark, it won't be consistent when Spark updates the grammar. 

Can support for multiple SQL statements be built into SparkSession.sql API 
itself?

 


> Support for multi-statement SQL in SparkSession.sql API
> ---
>
> Key: SPARK-24260
> URL: https://issues.apache.org/jira/browse/SPARK-24260
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Ravindra Nath Kakarla
>Priority: Minor
>
> sparkSession.sql API only supports a single SQL statement to be executed for 
> a call. A multi-statement SQL cannot be executed in a single call. For 
> example,
> {code:java}
> SparkSession sparkSession = 
> SparkSession.builder().appName("MultiStatementSQL")                           
>                .master("local").config("", "").getOrCreate()
> sparkSession.sql("DROP TABLE IF EXISTS count_employees; CACHE TABLE 
> employees; CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as cnt 
> FROM employees; SELECT * FROM count_employees") 
> {code}
> Above code fails with the error, 
> {code:java}
> org.apache.spark.sql.catalyst.parser.ParseException: mismatched input ';' 
> expecting {code}
> Solution to this problem is to use the .sql API multiple times in a specif

[jira] [Created] (SPARK-24260) Support for multi-statement SQL in SparkSession.sql API

2018-05-13 Thread Ravindra Nath Kakarla (JIRA)
Ravindra Nath Kakarla created SPARK-24260:
-

 Summary: Support for multi-statement SQL in SparkSession.sql API
 Key: SPARK-24260
 URL: https://issues.apache.org/jira/browse/SPARK-24260
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Ravindra Nath Kakarla


sparkSession.sql API only supports a single SQL statement to be executed for a 
call. A multi-statement SQL cannot be executed in a single call. For example,
{code:java}
SparkSession sparkSession = SparkSession.builder().appName("MultiStatementSQL") 
                                         .master("local").config("", 
"").getOrCreate()
sparkSession.sql("DROP TABLE IF EXISTS count_employees; CACHE TABLE employees; 
CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as cnt FROM employees; 
SELECT * FROM count_employees") 

{code}
Above code fails with the error, 
{code:java}
org.apache.spark.sql.catalyst.parser.ParseException: mismatched input ';' 
expecting {code}
Solution to this problem is to use the .sql API multiple times in a specific 
order.
{code:java}
sparkSession.sql("DROP TABLE IF EXISTS count_employees")
sparkSession.sql("CACHE TABLE employees")
sparkSession.sql("CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as 
cnt FROM employees;")
sparkSession.sql("SELECT * FROM count_employees")

{code}
If these SQL statements come from a string / file, users have to implement 
their own parsers to execute this. Like,
{code:java}
val sqlFromFile = """DROP TABLE IF EXISTS count_employees;
 |CACHE TABLE employees;
 |CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as cnt FROM 
employees; SELECT * FROM count_employees""".stripMargin{code}
One has to implement custom parsing to execute this, 
{code:java}
sqlFromFile.split(";")
  .foreach(line => sparkSession.sql(line))

{code}
 This naive parser can fail for many edge cases. Even if users use the same 
grammar used by Spark, it won't be consistent when Spark updates the grammar. 

Can support for multiple SQL statements be built into SparkSession.sql API 
itself?

 






[jira] [Commented] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL

2018-05-13 Thread Fernando Pereira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473584#comment-16473584
 ] 

Fernando Pereira commented on SPARK-19618:
--

[~cloud_fan] I have created the Jira and an implementation to lift the limit 
via a configuration option. Internally we are forced to use our mod, and it 
would be nice to get in sync with upstream again at some point. It is a very 
small patch in the end. Thanks.
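
For context, a sketch of what the configuration-based override looks like from 
the user side. The key name below is my best recollection of the option that 
eventually landed upstream ({{spark.sql.sources.bucketing.maxBuckets}}) and 
should be treated as an assumption until checked against the merged patch; 
{{df}} is the DataFrame from the example below.

{code:scala}
// Assumed config key (verify against the merged patch): raise the maximum
// allowed bucket count, then bucket with a value beyond the old hard limit.
spark.conf.set("spark.sql.sources.bucketing.maxBuckets", "150000")

df.write
  .format("orc")
  .bucketBy(120000, "j", "k")
  .sortBy("j", "k")
  .saveAsTable("bucketed_table")
{code}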

> Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
> 
>
> Key: SPARK-19618
> URL: https://issues.apache.org/jira/browse/SPARK-19618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Major
> Fix For: 2.2.0
>
>
> High number of buckets is allowed while creating a table via SQL query:
> {code}
> sparkSession.sql("""
> CREATE TABLE bucketed_table(col1 INT) USING parquet 
> CLUSTERED BY (col1) SORTED BY (col1) INTO 147483647 BUCKETS
> """)
> sparkSession.sql("DESC FORMATTED bucketed_table").collect.foreach(println)
> 
> [Num Buckets:,147483647,]
> [Bucket Columns:,[col1],]
> [Sort Columns:,[col1],]
> 
> {code}
> Trying the same via dataframe API does not work:
> {code}
> > df.write.format("orc").bucketBy(147483647, 
> > "j","k").sortBy("j","k").saveAsTable("bucketed_table")
> java.lang.IllegalArgumentException: requirement failed: Bucket number must be 
> greater than 0 and less than 10.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:293)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:291)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.DataFrameWriter.getBucketSpec(DataFrameWriter.scala:291)
>   at 
> org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:429)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:410)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:365)
>   ... 50 elided
> {code}






[jira] [Commented] (SPARK-18580) Use spark.streaming.backpressure.initialRate in DirectKafkaInputDStream

2018-05-13 Thread Farah Fertassi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473557#comment-16473557
 ] 

Farah Fertassi commented on SPARK-18580:


Hello folks, is the fix version 2.4.0 ? Thank you :)

> Use spark.streaming.backpressure.initialRate in DirectKafkaInputDStream
> ---
>
> Key: SPARK-18580
> URL: https://issues.apache.org/jira/browse/SPARK-18580
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Oleg Muravskiy
>Assignee: Oleksandr Konopko
>Priority: Major
>
> Currently `spark.streaming.kafka.maxRatePerPartition` is used as the 
> initial rate when backpressure is enabled. This is too aggressive for the 
> application while it is still warming up.
> This is similar to SPARK-11627, applying the solution provided there to 
> DirectKafkaInputDStream.
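
For reference, the settings involved are sketched below; all three keys are 
existing Spark settings, and whether {{spark.streaming.backpressure.initialRate}} 
is actually honored by the direct Kafka stream is exactly what this issue asks 
for.

{code:scala}
import org.apache.spark.SparkConf

// Sketch of the intended end state: backpressure on, a hard per-partition
// ceiling at all times, and a gentler rate for the first batches while the
// rate estimator warms up.
val conf = new SparkConf()
  .setAppName("kafka-backpressure")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")
  .set("spark.streaming.backpressure.initialRate", "1000")
{code}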






[jira] [Commented] (SPARK-24187) add array join

2018-05-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473547#comment-16473547
 ] 

Apache Spark commented on SPARK-24187:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21313

> add array join
> --
>
> Key: SPARK-24187
> URL: https://issues.apache.org/jira/browse/SPARK-24187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> add R version of https://issues.apache.org/jira/browse/SPARK-23916
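
For anyone unfamiliar with the function being ported, the Scala side added by 
SPARK-23916 looks roughly like the sketch below (assumes a SparkSession in 
scope as {{spark}}, e.g. in spark-shell).

{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions.array_join

// array_join concatenates array elements with a delimiter, optionally
// replacing nulls; this sub-task adds the same function to SparkR.
val df = Seq(Seq("a", "b", "c"), Seq("x", null, "z")).toDF("arr")
df.select(array_join($"arr", ", ", "NULL")).show()
{code}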






[jira] [Assigned] (SPARK-24187) add array join

2018-05-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24187:


Assignee: (was: Apache Spark)

> add array join
> --
>
> Key: SPARK-24187
> URL: https://issues.apache.org/jira/browse/SPARK-24187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> add R version of https://issues.apache.org/jira/browse/SPARK-23916






[jira] [Assigned] (SPARK-24187) add array join

2018-05-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24187:


Assignee: Apache Spark

> add array join
> --
>
> Key: SPARK-24187
> URL: https://issues.apache.org/jira/browse/SPARK-24187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>
> add R version of https://issues.apache.org/jira/browse/SPARK-23916






[jira] [Resolved] (SPARK-16658) Add EdgePartition.withVertexAttributes

2018-05-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16658.
---
Resolution: Won't Fix

> Add EdgePartition.withVertexAttributes
> --
>
> Key: SPARK-16658
> URL: https://issues.apache.org/jira/browse/SPARK-16658
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ben McCann
>Priority: Major
>
> I'm using cloudml/zen, which has forked graphx. I'd like to see their changes 
> upstreamed, so that they can go back to using the upstream graphx instead of 
> having a fork.
> Their implementation of withVertexAttributes: 
> https://github.com/cloudml/zen/blob/94ba7d7f216feb2bff910eec7285dd7caf9440f0/ml/src/main/scala/org/apache/spark/graphx2/impl/EdgePartition.scala
> Their usage of that method: 
> https://github.com/cloudml/zen/blob/8a64a141685d6637a993c3cc6d1788f414d6c3cf/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDADefines.scala






[jira] [Resolved] (SPARK-23884) hasLaunchedTask should be true when launchedAnyTask be true

2018-05-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23884.
---
   Resolution: Not A Problem
Fix Version/s: (was: 2.3.0)

> hasLaunchedTask should be true when launchedAnyTask be true
> ---
>
> Key: SPARK-23884
> URL: https://issues.apache.org/jira/browse/SPARK-23884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Priority: Major
>  Labels: easyfix
> Attachments: SPARK-23884.patch
>
>
> *hasLaunchedTask* should be *true* when *launchedAnyTask* is *true*, rather 
> than when *tasks.size > 0*.
> *tasks.size* can be greater than 0 as long as there are any *WorkerOffers*, 
> but that does not ensure any task was actually launched.






[jira] [Resolved] (SPARK-15739) Expose aggregateMessagesWithActiveSet to users.

2018-05-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15739.
---
Resolution: Won't Fix

> Expose aggregateMessagesWithActiveSet to users.
> ---
>
> Key: SPARK-15739
> URL: https://issues.apache.org/jira/browse/SPARK-15739
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.6.1
>Reporter: Anderson de Andrade
>Priority: Minor
>
> The current version of Pregel has some flaws:
> * Each iteration expands the lineage, making stages slower as it progresses.
> * It uses the deprecated {{mapReduceTriplets}} method, which makes the 
> interface of the sendMsg function different from {{aggregateMessages}}, 
> bringing inconsistencies to the API.
> * It enforces an initialization stage that prevents users from having a 
> custom state at the beginning of the process.
> It would be fairly trivial to create custom versions of Pregel that would 
> work for many other use cases if {{aggregateMessagesWithActiveSet}} were not 
> a private method.
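
For readers who have not used the newer API, the public counterpart is sketched 
below (assuming an existing {{Graph[Double, Int]}} called {{graph}}); the 
private {{aggregateMessagesWithActiveSet}} is the same idea plus an 
active-vertex-set restriction, which is what custom Pregel variants need.

{code:scala}
import org.apache.spark.graphx._

// Public aggregateMessages: send each source attribute to the destination and
// sum the messages per vertex. The private "WithActiveSet" variant would let a
// custom Pregel send only along edges touching "active" vertices.
val sums: VertexRDD[Double] = graph.aggregateMessages[Double](
  ctx => ctx.sendToDst(ctx.srcAttr),
  _ + _
)
{code}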






[jira] [Resolved] (SPARK-23812) DFS should be removed from unsupportedHiveNativeCommands in SqlBase.g4

2018-05-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23812.
---
Resolution: Not A Problem

> DFS should be removed from unsupportedHiveNativeCommands in SqlBase.g4
> --
>
> Key: SPARK-23812
> URL: https://issues.apache.org/jira/browse/SPARK-23812
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: wangtao93
>Priority: Minor
>
> The dfs command is already supported, but SqlBase.g4 still lists it under 
> unsupportedHiveNativeCommands.


