[jira] [Resolved] (SPARK-27259) Allow setting -1 as split size for InputFileBlock

2019-10-16 Thread Praneet Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praneet Sharma resolved SPARK-27259.

Resolution: Fixed

Fixed with [https://github.com/apache/spark/pull/26123]
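For context, a minimal sketch of the kind of relaxation applied (an assumption for
illustration; the exact change is in the linked PR): treat -1 as an "unknown length"
marker for non-splittable inputs while still rejecting other negative values.

{code:scala}
// Sketch only: relaxed validation in org.apache.spark.rdd.InputFileBlockHolder.
// A length of -1 means "length unknown" (e.g. a non-splittable compressed file);
// any other negative value is still rejected.
def set(filePath: String, startOffset: Long, length: Long): Unit = {
  require(filePath != null, "filePath cannot be null")
  require(startOffset >= 0, s"startOffset ($startOffset) cannot be negative")
  require(length >= -1, s"length ($length) cannot be smaller than -1")
  inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, length))
}
{code}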

> Allow setting -1 as split size for InputFileBlock
> -
>
> Key: SPARK-27259
> URL: https://issues.apache.org/jira/browse/SPARK-27259
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Simon poortman
>Assignee: Praneet Sharma
>Priority: Major
>
>  
> Since Spark 2.2.x, when a Spark job processes compressed HDFS files with a 
> custom input file format, the job fails with the error 
> "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot 
> be negative". The custom input file format returns a byte length of -1 for 
> compressed file formats because compressed HDFS files are not splittable, so 
> for such formats the split has offset 0 and byte length -1. Spark should 
> treat a byte length of -1 as a valid split for compressed file formats.
>  
> Earlier versions of Spark did not have this validation; it was introduced in 
> Spark 2.2.x in the class InputFileBlockHolder. Spark should therefore accept 
> -1 as a valid byte length for input splits in Spark 2.2.x and later as well.
>  
> +Below is the stack trace.+
>  Caused by: java.lang.IllegalArgumentException: requirement failed: length 
> (-1) cannot be negative
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:70)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:226)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>  
> +Below is the code snippet which caused this issue.+
>  require(length >= 0, s"length ($length) cannot be negative") // This validation caused the issue.
>  
> {code:java}
> // org.apache.spark.rdd.InputFileBlockHolder (spark-core)
> def set(filePath: String, startOffset: Long, length: Long): Unit = {
>   require(filePath != null, "filePath cannot be null")
>   require(startOffset >= 0, s"startOffset ($startOffset) cannot be negative")
>   require(length >= 0, s"length ($length) cannot be negative")
>   inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, length))
> }
> {code}
>  
> +Steps to reproduce the issue.+
>  Please refer to the code below to reproduce the issue.
> {code:java}
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
>
> val hadoopConf = new JobConf()
> FileInputFormat.setInputPaths(hadoopConf, new Path("/output656/part-r-0.gz"))
> val records = sc.hadoopRDD(
>   hadoopConf,
>   classOf[com.platform.custom.storagehandler.INFAInputFormat],
>   classOf[org.apache.hadoop.io.LongWritable],
>   classOf[org.apache.hadoop.io.Writable])
> records.count()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27259) Allow setting -1 as split size for InputFileBlock

2019-10-16 Thread Praneet Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praneet Sharma updated SPARK-27259:
---
Fix Version/s: 3.0.0

> Allow setting -1 as split size for InputFileBlock
> -
>
> Key: SPARK-27259
> URL: https://issues.apache.org/jira/browse/SPARK-27259
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Simon poortman
>Assignee: Praneet Sharma
>Priority: Major
> Fix For: 3.0.0
>
>
>  
> Since Spark 2.2.x, when a Spark job processes compressed HDFS files with a 
> custom input file format, the job fails with the error 
> "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot 
> be negative". The custom input file format returns a byte length of -1 for 
> compressed file formats because compressed HDFS files are not splittable, so 
> for such formats the split has offset 0 and byte length -1. Spark should 
> treat a byte length of -1 as a valid split for compressed file formats.
>  
> Earlier versions of Spark did not have this validation; it was introduced in 
> Spark 2.2.x in the class InputFileBlockHolder. Spark should therefore accept 
> -1 as a valid byte length for input splits in Spark 2.2.x and later as well.
>  
> +Below is the stack trace.+
>  Caused by: java.lang.IllegalArgumentException: requirement failed: length 
> (-1) cannot be negative
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:70)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:226)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>  
> +Below is the code snippet which caused this issue.+
>  require(length >= 0, s"length ($length) cannot be negative") // This validation caused the issue.
>  
> {code:java}
> // org.apache.spark.rdd.InputFileBlockHolder (spark-core)
> def set(filePath: String, startOffset: Long, length: Long): Unit = {
>   require(filePath != null, "filePath cannot be null")
>   require(startOffset >= 0, s"startOffset ($startOffset) cannot be negative")
>   require(length >= 0, s"length ($length) cannot be negative")
>   inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, length))
> }
> {code}
>  
> +Steps to reproduce the issue.+
>  Please refer to the code below to reproduce the issue.
> {code:java}
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
>
> val hadoopConf = new JobConf()
> FileInputFormat.setInputPaths(hadoopConf, new Path("/output656/part-r-0.gz"))
> val records = sc.hadoopRDD(
>   hadoopConf,
>   classOf[com.platform.custom.storagehandler.INFAInputFormat],
>   classOf[org.apache.hadoop.io.LongWritable],
>   classOf[org.apache.hadoop.io.Writable])
> records.count()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29488) In Web UI, stage page has js error when sort table.

2019-10-16 Thread jenny (Jira)
jenny created SPARK-29488:
-

 Summary: In Web UI, stage page has js error when sort table.
 Key: SPARK-29488
 URL: https://issues.apache.org/jira/browse/SPARK-29488
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.4, 2.3.2
Reporter: jenny


In the Web UI, following the steps below produces the JS error "Uncaught 
TypeError: Failed to execute 'removeChild' on 'Node': parameter 1 is not of 
type 'Node'.":
 # Click the "Min" table header under "Summary Metrics..."
 # Click the "Task Time" table header under "Aggregated Metrics by Executor"
 # Click the "Min" table header under "Summary Metrics..." again (same as step 1)

 

!image-2019-10-16-15-38-42-379.png!

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29488) In Web UI, stage page has js error when sort table.

2019-10-16 Thread jenny (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jenny updated SPARK-29488:
--
Attachment: image-2019-10-16-15-47-25-212.png

> In Web UI, stage page has js error when sort table.
> ---
>
> Key: SPARK-29488
> URL: https://issues.apache.org/jira/browse/SPARK-29488
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.2, 2.4.4
>Reporter: jenny
>Priority: Major
> Attachments: image-2019-10-16-15-47-25-212.png
>
>
> In the Web UI, following the steps below produces the JS error "Uncaught 
> TypeError: Failed to execute 'removeChild' on 'Node': parameter 1 is not of 
> type 'Node'.":
>  # Click the "Min" table header under "Summary Metrics..."
>  # Click the "Task Time" table header under "Aggregated Metrics by Executor"
>  # Click the "Min" table header under "Summary Metrics..." again (same as step 1)
>  
> !image-2019-10-16-15-38-42-379.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29488) In Web UI, stage page has js error when sort table.

2019-10-16 Thread jenny (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jenny updated SPARK-29488:
--
Description: 
In the Web UI, following the steps below produces the JS error "Uncaught 
TypeError: Failed to execute 'removeChild' on 'Node': parameter 1 is not of 
type 'Node'.":
 # Click the "Min" table header under "Summary Metrics..."
 # Click the "Task Time" table header under "Aggregated Metrics by Executor"
 # Click the "Min" table header under "Summary Metrics..." again (same as step 1)

  !image-2019-10-16-15-47-25-212.png!

 

  was:
In the Web UI, following the steps below produces the JS error "Uncaught 
TypeError: Failed to execute 'removeChild' on 'Node': parameter 1 is not of 
type 'Node'.":
 # Click the "Min" table header under "Summary Metrics..."
 # Click the "Task Time" table header under "Aggregated Metrics by Executor"
 # Click the "Min" table header under "Summary Metrics..." again (same as step 1)

 

!image-2019-10-16-15-38-42-379.png!

 


> In Web UI, stage page has js error when sort table.
> ---
>
> Key: SPARK-29488
> URL: https://issues.apache.org/jira/browse/SPARK-29488
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.2, 2.4.4
>Reporter: jenny
>Priority: Major
> Attachments: image-2019-10-16-15-47-25-212.png
>
>
> In the Web UI, following the steps below produces the JS error "Uncaught 
> TypeError: Failed to execute 'removeChild' on 'Node': parameter 1 is not of 
> type 'Node'.":
>  # Click the "Min" table header under "Summary Metrics..."
>  # Click the "Task Time" table header under "Aggregated Metrics by Executor"
>  # Click the "Min" table header under "Summary Metrics..." again (same as step 1)
>   !image-2019-10-16-15-47-25-212.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29489) ml.evaluation support log-loss

2019-10-16 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29489:


 Summary: ml.evaluation support log-loss
 Key: SPARK-29489
 URL: https://issues.apache.org/jira/browse/SPARK-29489
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


Log-loss (aka logistic loss or cross-entropy loss) is one of the most widely 
used metrics in classification tasks. It is already implemented in popular 
libraries such as sklearn.

However, it is still missing from ml.evaluation.
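
For reference, a minimal sketch of the metric itself in plain Scala (not the 
proposed ml.evaluation API; the epsilon clipping mirrors what scikit-learn does 
to avoid log(0)):

{code:scala}
// Per-row log-loss given the predicted probability assigned to the true label.
def logLoss(probOfTrueLabel: Double, eps: Double = 1e-15): Double = {
  val p = math.min(math.max(probOfTrueLabel, eps), 1.0 - eps) // clip away from 0 and 1
  -math.log(p)
}

// Dataset-level metric: the mean of the per-row losses.
def meanLogLoss(probsOfTrueLabels: Seq[Double]): Double =
  probsOfTrueLabels.map(p => logLoss(p)).sum / probsOfTrueLabels.size
{code}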



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29490) Reset 'WritableColumnVector' in 'RowToColumnarExec'

2019-10-16 Thread Rong Ma (Jira)
Rong Ma created SPARK-29490:
---

 Summary: Reset 'WritableColumnVector' in 'RowToColumnarExec'
 Key: SPARK-29490
 URL: https://issues.apache.org/jira/browse/SPARK-29490
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Rong Ma


When converting {{Iterator[InternalRow]}} to {{Iterator[ColumnarBatch]}}, the 
vectors used to create a new {{ColumnarBatch}} should be reset in the 
iterator's "next()" method.
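
A minimal standalone sketch of why the reset matters when a writable vector is 
reused across batches (illustration only, not the RowToColumnarExec code itself):

{code:scala}
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

val vector = new OnHeapColumnVector(4, IntegerType)
val batch = new ColumnarBatch(Array[ColumnVector](vector))

// First batch: append two rows.
vector.appendInt(1)
vector.appendInt(2)
batch.setNumRows(2)

// Without reset(), the next batch's values would be appended after the old
// ones, so row 0 would still read the stale value from the previous batch.
vector.reset()
vector.appendInt(3)
batch.setNumRows(1)
assert(batch.column(0).getInt(0) == 3)
{code}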



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.

2019-10-16 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952603#comment-16952603
 ] 

Thomas Graves commented on SPARK-29465:
---

Note: the problem I see with just using the port the user specified is that 
many users don't know what they are doing in different environments, and they 
end up with random failures that they don't necessarily understand. The 
default port is 4040, not 0. So I'm a bit on the fence about whether the 
solution here is purely "use the port when a specific port is specified". If 
it's a range of ports, that makes more sense. So I would like to understand 
your use case and why you are trying to specify a specific port.

> Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode. 
> -
>
> Key: SPARK-29465
> URL: https://issues.apache.org/jira/browse/SPARK-29465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 3.0.0
>Reporter: Vishwas Nalka
>Priority: Major
>
>  I'm trying to restrict the ports used by a Spark app launched in YARN 
> cluster mode. All ports (viz. driver, executor, block manager) can be 
> specified using their respective properties except the UI port. The Spark 
> app is launched from Java code, and setting the property spark.ui.port in 
> SparkConf doesn't seem to help. Even setting the JVM option 
> -Dspark.ui.port="some_port" does not spawn the UI on the required port.
> From the logs of the spark app, *_the property spark.ui.port is overridden 
> and the JVM property '-Dspark.ui.port=0' is set_* even though it is never set 
> to 0. 
> _(Run in Spark 1.6.2) From the logs ->_
> _command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH"
>  {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m 
> -Xmx4096m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.blockManager.port=9900' 
> '-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903' 
> '-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20' 
> '-Dspark.ui.port=0' '-Dspark.executor.port=9905'_
> _19/10/14 16:39:59 INFO Utils: Successfully started service 'SparkUI' on port 
> 35167.19/10/14 16:39:59 INFO SparkUI: Started SparkUI at_ 
> [_http://10.65.170.98:35167_|http://10.65.170.98:35167/]
> Even trying a *spark-submit command with --conf spark.ui.port* does not 
> spawn the UI on the required port.
> {color:#172b4d}_(Run in Spark 2.4.4)_{color}
>  {color:#172b4d}_./bin/spark-submit --class org.apache.spark.examples.SparkPi 
> --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g 
> --executor-cores 1 --conf spark.ui.port=12345 --conf spark.driver.port=12340 
> --queue default examples/jars/spark-examples_2.11-2.4.4.jar 10_{color}
> _From the logs::_
>  _19/10/15 00:04:05 INFO ui.SparkUI: Stopped Spark web UI at 
> [http://invrh74ace005.informatica.com:46622|http://invrh74ace005.informatica.com:46622/]_
> _command:{{JAVA_HOME}}/bin/java -server -Xmx2048m 
> -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0'  'Dspark.driver.port=12340' 
> -Dspark.yarn.app.container.log.dir= -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://coarsegrainedschedu...@invrh74ace005.informatica.com:12340 
> --executor-id  --hostname  --cores 1 --app-id 
> application_1570992022035_0089 --user-class-path 
> [file:$PWD/__app__.jar1|file://%24pwd/__app__.jar1]>/stdout2>/stderr_
>  
> It looks like the ApplicationMaster overrides this and sets a JVM property 
> before launch, resulting in a random UI port even though spark.ui.port is 
> set by the user.
> In these links
>  # 
> [https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala]
>  (line 214)
>  # 
> [https://github.com/cloudera/spark/blob/master/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala]
>  (line 75)
> I can see that the _*run()*_ method in the above files sets the system 
> properties _*UI_PORT*_ and _*spark.ui.port*_ respectively.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.

2019-10-16 Thread Vishwas Nalka (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952608#comment-16952608
 ] 

Vishwas Nalka commented on SPARK-29465:
---

The use case I'm trying to handle is restricting the ports used by a Spark 
job to a range of ports.

I was trying to restrict the ports using the properties _"spark.driver.port", 
"spark.blockManager.port", "spark.ui.port"_ and _"spark.port.maxRetries"_ to 
control the range. However, I was unable to configure "spark.ui.port"; it was 
being overridden to 0.

> Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode. 
> -
>
> Key: SPARK-29465
> URL: https://issues.apache.org/jira/browse/SPARK-29465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 3.0.0
>Reporter: Vishwas Nalka
>Priority: Major
>
>  I'm trying to restrict the ports used by a Spark app launched in YARN 
> cluster mode. All ports (viz. driver, executor, block manager) can be 
> specified using their respective properties except the UI port. The Spark 
> app is launched from Java code, and setting the property spark.ui.port in 
> SparkConf doesn't seem to help. Even setting the JVM option 
> -Dspark.ui.port="some_port" does not spawn the UI on the required port.
> From the logs of the spark app, *_the property spark.ui.port is overridden 
> and the JVM property '-Dspark.ui.port=0' is set_* even though it is never set 
> to 0. 
> _(Run in Spark 1.6.2) From the logs ->_
> _command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH"
>  {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m 
> -Xmx4096m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.blockManager.port=9900' 
> '-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903' 
> '-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20' 
> '-Dspark.ui.port=0' '-Dspark.executor.port=9905'_
> _19/10/14 16:39:59 INFO Utils: Successfully started service 'SparkUI' on port 
> 35167.19/10/14 16:39:59 INFO SparkUI: Started SparkUI at_ 
> [_http://10.65.170.98:35167_|http://10.65.170.98:35167/]
> Even trying a *spark-submit command with --conf spark.ui.port* does not 
> spawn the UI on the required port.
> {color:#172b4d}_(Run in Spark 2.4.4)_{color}
>  {color:#172b4d}_./bin/spark-submit --class org.apache.spark.examples.SparkPi 
> --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g 
> --executor-cores 1 --conf spark.ui.port=12345 --conf spark.driver.port=12340 
> --queue default examples/jars/spark-examples_2.11-2.4.4.jar 10_{color}
> _From the logs::_
>  _19/10/15 00:04:05 INFO ui.SparkUI: Stopped Spark web UI at 
> [http://invrh74ace005.informatica.com:46622|http://invrh74ace005.informatica.com:46622/]_
> _command:{{JAVA_HOME}}/bin/java -server -Xmx2048m 
> -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0'  'Dspark.driver.port=12340' 
> -Dspark.yarn.app.container.log.dir= -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://coarsegrainedschedu...@invrh74ace005.informatica.com:12340 
> --executor-id  --hostname  --cores 1 --app-id 
> application_1570992022035_0089 --user-class-path 
> [file:$PWD/__app__.jar1|file://%24pwd/__app__.jar1]>/stdout2>/stderr_
>  
> It looks like the ApplicationMaster overrides this and sets a JVM property 
> before launch, resulting in a random UI port even though spark.ui.port is 
> set by the user.
> In these links
>  # 
> [https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala]
>  (line 214)
>  # 
> [https://github.com/cloudera/spark/blob/master/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala]
>  (line 75)
> I can see that the _*run()*_ method in the above files sets the system 
> properties _*UI_PORT*_ and _*spark.ui.port*_ respectively.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.

2019-10-16 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952617#comment-16952617
 ] 

Thomas Graves commented on SPARK-29465:
---

So I'd be fine with a range of ports. I suggest updating the description and 
supporting this for all port types. I'm pretty sure this has been brought up 
before; see 
https://issues.apache.org/jira/browse/SPARK-4449?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22port%20range%22
 and possibly others.

> Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode. 
> -
>
> Key: SPARK-29465
> URL: https://issues.apache.org/jira/browse/SPARK-29465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 3.0.0
>Reporter: Vishwas Nalka
>Priority: Major
>
>  I'm trying to restrict the ports used by a Spark app launched in YARN 
> cluster mode. All ports (viz. driver, executor, block manager) can be 
> specified using their respective properties except the UI port. The Spark 
> app is launched from Java code, and setting the property spark.ui.port in 
> SparkConf doesn't seem to help. Even setting the JVM option 
> -Dspark.ui.port="some_port" does not spawn the UI on the required port.
> From the logs of the spark app, *_the property spark.ui.port is overridden 
> and the JVM property '-Dspark.ui.port=0' is set_* even though it is never set 
> to 0. 
> _(Run in Spark 1.6.2) From the logs ->_
> _command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH"
>  {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m 
> -Xmx4096m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.blockManager.port=9900' 
> '-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903' 
> '-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20' 
> '-Dspark.ui.port=0' '-Dspark.executor.port=9905'_
> _19/10/14 16:39:59 INFO Utils: Successfully started service 'SparkUI' on port 
> 35167.19/10/14 16:39:59 INFO SparkUI: Started SparkUI at_ 
> [_http://10.65.170.98:35167_|http://10.65.170.98:35167/]
> Even trying a *spark-submit command with --conf spark.ui.port* does not 
> spawn the UI on the required port.
> {color:#172b4d}_(Run in Spark 2.4.4)_{color}
>  {color:#172b4d}_./bin/spark-submit --class org.apache.spark.examples.SparkPi 
> --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g 
> --executor-cores 1 --conf spark.ui.port=12345 --conf spark.driver.port=12340 
> --queue default examples/jars/spark-examples_2.11-2.4.4.jar 10_{color}
> _From the logs::_
>  _19/10/15 00:04:05 INFO ui.SparkUI: Stopped Spark web UI at 
> [http://invrh74ace005.informatica.com:46622|http://invrh74ace005.informatica.com:46622/]_
> _command:{{JAVA_HOME}}/bin/java -server -Xmx2048m 
> -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0'  'Dspark.driver.port=12340' 
> -Dspark.yarn.app.container.log.dir= -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://coarsegrainedschedu...@invrh74ace005.informatica.com:12340 
> --executor-id  --hostname  --cores 1 --app-id 
> application_1570992022035_0089 --user-class-path 
> [file:$PWD/__app__.jar1|file://%24pwd/__app__.jar1]>/stdout2>/stderr_
>  
> It looks like the ApplicationMaster overrides this and sets a JVM property 
> before launch, resulting in a random UI port even though spark.ui.port is 
> set by the user.
> In these links
>  # 
> [https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala]
>  (line 214)
>  # 
> [https://github.com/cloudera/spark/blob/master/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala]
>  (line 75)
> I can see that the _*run()*_ method in the above files sets the system 
> properties _*UI_PORT*_ and _*spark.ui.port*_ respectively.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.

2019-10-16 Thread Vishwas Nalka (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952647#comment-16952647
 ] 

Vishwas Nalka commented on SPARK-29465:
---

I need a small clarification on restricting ports.

I was able to configure values for all port types using their respective 
properties. It was only spark.ui.port that was being overridden by the 
ApplicationMaster and set to 0. Once the Spark job was launched, I could see 
that all ports of the Spark job (both the driver process and the executor 
processes) were assigned as configured, except the UI port, which was spawned 
on a random port. From the logs of the Spark app, I could also verify that 
spark.driver.port, spark.executor.port and the other port types were bound 
within the range _(port_mentioned to port_mentioned + spark.port.maxRetries)_, 
except the UI port, which started on a random port.

As shared in the description, from the Spark app logs:

command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH"
 JAVA_HOME/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m 
-Xmx4096m -Djava.io.tmpdir=PWD/tmp '-Dspark.blockManager.port=9900' 
'-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903' 
'-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20' 
*_'-Dspark.ui.port=0'_* '-Dspark.executor.port=9905'


_You can see that the UI port is being set to 0, i.e. random port selection, 
even though I configured it to a different value._

Instead of adding new support to restrict Spark's port range, I felt this 
could be achieved with the right combination of _"spark.port.maxRetries"_ and 
explicit values for the port types. However, for the UI port, the 
ApplicationMaster overrides it via a JVM property just before launch, giving 
the user no way to restrict or set the UI port. I felt that only the UI port 
causes the issue.

Please share your suggestion.
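
For reference, a sketch of the approach described above (property names taken 
from the discussion; this does not work around the UI-port override that this 
ticket is about):

{code:scala}
import org.apache.spark.SparkConf

// Pin base ports and let spark.port.maxRetries bound the search range:
// each service may bind anywhere from its base port up to base + maxRetries.
val conf = new SparkConf()
  .set("spark.driver.port", "9902")
  .set("spark.blockManager.port", "9900")
  .set("spark.ui.port", "12345")      // reported to be overridden to 0 in yarn-cluster mode
  .set("spark.port.maxRetries", "20")
{code}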

> Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode. 
> -
>
> Key: SPARK-29465
> URL: https://issues.apache.org/jira/browse/SPARK-29465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 3.0.0
>Reporter: Vishwas Nalka
>Priority: Major
>
>  I'm trying to restrict the ports used by a Spark app launched in YARN 
> cluster mode. All ports (viz. driver, executor, block manager) can be 
> specified using their respective properties except the UI port. The Spark 
> app is launched from Java code, and setting the property spark.ui.port in 
> SparkConf doesn't seem to help. Even setting the JVM option 
> -Dspark.ui.port="some_port" does not spawn the UI on the required port.
> From the logs of the spark app, *_the property spark.ui.port is overridden 
> and the JVM property '-Dspark.ui.port=0' is set_* even though it is never set 
> to 0. 
> _(Run in Spark 1.6.2) From the logs ->_
> _command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH"
>  {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m 
> -Xmx4096m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.blockManager.port=9900' 
> '-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903' 
> '-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20' 
> '-Dspark.ui.port=0' '-Dspark.executor.port=9905'_
> _19/10/14 16:39:59 INFO Utils: Successfully started service 'SparkUI' on port 
> 35167.19/10/14 16:39:59 INFO SparkUI: Started SparkUI at_ 
> [_http://10.65.170.98:35167_|http://10.65.170.98:35167/]
> Even trying a *spark-submit command with --conf spark.ui.port* does not 
> spawn the UI on the required port.
> {color:#172b4d}_(Run in Spark 2.4.4)_{color}
>  {color:#172b4d}_./bin/spark-submit --class org.apache.spark.examples.SparkPi 
> --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g 
> --executor-cores 1 --conf spark.ui.port=12345 --conf spark.driver.port=12340 
> --queue default examples/jars/spark-examples_2.11-2.4.4.jar 10_{color}
> _From the logs::_
>  _19/10/15 00:04:05 INFO ui.SparkUI: Stopped Spark web UI at 
> [http://invrh74ace005.informatica.com:46622|http://invrh74ace005.informatica.com:46622/]_
> _command:{{JAVA_HOME}}/bin/java -server -Xmx2048m 
> -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0'  'Dspark.driver.port=12340' 
> -Dspark.yarn.app.container.log.dir= -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://coarsegrainedschedu...@invrh74ace005.informatica.com:12340 
> --executor-id  --hostname  --cores 1 --app-id 
> application_1570992022035_0089 --user-class-path 
> [file:$PWD/__app__.jar1|file://%24pwd/__app__.jar1]>/stdout2>/stderr_
>  
> It looks like the ApplicationMaster overrides this and sets a JVM property 
> before launch, resulting in a random UI port even though spark.ui.port is 
> set by the user.

[jira] [Comment Edited] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.

2019-10-16 Thread Vishwas Nalka (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952647#comment-16952647
 ] 

Vishwas Nalka edited comment on SPARK-29465 at 10/16/19 9:18 AM:
-

I need a small clarification on restricting ports.

I was able to configure values for all port types using their respective 
properties. It was only spark.ui.port that was being overridden by the 
ApplicationMaster and set to 0. Once the Spark job was launched, I could see 
that all ports of the Spark job (both the driver process and the executor 
processes) were assigned as configured, except the UI port, which was spawned 
on a random port. From the logs of the Spark app, I could also verify that 
spark.driver.port, spark.executor.port and the other port types were bound 
within the range _(port_mentioned to port_mentioned + spark.port.maxRetries)_, 
except the UI port, which started on a random port.

As shared in the description, from the Spark app logs:

command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH"
 JAVA_HOME/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m 
-Xmx4096m -Djava.io.tmpdir=PWD/tmp '-Dspark.blockManager.port=9900' 
'-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903' 
'-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20' 
*_'-Dspark.ui.port=0'_* '-Dspark.executor.port=9905'

_You can see that the UI port is being set to 0, i.e. random port selection, 
even though *I configured it to a different value.*_

Instead of adding new support to restrict Spark's port range, I felt this 
could be achieved with the right combination of _"spark.port.maxRetries"_ and 
explicit values for the port types. However, for the UI port, the 
ApplicationMaster overrides it via a JVM property just before launch, giving 
the user no way to restrict or set the UI port. I felt that only the UI port 
causes the issue.

Please share your suggestion.


was (Author: vishwasn):
I need a small clarification on restricting ports.

I was able to configure values for all port types using their respective 
properties. It was only spark.ui.port that was being overridden by the 
ApplicationMaster and set to 0. Once the Spark job was launched, I could see 
that all ports of the Spark job (both the driver process and the executor 
processes) were assigned as configured, except the UI port, which was spawned 
on a random port. From the logs of the Spark app, I could also verify that 
spark.driver.port, spark.executor.port and the other port types were bound 
within the range _(port_mentioned to port_mentioned + spark.port.maxRetries)_, 
except the UI port, which started on a random port.

As shared in the description, from the Spark app logs:

command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH"
 JAVA_HOME/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m 
-Xmx4096m -Djava.io.tmpdir=PWD/tmp '-Dspark.blockManager.port=9900' 
'-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903' 
'-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20' 
*_'-Dspark.ui.port=0'_* '-Dspark.executor.port=9905'


_You can see that the UI port is being set to 0, i.e. random port selection, 
even though I configured it to a different value._

Instead of adding new support to restrict Spark's port range, I felt this 
could be achieved with the right combination of _"spark.port.maxRetries"_ and 
explicit values for the port types. However, for the UI port, the 
ApplicationMaster overrides it via a JVM property just before launch, giving 
the user no way to restrict or set the UI port. I felt that only the UI port 
causes the issue.

Please share your suggestion.

> Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode. 
> -
>
> Key: SPARK-29465
> URL: https://issues.apache.org/jira/browse/SPARK-29465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 3.0.0
>Reporter: Vishwas Nalka
>Priority: Major
>
>  I'm trying to restrict the ports used by a Spark app launched in YARN 
> cluster mode. All ports (viz. driver, executor, block manager) can be 
> specified using their respective properties except the UI port. The Spark 
> app is launched from Java code, and setting the property spark.ui.port in 
> SparkConf doesn't seem to help. Even setting the JVM option 
> -Dspark.ui.port="some_port" does not spawn the UI on the required port.
> From the logs of the spark app, *_the property spark.ui.port is overridden 
> and the JVM property '-Dspark.ui.port=0' is set_* even though it is never set 
> to 0. 
> _(Run in Spark 1.6.2) From the logs ->_
> _command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH"
>  {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m 
> -Xmx4096m -Djava

[jira] [Created] (SPARK-29491) Add bit_count function support

2019-10-16 Thread Kent Yao (Jira)
Kent Yao created SPARK-29491:


 Summary: Add bit_count function support
 Key: SPARK-29491
 URL: https://issues.apache.org/jira/browse/SPARK-29491
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


[https://dev.mysql.com/doc/refman/8.0/en/bit-functions.html#function_bit-count]
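
For context, a small sketch of the semantics the linked MySQL docs describe, 
expressed as a plain Scala helper (the ticket only proposes adding the SQL 
function; its exact Spark signature is not specified here):

{code:scala}
// Number of bits set to 1 in the two's-complement representation of the value.
def bitCount(value: Long): Int = java.lang.Long.bitCount(value)

bitCount(29L)  // 29 = 0b11101, so this returns 4
{code}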



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23578) Add multicolumn support for Binarizer

2019-10-16 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-23578.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26064
[https://github.com/apache/spark/pull/26064]

> Add multicolumn support for Binarizer
> -
>
> Key: SPARK-23578
> URL: https://issues.apache.org/jira/browse/SPARK-23578
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Teng Peng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> [SPARK-20542] added an API so that Bucketizer can bin multiple columns. 
> Based on that change, multi-column support is added for Binarizer.
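
A minimal usage sketch of the multi-column API (setter names assumed to mirror 
the multi-column Bucketizer API; `df` is a placeholder DataFrame with numeric 
columns "f1" and "f2"):

{code:scala}
import org.apache.spark.ml.feature.Binarizer

// Binarize two columns at once, each with its own threshold.
val binarizer = new Binarizer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("f1_bin", "f2_bin"))
  .setThresholds(Array(0.5, 10.0))

val binarized = binarizer.transform(df)  // df: DataFrame with columns "f1" and "f2"
{code}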



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23578) Add multicolumn support for Binarizer

2019-10-16 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-23578:


Assignee: zhengruifeng

> Add multicolumn support for Binarizer
> -
>
> Key: SPARK-23578
> URL: https://issues.apache.org/jira/browse/SPARK-23578
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Teng Peng
>Assignee: zhengruifeng
>Priority: Minor
>
> [SPARK-20542] added an API so that Bucketizer can bin multiple columns. 
> Based on that change, multi-column support is added for Binarizer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29492) HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.

2019-10-16 Thread angerszhu (Jira)
angerszhu created SPARK-29492:
-

 Summary: HiveThriftBinaryServerSuite#withCLIServiceClient didn't 
delete created table.
 Key: SPARK-29492
 URL: https://issues.apache.org/jira/browse/SPARK-29492
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 3.0.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29492) HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.

2019-10-16 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952804#comment-16952804
 ] 

angerszhu commented on SPARK-29492:
---

I will raise a PR soon.

> HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.
> -
>
> Key: SPARK-29492
> URL: https://issues.apache.org/jira/browse/SPARK-29492
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29492) HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.

2019-10-16 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29492:
--
Description: HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete 
created table.

> HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.
> -
>
> Key: SPARK-29492
> URL: https://issues.apache.org/jira/browse/SPARK-29492
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29492) SparkThriftServer can't support jar class as table serde class when executestatement in sync mode

2019-10-16 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29492:
--
Summary: SparkThriftServer  can't support jar class as table serde class 
when executestatement in sync mode  (was: 
HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.)

> SparkThriftServer  can't support jar class as table serde class when 
> executestatement in sync mode
> --
>
> Key: SPARK-29492
> URL: https://issues.apache.org/jira/browse/SPARK-29492
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29468) Floating point literals produce incorrect SQL

2019-10-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29468:
---

Assignee: Jose Torres

> Floating point literals produce incorrect SQL
> -
>
> Key: SPARK-29468
> URL: https://issues.apache.org/jira/browse/SPARK-29468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
>
> A FLOAT literal 1.2345 produces the SQL `CAST(1.2345 AS FLOAT)`. For very 
> small values this doesn't work; `CAST(1e-44 AS FLOAT)`, for example, doesn't 
> parse, because the parser tries to squeeze the numeric literal 1e-44 into a 
> DECIMAL(38).
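
A small illustration of the reported shape of the problem (assumptions: a 
spark-shell-style session where `spark` is available; this is not the fix itself):

{code:scala}
// 1e-44 is representable as a (denormal) Float -- Float.MinPositiveValue is
// roughly 1.4e-45 -- but the parser tries to squeeze the literal into a
// DECIMAL(38) first, which cannot hold it.
val tiny: Float = 1e-44f

// The shape of the statement reported not to parse before the fix:
// spark.sql("SELECT CAST(1e-44 AS FLOAT)")
{code}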



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29468) Floating point literals produce incorrect SQL

2019-10-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29468.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26114
[https://github.com/apache/spark/pull/26114]

> Floating point literals produce incorrect SQL
> -
>
> Key: SPARK-29468
> URL: https://issues.apache.org/jira/browse/SPARK-29468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 3.0.0
>
>
> A FLOAT literal 1.2345 produces the SQL `CAST(1.2345 AS FLOAT)`. For very 
> small values this doesn't work; `CAST(1e-44 AS FLOAT)`, for example, doesn't 
> parse, because the parser tries to squeeze the numeric literal 1e-44 into a 
> DECIMAL(38).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29492) SparkThriftServer can't support jar class as table serde class when executestatement in sync mode

2019-10-16 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29492:
--
Description: 
Add a UT in HiveThriftBinaryServerSuite:

{code}
  test("jar in sync mode") {
withCLIServiceClient { client =>
  val user = System.getProperty("user.name")
  val sessionHandle = client.openSession(user, "")
  val confOverlay = new java.util.HashMap[java.lang.String, 
java.lang.String]
  val jarFile = HiveTestJars.getHiveHcatalogCoreJar().getCanonicalPath

  Seq(s"ADD JAR $jarFile",
"CREATE TABLE smallKV(key INT, val STRING)",
s"LOAD DATA LOCAL INPATH '${TestData.smallKv}' OVERWRITE INTO TABLE 
smallKV")
.foreach(query => client.executeStatement(sessionHandle, query, 
confOverlay))

  client.executeStatement(sessionHandle,
"""CREATE TABLE addJar(key string)
  |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
""".stripMargin, confOverlay)

  client.executeStatement(sessionHandle,
"INSERT INTO TABLE addJar SELECT 'k1' as key FROM smallKV limit 1", 
confOverlay)


  val operationHandle = client.executeStatement(
sessionHandle,
"SELECT key FROM addJar",
confOverlay)

  // Fetch result first time
  assertResult(1, "Fetching result first time from next row") {

val rows_next = client.fetchResults(
  operationHandle,
  FetchOrientation.FETCH_NEXT,
  1000,
  FetchType.QUERY_OUTPUT)

rows_next.numRows()
  }
}
  }
{code}

Running it results in a ClassNotFoundException.

  was:HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created 
table.


> SparkThriftServer  can't support jar class as table serde class when 
> executestatement in sync mode
> --
>
> Key: SPARK-29492
> URL: https://issues.apache.org/jira/browse/SPARK-29492
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Add a UT in HiveThriftBinaryServerSuite:
> {code}
>   test("jar in sync mode") {
> withCLIServiceClient { client =>
>   val user = System.getProperty("user.name")
>   val sessionHandle = client.openSession(user, "")
>   val confOverlay = new java.util.HashMap[java.lang.String, 
> java.lang.String]
>   val jarFile = HiveTestJars.getHiveHcatalogCoreJar().getCanonicalPath
>   Seq(s"ADD JAR $jarFile",
> "CREATE TABLE smallKV(key INT, val STRING)",
> s"LOAD DATA LOCAL INPATH '${TestData.smallKv}' OVERWRITE INTO TABLE 
> smallKV")
> .foreach(query => client.executeStatement(sessionHandle, query, 
> confOverlay))
>   client.executeStatement(sessionHandle,
> """CREATE TABLE addJar(key string)
>   |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
> """.stripMargin, confOverlay)
>   client.executeStatement(sessionHandle,
> "INSERT INTO TABLE addJar SELECT 'k1' as key FROM smallKV limit 1", 
> confOverlay)
>   val operationHandle = client.executeStatement(
> sessionHandle,
> "SELECT key FROM addJar",
> confOverlay)
>   // Fetch result first time
>   assertResult(1, "Fetching result first time from next row") {
> val rows_next = client.fetchResults(
>   operationHandle,
>   FetchOrientation.FETCH_NEXT,
>   1000,
>   FetchType.QUERY_OUTPUT)
> rows_next.numRows()
>   }
> }
>   }
> {code}
> Running it results in a ClassNotFoundException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29492) SparkThriftServer can't support jar class as table serde class when executestatement in sync mode

2019-10-16 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952804#comment-16952804
 ] 

angerszhu edited comment on SPARK-29492 at 10/16/19 1:10 PM:
-

I will raise a PR soon.
Connecting via PyHive uses sync mode.


was (Author: angerszhuuu):
I will raise a PR soon.

> SparkThriftServer  can't support jar class as table serde class when 
> executestatement in sync mode
> --
>
> Key: SPARK-29492
> URL: https://issues.apache.org/jira/browse/SPARK-29492
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Add a UT in HiveThriftBinaryServerSuite:
> {code}
>   test("jar in sync mode") {
> withCLIServiceClient { client =>
>   val user = System.getProperty("user.name")
>   val sessionHandle = client.openSession(user, "")
>   val confOverlay = new java.util.HashMap[java.lang.String, 
> java.lang.String]
>   val jarFile = HiveTestJars.getHiveHcatalogCoreJar().getCanonicalPath
>   Seq(s"ADD JAR $jarFile",
> "CREATE TABLE smallKV(key INT, val STRING)",
> s"LOAD DATA LOCAL INPATH '${TestData.smallKv}' OVERWRITE INTO TABLE 
> smallKV")
> .foreach(query => client.executeStatement(sessionHandle, query, 
> confOverlay))
>   client.executeStatement(sessionHandle,
> """CREATE TABLE addJar(key string)
>   |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
> """.stripMargin, confOverlay)
>   client.executeStatement(sessionHandle,
> "INSERT INTO TABLE addJar SELECT 'k1' as key FROM smallKV limit 1", 
> confOverlay)
>   val operationHandle = client.executeStatement(
> sessionHandle,
> "SELECT key FROM addJar",
> confOverlay)
>   // Fetch result first time
>   assertResult(1, "Fetching result first time from next row") {
> val rows_next = client.fetchResults(
>   operationHandle,
>   FetchOrientation.FETCH_NEXT,
>   1000,
>   FetchType.QUERY_OUTPUT)
> rows_next.numRows()
>   }
> }
>   }
> {code}
> Running it results in a ClassNotFoundException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26154) Stream-stream joins - left outer join gives inconsistent output

2019-10-16 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-26154:
--
Affects Version/s: 3.0.0

> Stream-stream joins - left outer join gives inconsistent output
> ---
>
> Key: SPARK-26154
> URL: https://issues.apache.org/jira/browse/SPARK-26154
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.2, 3.0.0
> Environment: Spark version - Spark 2.3.2
> OS- Suse 11
>Reporter: Haripriya
>Priority: Blocker
>  Labels: correctness
>
> Stream-stream joins using a left outer join give inconsistent output.
> Data that has already been processed is processed again and yields a null 
> value. In batch 2, the input record "3" is processed, but in batch 6 a null 
> value is emitted again for the same record.
> Steps
> In spark-shell
> {code:java}
> scala> import org.apache.spark.sql.functions.{col, expr}
> import org.apache.spark.sql.functions.{col, expr}
> scala> import org.apache.spark.sql.streaming.Trigger
> import org.apache.spark.sql.streaming.Trigger
> scala> val lines_stream1 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic1").
>  |   option("includeTimestamp", true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data"),col("timestamp") 
> as("recordTime")).
>  |   select("data","recordTime").
>  |   withWatermark("recordTime", "5 seconds ")
> lines_stream1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data: string, recordTime: timestamp]
> scala> val lines_stream2 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic2").
>  |   option("includeTimestamp", value = true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data1"),col("timestamp") 
> as("recordTime1")).
>  |   select("data1","recordTime1").
>  |   withWatermark("recordTime1", "10 seconds ")
> lines_stream2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data1: string, recordTime1: timestamp]
> scala> val query = lines_stream1.join(lines_stream2, expr (
>  |   """
>  | | data == data1 and
>  | | recordTime1 >= recordTime and
>  | | recordTime1 <= recordTime + interval 5 seconds
>  |   """.stripMargin),"left").
>  |   writeStream.
>  |   option("truncate","false").
>  |   outputMode("append").
>  |   format("console").option("checkpointLocation", 
> "/tmp/leftouter/").
>  |   trigger(Trigger.ProcessingTime ("5 seconds")).
>  |   start()
> query: org.apache.spark.sql.streaming.StreamingQuery = 
> org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1a48f55b
> {code}
> Step2 : Start producing data
> kafka-console-producer.sh --broker-list ip:9092 --topic topic1
>  >1
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >bb
>  >cc
> kafka-console-producer.sh --broker-list ip:9092 --topic topic2
>  >2
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >cc
>  >ee
>  >ee
>  
> Output obtained:
> {code:java}
> Batch: 0
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 1
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 2
> ---
> ++---+-+---+
> |data|recordTime |data1|recordTime1|
> ++---+-+---+
> |3   |2018-11-22 20:09:35.053|3|2018-11-22 20:09:36.506|
> |2   |2018-11-22 20:09:31.613|2|2018-11-22 20:09:33.116|
> ++---+-+---+
> ---
> Batch: 3
> ---
> ++---+-+---+
> |data|recordTime |data1|recordTime1|
> ++---+-+---+
> |4   |2018-11-22 20:09:38.654|4|2018-11-22 20:09:39.818|
> ++---+-+---

[jira] [Assigned] (SPARK-29364) Return intervals from date subtracts

2019-10-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-29364:
---

Assignee: Maxim Gekk

> Return intervals from date subtracts
> 
>
> Key: SPARK-29364
> URL: https://issues.apache.org/jira/browse/SPARK-29364
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> According to the SQL standard, date1 - date2 is an interval. See 
> http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt at +_4.5.3 
> Operations involving datetimes and intervals_+. The ticket aims to modify 
> the DateDiff expression to produce an expression of the INTERVAL type when 
> `spark.sql.ansi.enabled` is set to *true* and `spark.sql.dialect` is *Spark*.
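
An illustrative sketch of the intended behaviour (assumes a spark-shell-style 
session with the ANSI flag enabled; the exact output formatting is not taken 
from the ticket):

{code:scala}
spark.conf.set("spark.sql.ansi.enabled", "true")

// With the change, subtracting two dates yields an INTERVAL (here, 15 days)
// instead of an integer day count.
spark.sql("SELECT DATE '2019-10-16' - DATE '2019-10-01'").show(false)
{code}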



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29364) Return intervals from date subtracts

2019-10-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-29364.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Return intervals from date subtracts
> 
>
> Key: SPARK-29364
> URL: https://issues.apache.org/jira/browse/SPARK-29364
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> According to the SQL standard, date1 - date2 is an interval. See 
> http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt at +_4.5.3 
> Operations involving datetimes and intervals_+. The ticket aims to modify 
> the DateDiff expression to produce an expression of the INTERVAL type when 
> `spark.sql.ansi.enabled` is set to *true* and `spark.sql.dialect` is *Spark*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29364) Return intervals from date subtracts

2019-10-16 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952828#comment-16952828
 ] 

Yuming Wang commented on SPARK-29364:
-

Issue resolved by pull request 26112
[https://github.com/apache/spark/pull/26112|https://github.com/apache/spark/pull/26112]

> Return intervals from date subtracts
> 
>
> Key: SPARK-29364
> URL: https://issues.apache.org/jira/browse/SPARK-29364
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> According to the SQL standard, date1 - date2 is an interval. See 
> http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt at +_4.5.3  
> Operations involving datetimes and intervals_+. The ticket aims to modify the 
> DateDiff expression to produce another expression of the INTERVAL type when 
> `spark.sql.ansi.enabled` is set to *true* and `spark.sql.dialect` is *Spark*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27880) Implement boolean aggregates(BOOL_AND, BOOL_OR and EVERY)

2019-10-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27880.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26126
[https://github.com/apache/spark/pull/26126]

> Implement boolean aggregates(BOOL_AND, BOOL_OR and EVERY)
> -
>
> Key: SPARK-27880
> URL: https://issues.apache.org/jira/browse/SPARK-27880
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:sql}
> bool_and/booland_statefunc(expression) -- true if all input values are true, 
> otherwise false
> {code}
> {code:sql}
> bool_or/boolor_statefunc(expression) -- true if at least one input value is 
> true, otherwise false
> {code}
> {code:sql}
> every(expression) -- equivalent to bool_and
> {code}
> More details:
>  [https://www.postgresql.org/docs/9.3/functions-aggregate.html]
>  
> Presto and Vertica also support this feature:
> https://prestodb.github.io/docs/current/functions/aggregate.html
> https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Aggregate/AggregateFunctions.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAggregate%20Functions%7C_0
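
A minimal usage sketch of the aggregates described above, assuming the function 
names land exactly as listed; the table and column names ({{employees}}, 
{{salary}}, {{active}}) are made up for illustration.

{code:java}
// Hypothetical usage once the aggregates exist:
//  - bool_and / every: true only if the predicate holds for every row in the group
//  - bool_or: true if the predicate holds for at least one row
spark.sql("""
  SELECT dept,
         bool_and(salary > 1000) AS all_above_1000,
         bool_or(salary > 5000)  AS any_above_5000,
         every(active)           AS all_active
  FROM employees
  GROUP BY dept
""").show()
{code}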



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27880) Implement boolean aggregates(BOOL_AND, BOOL_OR and EVERY)

2019-10-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27880:
---

Assignee: Wenchen Fan

> Implement boolean aggregates(BOOL_AND, BOOL_OR and EVERY)
> -
>
> Key: SPARK-27880
> URL: https://issues.apache.org/jira/browse/SPARK-27880
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:sql}
> bool_and/booland_statefunc(expression) -- true if all input values are true, 
> otherwise false
> {code}
> {code:sql}
> bool_or/boolor_statefunc(expression) -- true if at least one input value is 
> true, otherwise false
> {code}
> {code:sql}
> every(expression) -- equivalent to bool_and
> {code}
> More details:
>  [https://www.postgresql.org/docs/9.3/functions-aggregate.html]
>  
> Presto and Vertica also support this feature:
> https://prestodb.github.io/docs/current/functions/aggregate.html
> https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Aggregate/AggregateFunctions.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAggregate%20Functions%7C_0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27880) Implement boolean aggregates(BOOL_AND, BOOL_OR and EVERY)

2019-10-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27880:
---

Assignee: Kent Yao  (was: Wenchen Fan)

> Implement boolean aggregates(BOOL_AND, BOOL_OR and EVERY)
> -
>
> Key: SPARK-27880
> URL: https://issues.apache.org/jira/browse/SPARK-27880
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:sql}
> bool_and/booland_statefunc(expression) -- true if all input values are true, 
> otherwise false
> {code}
> {code:sql}
> bool_or/boolor_statefunc(expression) -- true if at least one input value is 
> true, otherwise false
> {code}
> {code:sql}
> every(expression) -- equivalent to bool_and
> {code}
> More details:
>  [https://www.postgresql.org/docs/9.3/functions-aggregate.html]
>  
> Presto and Vertica also support this feature:
> https://prestodb.github.io/docs/current/functions/aggregate.html
> https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Aggregate/AggregateFunctions.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAggregate%20Functions%7C_0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29493) Add MapType support for Arrow Java

2019-10-16 Thread Bryan Cutler (Jira)
Bryan Cutler created SPARK-29493:


 Summary: Add MapType support for Arrow Java
 Key: SPARK-29493
 URL: https://issues.apache.org/jira/browse/SPARK-29493
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Bryan Cutler


This will add MapType support for Arrow in Spark ArrowConverters. This can 
happen after the Arrow 0.15.0 upgrade, but MapType is not available in pyarrow 
yet, so pyspark and pandas_udf will come later.
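
For reference, a small sketch of the kind of schema this work targets; whether 
such a column can cross the Arrow boundary is exactly what this ticket tracks, 
so this is illustrative only.

{code:java}
import org.apache.spark.sql.types._

// A MapType column of the sort ArrowConverters would need to handle.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("attrs", MapType(StringType, StringType), nullable = true)
))
{code}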



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-24554) Add MapType Support for Arrow in PySpark

2019-10-16 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reopened SPARK-24554:
--

Reopening this to be completed in 2 steps: first Java after the Arrow 0.15.0 
upgrade, and then pyspark when MapType is available in pyarrow

> Add MapType Support for Arrow in PySpark
> 
>
> Key: SPARK-24554
> URL: https://issues.apache.org/jira/browse/SPARK-24554
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Priority: Major
>  Labels: bulk-closed
>
> Add support for MapType in Arrow related classes in Scala/Java and pyarrow 
> functionality in Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27177) Update jenkins locale to en_US.UTF-8

2019-10-16 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953045#comment-16953045
 ] 

Shane Knapp commented on SPARK-27177:
-

this is done!

> Update jenkins locale to en_US.UTF-8
> 
>
> Key: SPARK-27177
> URL: https://issues.apache.org/jira/browse/SPARK-27177
> Project: Spark
>  Issue Type: Bug
>  Components: Build, jenkins
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Shane Knapp
>Priority: Major
>
> Two test cases will fail on our jenkins since HADOOP-12045 (Hadoop 2.8.0). 
> I'd like to update our jenkins locale to en_US.UTF-8 to work around this issue.
>  How to reproduce:
> {code:java}
> export LANG=
> git clone https://github.com/apache/spark.git && cd spark && git checkout 
> v2.4.0
> build/sbt "hive/testOnly *.HiveDDLSuite" -Phive -Phadoop-2.7 
> -Dhadoop.version=2.8.0
> {code}
> Stack trace:
> {noformat}
> Caused by: sbt.ForkMain$ForkError: java.nio.file.InvalidPathException: 
> Malformed input or input contains unmappable characters: 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/target/tmp/warehouse-15474fdf-0808-40ab-946d-1309fb05bf26/DaTaBaSe_I.db/tab_ı
>   at sun.nio.fs.UnixPath.encode(UnixPath.java:147)
>   at sun.nio.fs.UnixPath.<init>(UnixPath.java:71)
>   at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
>   at java.io.File.toPath(File.java:2234)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getLastAccessTime(RawLocalFileSystem.java:683)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.<init>(RawLocalFileSystem.java:694)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:664)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:987)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:656)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
>   at org.apache.hadoop.hive.metastore.Warehouse.isDir(Warehouse.java:520)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1436)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1503)
> {noformat}
> Workaround:
> {code:java}
> export LANG=en_US.UTF-8
> build/sbt "hive/testOnly *.HiveDDLSuite" -Phive -Phadoop-2.7 
> -Dhadoop.version=2.8.0
> {code}
> More details: 
> https://issues.apache.org/jira/browse/HADOOP-16180
> https://github.com/apache/spark/pull/24044/commits/4c1ec25d3bc64bf358edf1380a7c863596722362



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27177) Update jenkins locale to en_US.UTF-8

2019-10-16 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp resolved SPARK-27177.
-
Resolution: Fixed

> Update jenkins locale to en_US.UTF-8
> 
>
> Key: SPARK-27177
> URL: https://issues.apache.org/jira/browse/SPARK-27177
> Project: Spark
>  Issue Type: Bug
>  Components: Build, jenkins
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Shane Knapp
>Priority: Major
>
> Two test cases will fail on our jenkins since HADOOP-12045 (Hadoop 2.8.0). 
> I'd like to update our jenkins locale to en_US.UTF-8 to work around this issue.
>  How to reproduce:
> {code:java}
> export LANG=
> git clone https://github.com/apache/spark.git && cd spark && git checkout 
> v2.4.0
> build/sbt "hive/testOnly *.HiveDDLSuite" -Phive -Phadoop-2.7 
> -Dhadoop.version=2.8.0
> {code}
> Stack trace:
> {noformat}
> Caused by: sbt.ForkMain$ForkError: java.nio.file.InvalidPathException: 
> Malformed input or input contains unmappable characters: 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/target/tmp/warehouse-15474fdf-0808-40ab-946d-1309fb05bf26/DaTaBaSe_I.db/tab_ı
>   at sun.nio.fs.UnixPath.encode(UnixPath.java:147)
>   at sun.nio.fs.UnixPath.<init>(UnixPath.java:71)
>   at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
>   at java.io.File.toPath(File.java:2234)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getLastAccessTime(RawLocalFileSystem.java:683)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.<init>(RawLocalFileSystem.java:694)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:664)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:987)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:656)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
>   at org.apache.hadoop.hive.metastore.Warehouse.isDir(Warehouse.java:520)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1436)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1503)
> {noformat}
> Workaround:
> {code:java}
> export LANG=en_US.UTF-8
> build/sbt "hive/testOnly *.HiveDDLSuite" -Phive -Phadoop-2.7 
> -Dhadoop.version=2.8.0
> {code}
> More details: 
> https://issues.apache.org/jira/browse/HADOOP-16180
> https://github.com/apache/spark/pull/24044/commits/4c1ec25d3bc64bf358edf1380a7c863596722362



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29494) ArrayOutOfBoundsException when converting from string to timestamp

2019-10-16 Thread Rahul Shivu Mahadev (Jira)
Rahul Shivu Mahadev created SPARK-29494:
---

 Summary: ArrayOutOfBoundsException when converting from string to 
timestamp
 Key: SPARK-29494
 URL: https://issues.apache.org/jira/browse/SPARK-29494
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
Reporter: Rahul Shivu Mahadev


In a couple of scenarios while converting from String to Timestamp, 
`DateTimeUtils.stringToTimestamp` throws an array out of bounds exception if 
there are trailing spaces or a trailing ':'. The behavior of this method 
requires it to return `None` when the format of the string is incorrect.
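
A hedged illustration of the kind of input described above; the exact failing 
strings are guesses based on the description, not taken from a test case. Per 
the ticket, parsing such strings should yield null/None rather than throw.

{code:java}
import spark.implicits._

// Strings with a trailing space or a trailing ':' as examples of malformed
// timestamps; casting should return null, not raise an exception.
val df = Seq("2019-10-16 12:00:00 ", "2019-10-16 12:00:00:").toDF("ts_str")
df.selectExpr("CAST(ts_str AS TIMESTAMP) AS ts").show(false)
{code}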



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29041) Allow createDataFrame to accept bytes as binary type

2019-10-16 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953163#comment-16953163
 ] 

Ruslan Dautkhanov commented on SPARK-29041:
---

[~hyukjin.kwon] thanks for getting back on this. I see discussion in the PR 
regarding Python 2 and Python 3, but no discussion about applying that patch to 
Spark 2.3... what am I missing?

Thanks.

> Allow createDataFrame to accept bytes as binary type
> 
>
> Key: SPARK-29041
> URL: https://issues.apache.org/jira/browse/SPARK-29041
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> spark.createDataFrame([[b"abcd"]], "col binary")
> {code}
> simply fails as below:
> in Python 3
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/sql/session.py", line 787, in 
> createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
>   File "/.../spark/python/pyspark/sql/session.py", line 442, in 
> _createFromLocal
> data = list(data)
>   File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare
> verify_func(obj)
>   File "/.../forked/spark/python/pyspark/sql/types.py", line 1403, in verify
> verify_value(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct
> verifier(v)
>   File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
> verify_value(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default
> verify_acceptable_types(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1282, in 
> verify_acceptable_types
> % (dataType, obj, type(obj)))
> TypeError: field col: BinaryType can not accept object b'abcd' in type <class 'bytes'>
> {code}
> in Python 2:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/sql/session.py", line 787, in 
> createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
>   File "/.../spark/python/pyspark/sql/session.py", line 442, in 
> _createFromLocal
> data = list(data)
>   File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare
> verify_func(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
> verify_value(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct
> verifier(v)
>   File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
> verify_value(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default
> verify_acceptable_types(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1282, in 
> verify_acceptable_types
> % (dataType, obj, type(obj)))
> TypeError: field col: BinaryType can not accept object 'abcd' in type <type 'str'>
> {code}
> {{bytes}} should also be accepted as binary type



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29472) Mechanism for Excluding Jars at Launch for YARN

2019-10-16 Thread BoYang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953209#comment-16953209
 ] 

BoYang commented on SPARK-29472:


This is a pretty good feature, helping to solve production issues when there are 
jar file conflicts!

> Mechanism for Excluding Jars at Launch for YARN
> ---
>
> Key: SPARK-29472
> URL: https://issues.apache.org/jira/browse/SPARK-29472
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 2.4.4
>Reporter: Abhishek Modi
>Priority: Minor
>
> *Summary*
> It would be convenient if there were an easy way to exclude jars from Spark’s 
> classpath at launch time. This would complement the way in which jars can be 
> added to the classpath using {{extraClassPath}}.
>  
> *Context*
> The Spark build contains its dependency jars in the {{/jars}} directory. 
> These jars become part of the executor’s classpath. By default on YARN, these 
> jars are packaged and distributed to containers at launch ({{spark-submit}}) 
> time.
>  
> While developing Spark applications, customers sometimes need to debug using 
> different versions of dependencies. This can become difficult if the 
> dependency (eg. Parquet 1.11.0) is one that Spark already has in {{/jars}} 
> (eg. Parquet 1.10.1 in Spark 2.4), as the dependency included with Spark is 
> preferentially loaded. 
>  
> Configurations such as {{userClassPathFirst}} are available. However these 
> have often come with other side effects. For example, if the customer’s build 
> includes Avro they will likely see {{Caused by: java.lang.LinkageError: 
> loader constraint violation: when resolving method 
> "org.apache.spark.SparkConf.registerAvroSchemas(Lscala/collection/Seq;)Lorg/apache/spark/SparkConf;"
>  the class loader (instance of 
> org/apache/spark/util/ChildFirstURLClassLoader) of the current class, 
> com/uber/marmaray/common/spark/SparkFactory, and the class loader (instance 
> of sun/misc/Launcher$AppClassLoader) for the method's defining class, 
> org/apache/spark/SparkConf, have different Class objects for the type 
> scala/collection/Seq used in the signature}}. Resolving such issues often 
> takes many hours.
>  
> To deal with these sorts of issues, customers often download the Spark build, 
> remove the target jars and then do spark-submit. Other times, customers may 
> not be able to do spark-submit as it is gated behind some Spark Job Server. 
> In this case, customers may try downloading the build, removing the jars, and 
> then using configurations such as {{spark.yarn.dist.jars}} or 
> {{spark.yarn.dist.archives}}. Both of these options are undesirable as they 
> are very operationally heavy, error prone and often result in the customer’s 
> spark builds going out of sync with the authoritative build. 
>  
> *Solution*
> I’d like to propose adding a {{spark.yarn.jars.exclusionRegex}} 
> configuration. Customers could provide a regex such as {{.\*parquet.\*}} and 
> jar files matching this regex would not be included in the driver and 
> executor classpath.
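
A sketch of how the proposed setting might be supplied, assuming it were added 
under the name given above; the key does not exist in released Spark, so this is 
purely illustrative.

{code:java}
import org.apache.spark.sql.SparkSession

// Hypothetical: exclude Spark's bundled Parquet jars so a user-provided
// Parquet version is distributed and loaded instead.
val spark = SparkSession.builder()
  .appName("jar-exclusion-sketch")
  .config("spark.yarn.jars.exclusionRegex", ".*parquet.*")
  .getOrCreate()
{code}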



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29483) Bump Jackson to 2.10.0

2019-10-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29483.
---
Resolution: Fixed

Issue resolved by pull request 26131
[https://github.com/apache/spark/pull/26131]

> Bump Jackson to 2.10.0
> --
>
> Key: SPARK-29483
> URL: https://issues.apache.org/jira/browse/SPARK-29483
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 3.0.0
>
>
> Fixes the following CVE's:
> https://www.cvedetails.com/cve/CVE-2019-16942/
> https://www.cvedetails.com/cve/CVE-2019-16943/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29483) Bump Jackson to 2.10.0

2019-10-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29483:
-

Assignee: Fokko Driesprong

> Bump Jackson to 2.10.0
> --
>
> Key: SPARK-29483
> URL: https://issues.apache.org/jira/browse/SPARK-29483
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 3.0.0
>
>
> Fixes the following CVE's:
> https://www.cvedetails.com/cve/CVE-2019-16942/
> https://www.cvedetails.com/cve/CVE-2019-16943/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27177) Update jenkins locale to en_US.UTF-8

2019-10-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953255#comment-16953255
 ] 

Dongjoon Hyun commented on SPARK-27177:
---

Oh. Thanks, [~shaneknapp]!

> Update jenkins locale to en_US.UTF-8
> 
>
> Key: SPARK-27177
> URL: https://issues.apache.org/jira/browse/SPARK-27177
> Project: Spark
>  Issue Type: Bug
>  Components: Build, jenkins
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Shane Knapp
>Priority: Major
>
> Two test cases will fail on our jenkins since HADOOP-12045 (Hadoop 2.8.0). 
> I'd like to update our jenkins locale to en_US.UTF-8 to work around this issue.
>  How to reproduce:
> {code:java}
> export LANG=
> git clone https://github.com/apache/spark.git && cd spark && git checkout 
> v2.4.0
> build/sbt "hive/testOnly *.HiveDDLSuite" -Phive -Phadoop-2.7 
> -Dhadoop.version=2.8.0
> {code}
> Stack trace:
> {noformat}
> Caused by: sbt.ForkMain$ForkError: java.nio.file.InvalidPathException: 
> Malformed input or input contains unmappable characters: 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/target/tmp/warehouse-15474fdf-0808-40ab-946d-1309fb05bf26/DaTaBaSe_I.db/tab_ı
>   at sun.nio.fs.UnixPath.encode(UnixPath.java:147)
>   at sun.nio.fs.UnixPath.<init>(UnixPath.java:71)
>   at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
>   at java.io.File.toPath(File.java:2234)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getLastAccessTime(RawLocalFileSystem.java:683)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.<init>(RawLocalFileSystem.java:694)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:664)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:987)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:656)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
>   at org.apache.hadoop.hive.metastore.Warehouse.isDir(Warehouse.java:520)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1436)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1503)
> {noformat}
> Workaround:
> {code:java}
> export LANG=en_US.UTF-8
> build/sbt "hive/testOnly *.HiveDDLSuite" -Phive -Phadoop-2.7 
> -Dhadoop.version=2.8.0
> {code}
> More details: 
> https://issues.apache.org/jira/browse/HADOOP-16180
> https://github.com/apache/spark/pull/24044/commits/4c1ec25d3bc64bf358edf1380a7c863596722362



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28008) Default values & column comments in AVRO schema converters

2019-10-16 Thread Terry Moschou (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949125#comment-16949125
 ] 

Terry Moschou edited comment on SPARK-28008 at 10/17/19 1:14 AM:
-

We also have a use case for propagating application specific metadata other 
than {{comment}}, that is currently being dropped by {{SchemaConverters}}. The 
Avro [specification|http://avro.apache.org/docs/current/spec.html#schemas] does 
support user-defined attributes whose names are not reserved:

bq. Attributes not defined in this document are permitted as metadata, but must 
not affect the format of serialized data.

Something like a {{"metadata"}} key would work. I guess {{doc}} could be a 
shortcut for {{metadata.comment}}?

{code:json}
{
  "type": "record",
  "name": "topLevelRecord",
  "fields": [
{
  "name": "a",
  "type": "string",
  "metadata": {
"comment": "AAA",
"foo": "bar"
  }
}
  ]
}
{code}


was (Author: tmoschou):
We also have a use case for propagating application specific metadata other 
than {{comment}}, that is currently being dropped by {{SchemaConverters}}. The 
Avro [specification|http://avro.apache.org/docs/current/spec.html#schemas] does 
support user-defined attributes whose names are not reserved:

bq. Attributes not defined in this document are permitted as metadata, but must 
not affect the format of serialized data.

Some something like a {{"metadata"}} key would work. I guess {{doc}} could be a 
shortcut for {{metadata.comment}}?

{code:json}
{
  "type": "record",
  "name": "topLevelRecord",
  "fields": [
{
  "name": "a",
  "type": "string",
  "metadata": {
"comment": "AAA",
"foo": "bar"
  }
}
  ]
}
{code}

> Default values & column comments in AVRO schema converters
> --
>
> Key: SPARK-28008
> URL: https://issues.apache.org/jira/browse/SPARK-28008
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mathew Wicks
>Priority: Major
>
> Currently in both `toAvroType` and `toSqlType` 
> [SchemaConverters.scala#L134|https://github.com/apache/spark/blob/branch-2.4/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L134]
>  there are two behaviours which are unexpected.
> h2. Nullable fields in spark are converted to UNION[TYPE, NULL] and no 
> default value is set:
> *Current Behaviour:*
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable = true)
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> {
>   "type" : "record",
>   "name" : "topLevelRecord",
>   "fields" : [ {
> "name" : "a",
> "type" : [ "string", "null" ]
>   } ]
> }
> {code}
> *Expected Behaviour:*
> (NOTE: The reversal of "null" & "string" in the union, needed for a default 
> value of null)
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable = true)
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> {
>   "type" : "record",
>   "name" : "topLevelRecord",
>   "fields" : [ {
> "name" : "a",
> "type" : [ "null", "string" ],
> "default" : null
>   } ]
> }{code}
> h2. Field comments/metadata is not propagated:
> *Current Behaviour:*
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable=false, 
> comment="AAA")
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> {
>   "type" : "record",
>   "name" : "topLevelRecord",
>   "fields" : [ {
> "name" : "a",
> "type" : "string"
>   } ]
> }{code}
> *Expected Behaviour:*
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable=false, 
> comment="AAA")
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> {
>   "type" : "record",
>   "name" : "topLevelRecord",
>   "fields" : [ {
> "name" : "a",
> "type" : "string",
> "doc" : "AAA"
>   } ]
> }{code}
>  
> The behaviour should be similar (but the reverse) for `toSqlType`.
> I think we should aim to get this in before 3.0, as it will probably be a 
> breaking change for some usage of the AVRO API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing

2019-10-16 Thread Ladislav Jech (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953304#comment-16953304
 ] 

Ladislav Jech commented on SPARK-24036:
---

Hi [~joseph.torres] - any update on this work guys?

> Stateful operators in continuous processing
> ---
>
> Key: SPARK-24036
> URL: https://issues.apache.org/jira/browse/SPARK-24036
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jose Torres
>Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29283) Error message is hidden when query from JDBC, especially enabled adaptive execution

2019-10-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29283:

Affects Version/s: (was: 2.4.4)

> Error message is hidden when query from JDBC, especially enabled adaptive 
> execution
> ---
>
> Key: SPARK-29283
> URL: https://issues.apache.org/jira/browse/SPARK-29283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> When adaptive execution is enabled, Spark users who connect from JDBC always 
> get an adaptive execution error whatever the underlying root cause is. It's 
> very confusing. We have to check the driver log to find out why.
> {code}
> 0: jdbc:hive2://localhost:1> SELECT * FROM testData join testData2 ON key 
> = v;
> SELECT * FROM testData join testData2 ON key = v;
> Error: Error running query: org.apache.spark.SparkException: Adaptive 
> execution failed due to stage materialization failures. (state=,code=0)
> 0: jdbc:hive2://localhost:1> 
> {code}
> For example, a job queried from JDBC failed due to a missing HDFS block. Users 
> still get the error message {{Adaptive execution failed due to stage 
> materialization failures}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29283) Error message is hidden when query from JDBC, especially enabled adaptive execution

2019-10-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-29283:
---

Assignee: Lantao Jin

> Error message is hidden when query from JDBC, especially enabled adaptive 
> execution
> ---
>
> Key: SPARK-29283
> URL: https://issues.apache.org/jira/browse/SPARK-29283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
>
> When adaptive execution is enabled, Spark users who connect from JDBC always 
> get an adaptive execution error whatever the underlying root cause is. It's 
> very confusing. We have to check the driver log to find out why.
> {code}
> 0: jdbc:hive2://localhost:1> SELECT * FROM testData join testData2 ON key 
> = v;
> SELECT * FROM testData join testData2 ON key = v;
> Error: Error running query: org.apache.spark.SparkException: Adaptive 
> execution failed due to stage materialization failures. (state=,code=0)
> 0: jdbc:hive2://localhost:1> 
> {code}
> For example, a job queried from JDBC failed due to a missing HDFS block. Users 
> still get the error message {{Adaptive execution failed due to stage 
> materialization failures}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29283) Error message is hidden when query from JDBC, especially enabled adaptive execution

2019-10-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-29283.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25960
https://github.com/apache/spark/pull/25960

> Error message is hidden when query from JDBC, especially enabled adaptive 
> execution
> ---
>
> Key: SPARK-29283
> URL: https://issues.apache.org/jira/browse/SPARK-29283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
> Fix For: 3.0.0
>
>
> When adaptive execution is enabled, Spark users who connect from JDBC always 
> get an adaptive execution error whatever the underlying root cause is. It's 
> very confusing. We have to check the driver log to find out why.
> {code}
> 0: jdbc:hive2://localhost:1> SELECT * FROM testData join testData2 ON key 
> = v;
> SELECT * FROM testData join testData2 ON key = v;
> Error: Error running query: org.apache.spark.SparkException: Adaptive 
> execution failed due to stage materialization failures. (state=,code=0)
> 0: jdbc:hive2://localhost:1> 
> {code}
> For example, a job queried from JDBC failed due to a missing HDFS block. Users 
> still get the error message {{Adaptive execution failed due to stage 
> materialization failures}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29496) Add ability to estimate perplexity every X iterations for LDA

2019-10-16 Thread Chris Nardi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nardi updated SPARK-29496:

Summary: Add ability to estimate perplexity every X iterations for LDA  
(was: Add ability to estimate per)

> Add ability to estimate perplexity every X iterations for LDA
> -
>
> Key: SPARK-29496
> URL: https://issues.apache.org/jira/browse/SPARK-29496
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.4
>Reporter: Chris Nardi
>Priority: Major
>
> In gensim, [the LDA 
> model|[https://radimrehurek.com/gensim/models/ldamodel.html]] has a parameter 
> eval_every that allows a user to specify that the model should be evaluated 
> every X iterations to determine its log perplexity. This helps to determine 
> convergence of the model, and whether or not the proper number of iterations 
> has been chosen. Spark has no similar functionality in its implementation of 
> LDA. This should be added, as it appears the only way to achieve this 
> functionality would be to train models of varying numbers of iterations and 
> evaluate each's log perplexity.
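
As context, a rough sketch of the workaround mentioned above: train a separate 
model per iteration budget and evaluate each one. {{corpus}} is an assumed 
DataFrame with a "features" vector column; it is not defined here.

{code:java}
import org.apache.spark.ml.clustering.LDA

// Workaround sketch: Spark ML's LDA has no eval_every-style option, so train
// one model per maxIter value and compare logPerplexity by hand.
// `corpus` is assumed to be a DataFrame with a "features" vector column.
val perplexities = Seq(10, 20, 50).map { iters =>
  val model = new LDA().setK(10).setMaxIter(iters).fit(corpus)
  iters -> model.logPerplexity(corpus)
}
perplexities.foreach { case (iters, lp) =>
  println(s"maxIter=$iters logPerplexity=$lp")
}
{code}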



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29496) Add ability to estimate per

2019-10-16 Thread Chris Nardi (Jira)
Chris Nardi created SPARK-29496:
---

 Summary: Add ability to estimate per
 Key: SPARK-29496
 URL: https://issues.apache.org/jira/browse/SPARK-29496
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.4.4
Reporter: Chris Nardi


In gensim, [the LDA 
model|[https://radimrehurek.com/gensim/models/ldamodel.html]] has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29495) Add ability to estimate per

2019-10-16 Thread Chris Nardi (Jira)
Chris Nardi created SPARK-29495:
---

 Summary: Add ability to estimate per
 Key: SPARK-29495
 URL: https://issues.apache.org/jira/browse/SPARK-29495
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.4.4
Reporter: Chris Nardi


In gensim, [the LDA 
model|[https://radimrehurek.com/gensim/models/ldamodel.html]] has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29496) Add ability to estimate perplexity every X iterations for LDA

2019-10-16 Thread Chris Nardi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nardi updated SPARK-29496:

Description: In gensim, [the LDA 
model|[https://radimrehurek.com/gensim/models/ldamodel.html]] has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.  (was: In gensim, [the LDA 
model][[https://radimrehurek.com/gensim/models/ldamodel.html]] has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.)

> Add ability to estimate perplexity every X iterations for LDA
> -
>
> Key: SPARK-29496
> URL: https://issues.apache.org/jira/browse/SPARK-29496
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.4
>Reporter: Chris Nardi
>Priority: Major
>
> In gensim, [the LDA 
> model|[https://radimrehurek.com/gensim/models/ldamodel.html]] has a parameter 
> eval_every that allows a user to specify that the model should be evaluated 
> every X iterations to determine its log perplexity. This helps to determine 
> convergence of the model, and whether or not the proper number of iterations 
> has been chosen. Spark has no similar functionality in its implementation of 
> LDA. This should be added, as it appears the only way to achieve this 
> functionality would be to train models of varying numbers of iterations and 
> evaluate each's log perplexity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29496) Add ability to estimate perplexity every X iterations for LDA

2019-10-16 Thread Chris Nardi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nardi updated SPARK-29496:

Description: In gensim, [the LDA 
model][[https://radimrehurek.com/gensim/models/ldamodel.html]] has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.  (was: In gensim, [the LDA 
model|[https://radimrehurek.com/gensim/models/ldamodel.html]] has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.)

> Add ability to estimate perplexity every X iterations for LDA
> -
>
> Key: SPARK-29496
> URL: https://issues.apache.org/jira/browse/SPARK-29496
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.4
>Reporter: Chris Nardi
>Priority: Major
>
> In gensim, [the LDA 
> model][[https://radimrehurek.com/gensim/models/ldamodel.html]] has a 
> parameter eval_every that allows a user to specify that the model should be 
> evaluated every X iterations to determine its log perplexity. This helps to 
> determine convergence of the model, and whether or not the proper number of 
> iterations has been chosen. Spark has no similar functionality in its 
> implementation of LDA. This should be added, as it appears the only way to 
> achieve this functionality would be to train models of varying numbers of 
> iterations and evaluate each's log perplexity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29496) Add ability to estimate perplexity every X iterations for LDA

2019-10-16 Thread Chris Nardi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nardi updated SPARK-29496:

Description: In gensim, [the LDA model 
|[https://radimrehurek.com/gensim/models/ldamodel.html]] has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.  (was: In gensim, [the LDA 
model|[https://radimrehurek.com/gensim/models/ldamodel.html]] has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.)

> Add ability to estimate perplexity every X iterations for LDA
> -
>
> Key: SPARK-29496
> URL: https://issues.apache.org/jira/browse/SPARK-29496
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.4
>Reporter: Chris Nardi
>Priority: Major
>
> In gensim, [the LDA model 
> |[https://radimrehurek.com/gensim/models/ldamodel.html]] has a parameter 
> eval_every that allows a user to specify that the model should be evaluated 
> every X iterations to determine its log perplexity. This helps to determine 
> convergence of the model, and whether or not the proper number of iterations 
> has been chosen. Spark has no similar functionality in its implementation of 
> LDA. This should be added, as it appears the only way to achieve this 
> functionality would be to train models of varying numbers of iterations and 
> evaluate each's log perplexity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29496) Add ability to estimate perplexity every X iterations for LDA

2019-10-16 Thread Chris Nardi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nardi updated SPARK-29496:

Description: In gensim, [the LDA model | 
[https://radimrehurek.com/gensim/models/ldamodel.html]] has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.  (was: In gensim, [the LDA model 
|[https://radimrehurek.com/gensim/models/ldamodel.html]] has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.)

> Add ability to estimate perplexity every X iterations for LDA
> -
>
> Key: SPARK-29496
> URL: https://issues.apache.org/jira/browse/SPARK-29496
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.4
>Reporter: Chris Nardi
>Priority: Major
>
> In gensim, [the LDA model | 
> [https://radimrehurek.com/gensim/models/ldamodel.html]] has a parameter 
> eval_every that allows a user to specify that the model should be evaluated 
> every X iterations to determine its log perplexity. This helps to determine 
> convergence of the model, and whether or not the proper number of iterations 
> has been chosen. Spark has no similar functionality in its implementation of 
> LDA. This should be added, as it appears the only way to achieve this 
> functionality would be to train models of varying numbers of iterations and 
> evaluate each's log perplexity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29496) Add ability to estimate perplexity every X iterations for LDA

2019-10-16 Thread Chris Nardi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nardi updated SPARK-29496:

Description: In gensim, [the LDA 
model|[https://radimrehurek.com/gensim/models/ldamodel.html]]  has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.  (was: In gensim, [the LDA 
mode|[https://radimrehurek.com/gensim/models/ldamodel.html]]  has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.)

> Add ability to estimate perplexity every X iterations for LDA
> -
>
> Key: SPARK-29496
> URL: https://issues.apache.org/jira/browse/SPARK-29496
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.4
>Reporter: Chris Nardi
>Priority: Major
>
> In gensim, [the LDA 
> model|[https://radimrehurek.com/gensim/models/ldamodel.html]]  has a 
> parameter eval_every that allows a user to specify that the model should be 
> evaluated every X iterations to determine its log perplexity. This helps to 
> determine convergence of the model, and whether or not the proper number of 
> iterations has been chosen. Spark has no similar functionality in its 
> implementation of LDA. This should be added, as it appears the only way to 
> achieve this functionality would be to train models of varying numbers of 
> iterations and evaluate each's log perplexity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29496) Add ability to estimate perplexity every X iterations for LDA

2019-10-16 Thread Chris Nardi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nardi updated SPARK-29496:

Description: In gensim, [the LDA 
mode|[https://radimrehurek.com/gensim/models/ldamodel.html]]  has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.  (was: In gensim, [the LDA model | 
[https://radimrehurek.com/gensim/models/ldamodel.html]] has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.)

> Add ability to estimate perplexity every X iterations for LDA
> -
>
> Key: SPARK-29496
> URL: https://issues.apache.org/jira/browse/SPARK-29496
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.4
>Reporter: Chris Nardi
>Priority: Major
>
> In gensim, [the LDA 
> mode|[https://radimrehurek.com/gensim/models/ldamodel.html]]  has a parameter 
> eval_every that allows a user to specify that the model should be evaluated 
> every X iterations to determine its log perplexity. This helps to determine 
> convergence of the model, and whether or not the proper number of iterations 
> has been chosen. Spark has no similar functionality in its implementation of 
> LDA. This should be added, as it appears the only way to achieve this 
> functionality would be to train models of varying numbers of iterations and 
> evaluate each's log perplexity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29496) Add ability to estimate perplexity every X iterations for LDA

2019-10-16 Thread Chris Nardi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nardi updated SPARK-29496:

Description: In gensim, [the LDA 
model|https://radimrehurek.com/gensim/models/ldamodel.html] has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.  (was: In gensim, [the LDA 
model|[https://radimrehurek.com/gensim/models/ldamodel.html] has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.)

> Add ability to estimate perplexity every X iterations for LDA
> -
>
> Key: SPARK-29496
> URL: https://issues.apache.org/jira/browse/SPARK-29496
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.4
>Reporter: Chris Nardi
>Priority: Major
>
> In gensim, [the LDA 
> model|https://radimrehurek.com/gensim/models/ldamodel.html] has a parameter 
> eval_every that allows a user to specify that the model should be evaluated 
> every X iterations to determine its log perplexity. This helps to determine 
> convergence of the model, and whether or not the proper number of iterations 
> has been chosen. Spark has no similar functionality in its implementation of 
> LDA. This should be added, as currently the only way to achieve this 
> functionality appears to be training models with varying numbers of iterations 
> and evaluating the log perplexity of each.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29496) Add ability to estimate perplexity every X iterations for LDA

2019-10-16 Thread Chris Nardi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nardi updated SPARK-29496:

Description: In gensim, [the LDA 
model|https://radimrehurek.com/gensim/models/ldamodel.html] has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.  (was: In gensim, [the LDA 
model|[https://radimrehurek.com/gensim/models/ldamodel.html]]  has a parameter 
eval_every that allows a user to specify that the model should be evaluated 
every X iterations to determine its log perplexity. This helps to determine 
convergence of the model, and whether or not the proper number of iterations 
has been chosen. Spark has no similar functionality in its implementation of 
LDA. This should be added, as it appears the only way to achieve this 
functionality would be to train models of varying numbers of iterations and 
evaluate each's log perplexity.)

> Add ability to estimate perplexity every X iterations for LDA
> -
>
> Key: SPARK-29496
> URL: https://issues.apache.org/jira/browse/SPARK-29496
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.4
>Reporter: Chris Nardi
>Priority: Major
>
> In gensim, [the LDA 
> model|https://radimrehurek.com/gensim/models/ldamodel.html] has a parameter 
> eval_every that allows a user to specify that the model should be evaluated 
> every X iterations to determine its log perplexity. This helps to determine 
> convergence of the model, and whether or not the proper number of iterations 
> has been chosen. Spark has no similar functionality in its implementation of 
> LDA. This should be added, as currently the only way to achieve this 
> functionality appears to be training models with varying numbers of iterations 
> and evaluating the log perplexity of each.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29495) Add ability to estimate per

2019-10-16 Thread Chris Nardi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nardi resolved SPARK-29495.
-
Resolution: Duplicate

> Add ability to estimate per
> ---
>
> Key: SPARK-29495
> URL: https://issues.apache.org/jira/browse/SPARK-29495
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.4
>Reporter: Chris Nardi
>Priority: Major
>
> In gensim, [the LDA 
> model|https://radimrehurek.com/gensim/models/ldamodel.html] has a parameter 
> eval_every that allows a user to specify that the model should be evaluated 
> every X iterations to determine its log perplexity. This helps to determine 
> convergence of the model, and whether or not the proper number of iterations 
> has been chosen. Spark has no similar functionality in its implementation of 
> LDA. This should be added, as currently the only way to achieve this 
> functionality appears to be training models with varying numbers of iterations 
> and evaluating the log perplexity of each.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29438) Failed to get state store in stream-stream join

2019-10-16 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953422#comment-16953422
 ] 

Jungtaek Lim commented on SPARK-29438:
--

Any updates here?

> Failed to get state store in stream-stream join
> ---
>
> Key: SPARK-29438
> URL: https://issues.apache.org/jira/browse/SPARK-29438
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4
>Reporter: Genmao Yu
>Priority: Critical
>
> Now, Spark uses the `TaskPartitionId` to determine the StateStore path.
> {code:java}
> TaskPartitionId   \ 
> StateStoreVersion  --> StoreProviderId -> StateStore
> StateStoreName/  
> {code}
> In Spark stages, the task partition id is determined by the number of tasks. 
> As noted above, the StateStore file path depends on the task partition id. So 
> if the stream-stream join's task partition id changes relative to the last 
> batch, it will read the wrong StateStore data or fail because the expected 
> StateStore data does not exist. This happens in some corner cases. The 
> following is sample pseudocode:
> {code:java}
> val df3 = streamDf1.join(streamDf2)
> val df5 = streamDf3.join(batchDf4)
> val df = df3.union(df5)
> df.writeStream...start()
> {code}
> A simplified DAG like this:
> {code:java}
> DataSourceV2Scan   Scan Relation DataSourceV2Scan   DataSourceV2Scan
>  (streamDf3)|   (streamDf1)(streamDf2)
>  |  |   | |
>   Exchange(200)  Exchange(200)   Exchange(200) Exchange(200)
>  |  |   | | 
>SortSort | |
>  \  /   \ /
>   \/ \   /
> SortMergeJoinStreamingSymmetricHashJoin
>  \ /
>\ /
>  \ /
> Union
> {code}
> Stream-stream join task ids will range from 200 to 399, as these tasks are in 
> the same stage as the `SortMergeJoin`. But when there is no new incoming data 
> in `streamDf3` in some batch, it will generate an empty LocalRelation, and the 
> SortMergeJoin will be replaced with a BroadcastHashJoin. In this case, 
> stream-stream join task ids will range from 1 to 200. Finally, the tasks will 
> derive the wrong StateStore path from the TaskPartitionId and fail with an 
> error reading the state store delta file.
> {code:java}
> LocalTableScan   Scan Relation DataSourceV2Scan   DataSourceV2Scan
>  |  |   | |
> BroadcastExchange   |  Exchange(200) Exchange(200)
>  |  |   | | 
>  |  |   | |
>   \/ \   /
>\ /\ /
>   BroadcastHashJoin StreamingSymmetricHashJoin
>  \ /
>\ /
>  \ /
> Union
> {code}
> In my job, I disabled the auto broadcast join feature (set 
> spark.sql.autoBroadcastJoinThreshold=-1) to work around this bug. We should 
> make the StateStore path deterministic rather than dependent on the 
> TaskPartitionId.
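
To make the workaround above concrete, here is a minimal sketch (assuming an existing
SparkSession named {{spark}}); it only avoids the plan change that shifts the task
partition ids and does not fix the underlying StateStore path issue.

{code:java}
// Workaround sketch only: disable automatic broadcast joins so the plan keeps
// the SortMergeJoin and the stream-stream join tasks keep the same partition
// id range across batches.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
{code}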



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9853) Optimize shuffle fetch of contiguous partition IDs

2019-10-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-9853.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26040
[https://github.com/apache/spark/pull/26040]

> Optimize shuffle fetch of contiguous partition IDs
> --
>
> Key: SPARK-9853
> URL: https://issues.apache.org/jira/browse/SPARK-9853
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Matei Alexandru Zaharia
>Assignee: Matei Alexandru Zaharia
>Priority: Minor
> Fix For: 3.0.0
>
>
> On the map side, we should be able to serve a block representing multiple 
> partition IDs in one block manager request
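
The idea can be illustrated with a small standalone sketch; {{PartitionRange}} and
{{FetchRequest}} below are hypothetical types used only for illustration, not Spark's
internal shuffle classes. Contiguous reduce partition ids served by the same map output
are collapsed into a single ranged request instead of one request per partition.

{code:java}
// Hypothetical types for illustration; not Spark's internal shuffle API.
case class PartitionRange(start: Int, endExclusive: Int)
case class FetchRequest(mapId: Long, range: PartitionRange)

// Collapse sorted, contiguous partition ids into ranged requests so that one
// block manager request can cover several adjacent partitions.
def coalesce(mapId: Long, partitionIds: Seq[Int]): Seq[FetchRequest] =
  partitionIds.sorted.foldLeft(List.empty[PartitionRange]) {
    case (PartitionRange(start, stop) :: rest, pid) if pid == stop =>
      PartitionRange(start, pid + 1) :: rest // extend the current contiguous range
    case (acc, pid) =>
      PartitionRange(pid, pid + 1) :: acc    // start a new range
  }.reverse.map(FetchRequest(mapId, _))

// Example: partitions 3, 4, 5 and 7 become two requests instead of four:
//   coalesce(0L, Seq(3, 4, 5, 7))
//   -> List(FetchRequest(0, PartitionRange(3, 6)), FetchRequest(0, PartitionRange(7, 8)))
{code}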



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9853) Optimize shuffle fetch of contiguous partition IDs

2019-10-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-9853:
--

Assignee: Yuanjian Li  (was: Matei Alexandru Zaharia)

> Optimize shuffle fetch of contiguous partition IDs
> --
>
> Key: SPARK-9853
> URL: https://issues.apache.org/jira/browse/SPARK-9853
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Matei Alexandru Zaharia
>Assignee: Yuanjian Li
>Priority: Minor
> Fix For: 3.0.0
>
>
> On the map side, we should be able to serve a block representing multiple 
> partition IDs in one block manager request



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org