[jira] [Commented] (SPARK-25491) pandas_udf(GROUPED_MAP) fails when using ArrayType(ArrayType(DoubleType()))

2018-09-22 Thread Ofer Fridman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624977#comment-16624977
 ] 

Ofer Fridman commented on SPARK-25491:
--

[~hyukjin.kwon], here are both the exception trace and the pandas versions:

It can be reproduced with both pandas 0.19.2 and 0.23.4.

Full exception trace:
{quote} 

2018-09-23 08:48:30 WARN NativeCodeLoader:62 - Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
2018-09-23 08:48:31 WARN Utils:66 - Service 'SparkUI' could not bind on port 
4040. Attempting port 4041.
[Stage 7:> (93 + 1) / 
100]2018-09-23 08:48:43 ERROR Executor:91 - Exception in task 18.0 in stage 7.0 
(TID 119)
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
 at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:333)
 at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:322)
 at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
 at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:177)
 at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
 at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
 at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
 at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
 Source)
 at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:109)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:392)
 at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:158)
 ... 24 more
2018-09-23 08:48:43 WARN TaskSetManager:66 - Lost task 18.0 in stage 7.0 (TID 
119, localhost, executor driver): org.apache.spark.SparkException: Python 
worker exited unexpectedly (crashed)
 at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:333)
 at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:322)
 at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
 at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:177)
 at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
 at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
 at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
 at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
 Source)
 at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
 at 

[jira] [Resolved] (SPARK-25512) Using RowNumbers in SparkR Dataframe

2018-09-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25512.
--
Resolution: Invalid

Questions should go to the mailing list (see 
https://spark.apache.org/community.html). Let's ask there first and file an 
issue once it's clear that it's an issue in Spark.

> Using RowNumbers in SparkR Dataframe
> 
>
> Key: SPARK-25512
> URL: https://issues.apache.org/jira/browse/SPARK-25512
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Asif Khan
>Priority: Critical
>
> Hi,
> I have a use case where I have a SparkR dataframe and I want to iterate 
> over the dataframe in a for loop using the row numbers of the dataframe. Is 
> it possible?
> The only solution I have now is to collect() the SparkR dataframe into an R 
> dataframe, which brings the entire dataframe onto the driver node, and then 
> iterate over it using row numbers. But as the for loop executes only on the 
> driver node, I don't get the advantage of parallel processing in Spark, which 
> was the whole purpose of using Spark. Please help.
> Thank You,
> Asif Khan



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25511) Map with "null" key not working in spark 2.3

2018-09-22 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624972#comment-16624972
 ] 

Hyukjin Kwon commented on SPARK-25511:
--

ping [~ravi_b_shankar], I will resolve this ticket if there are no objections to 
the above.

> Map with "null" key not working in spark 2.3
> 
>
> Key: SPARK-25511
> URL: https://issues.apache.org/jira/browse/SPARK-25511
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.3.1
>Reporter: Ravi Shankar
>Priority: Major
>
> I had a use case where I was creating a histogram of column values through a 
> UDAF that returns a Map data type. It is basically just a group-by count on a 
> column's values that is returned as a Map. I needed to plug in 
> all invalid values for the column as a "null -> count" entry in the map that was 
> returned. In 2.1.x this was working fine and I could create a Map with 
> "null" as a key. This is not working in 2.3, and I am wondering if this is 
> expected and whether I have to change my application code: 
>  
> {code:java}
> val myList = List(("a", "1"), ("b", "2"), ("a", "3"), (null, "4"))
> val map = myList.toMap
> val data = List(List("sublime", map))
> val rdd = sc.parallelize(data).map(l ⇒ Row.fromSeq(l.toSeq))
> val datasetSchema = StructType(List(StructField("name", StringType, true), 
> StructField("songs", MapType(StringType, StringType, true), true)))
> val df = spark.createDataFrame(rdd, datasetSchema)
> df.take(5).foreach(println)
> {code}
> Output in spark 2.1.x:
> {code:java}
> scala> df.take(5).foreach(println)
> [sublime,Map(a -> 3, b -> 2, null -> 4)]
> {code}
> Output in spark 2.3.x:
> {code:java}
> 2018-09-21 15:35:25 ERROR Executor:91 - Exception in task 2.0 in stage 14.0 
> (TID 39)
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, name), StringType), true, false) AS 
> name#38
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else newInstance(class org.apache.spark.sql.catalyst.util.ArrayBasedMapData) 
> AS songs#39
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:291)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:589)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:589)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException: Null value appeared in 
> non-nullable field:
> If the schema is inferred from a Scala tuple/case class, or a 

[jira] [Updated] (SPARK-25491) pandas_udf(GROUPED_MAP) fails when using ArrayType(ArrayType(DoubleType()))

2018-09-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25491:
-
Summary: pandas_udf(GROUPED_MAP) fails when using 
ArrayType(ArrayType(DoubleType()))  (was: pandas_udf fails when using 
ArrayType(ArrayType(DoubleType())))

> pandas_udf(GROUPED_MAP) fails when using ArrayType(ArrayType(DoubleType()))  
> -
>
> Key: SPARK-25491
> URL: https://issues.apache.org/jira/browse/SPARK-25491
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Linux
> python 2.7.9
> pyspark 2.3.1 (also reproduces on pyspark 2.3.0)
> pyarrow 0.9.0 (working OK when using pyarrow 0.8.0)
>Reporter: Ofer Fridman
>Priority: Major
>
> After upgrading from pyarrow 0.8.0 to pyarrow 0.9.0, using pandas_udf (with 
> PandasUDFType.GROUPED_MAP) results in an error:
> {quote}Caused by: java.io.EOFException
>  at java.io.DataInputStream.readInt(DataInputStream.java:392)
>  at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:158)
>  ... 24 more
> {quote}
> The problem occurs only when using a complex type like 
> ArrayType(ArrayType(DoubleType())); usage of ArrayType(DoubleType()) did not 
> reproduce this issue.
> Here is a simple example that reproduces the issue:
> {code}
> import pandas as pd
> import numpy as np
> from pyspark.sql import SparkSession
> from pyspark.context import SparkContext, SparkConf
> from pyspark.sql.types import *
> import pyspark.sql.functions as sprk_func
> 
> sp_conf = SparkConf().setAppName("stam").setMaster("local[1]").set('spark.driver.memory', '4g')
> sc = SparkContext(conf=sp_conf)
> spark = SparkSession(sc)
> 
> pd_data = pd.DataFrame({'id': (np.random.rand(20) * 10).astype(int)})
> data_df = spark.createDataFrame(pd_data, StructType([StructField('id', IntegerType(), True)]))
> 
> @sprk_func.pandas_udf(StructType([StructField('mat', ArrayType(ArrayType(DoubleType())), True)]), sprk_func.PandasUDFType.GROUPED_MAP)
> def return_mat_group(group):
>     pd_data = pd.DataFrame({'mat': np.random.rand(7, 4, 4).tolist()})
>     return pd_data
> 
> data_df.groupby(data_df.id).apply(return_mat_group).show()
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25491) pandas_udf fails when using ArrayType(ArrayType(DoubleType()))

2018-09-22 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624971#comment-16624971
 ] 

Hyukjin Kwon commented on SPARK-25491:
--

Please specify the pandas version as well. Also, do you mind if I ask for the full 
exception trace?

> pandas_udf fails when using ArrayType(ArrayType(DoubleType()))  
> 
>
> Key: SPARK-25491
> URL: https://issues.apache.org/jira/browse/SPARK-25491
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Linux
> python 2.7.9
> pyspark 2.3.1 (also reproduces on pyspark 2.3.0)
> pyarrow 0.9.0 (working OK when using pyarrow 0.8.0)
>Reporter: Ofer Fridman
>Priority: Major
>
> After upgrading from pyarrow 0.8.0 to pyarrow 0.9.0, using pandas_udf (with 
> PandasUDFType.GROUPED_MAP) results in an error:
> {quote}Caused by: java.io.EOFException
>  at java.io.DataInputStream.readInt(DataInputStream.java:392)
>  at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:158)
>  ... 24 more
> {quote}
> The problem occurs only when using a complex type like 
> ArrayType(ArrayType(DoubleType())); usage of ArrayType(DoubleType()) did not 
> reproduce this issue.
> Here is a simple example that reproduces the issue:
> {code}
> import pandas as pd
> import numpy as np
> from pyspark.sql import SparkSession
> from pyspark.context import SparkContext, SparkConf
> from pyspark.sql.types import *
> import pyspark.sql.functions as sprk_func
> 
> sp_conf = SparkConf().setAppName("stam").setMaster("local[1]").set('spark.driver.memory', '4g')
> sc = SparkContext(conf=sp_conf)
> spark = SparkSession(sc)
> 
> pd_data = pd.DataFrame({'id': (np.random.rand(20) * 10).astype(int)})
> data_df = spark.createDataFrame(pd_data, StructType([StructField('id', IntegerType(), True)]))
> 
> @sprk_func.pandas_udf(StructType([StructField('mat', ArrayType(ArrayType(DoubleType())), True)]), sprk_func.PandasUDFType.GROUPED_MAP)
> def return_mat_group(group):
>     pd_data = pd.DataFrame({'mat': np.random.rand(7, 4, 4).tolist()})
>     return pd_data
> 
> data_df.groupby(data_df.id).apply(return_mat_group).show()
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25506) Spark CSV multiline with CRLF

2018-09-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25506.
--
Resolution: Duplicate

> Spark CSV multiline with CRLF
> -
>
> Key: SPARK-25506
> URL: https://issues.apache.org/jira/browse/SPARK-25506
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.2.0, 2.3.1
> Environment: spark 2.2.0 and 2.3.1
> scala 2.11.8
>Reporter: eugen yushin
>Priority: Major
>
> Spark produces empty rows (or ']' when printing via a call to `collect`) when 
> dealing with a '\r' character at the end of each line in a CSV file. Note that no 
> fields are escaped in the original input file.
> {code:java}
> val multilineDf = sparkSession.read
>   .format("csv")
>   .options(Map("header" -> "true", "inferSchema" -> "false", "escape" -> 
> "\"", "multiLine" -> "true"))
>   .load("src/test/resources/multiLineHeader.csv")
> val df = sparkSession.read
>   .format("csv")
>   .options(Map("header" -> "true", "inferSchema" -> "false", "escape" -> 
> "\""))
>   .load("src/test/resources/multiLineHeader.csv")
> multilineDf.show()
> multilineDf.collect().foreach(println)
> df.show()
> df.collect().foreach(println)
> {code}
> Result:
> {code:java}
> ++-+
> |
> ++-+
> |
> |
> ++-+
> ]
> ]
> +++
> |col1|col2|
> +++
> |   1|   1|
> |   2|   2|
> +++
> [1,1]
> [2,2]
> {code}
> Input file:
> {code:java}
> cat -vt src/test/resources/multiLineHeader.csv
> col1,col2^M
> 1,1^M
> 2,2^M
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25473) PySpark ForeachWriter test fails on Python 3.6 and macOS High Sierra

2018-09-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25473:


Assignee: Hyukjin Kwon

> PySpark ForeachWriter test fails on Python 3.6 and macOS High Sierra
> 
>
> Key: SPARK-25473
> URL: https://issues.apache.org/jira/browse/SPARK-25473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.5.0
>
>
> {code}
> PYSPARK_PYTHON=python3.6 SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests 
> SQLTests
> {code}
> {code}
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> /usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py:766:
>  ResourceWarning: subprocess 27563 is still running
>   ResourceWarning, source=self)
> [Stage 0:>  (0 + 1) / 
> 1]objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in 
> progress in another thread when fork() was called.
> objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in 
> progress in another thread when fork() was called. We cannot safely call it 
> or ignore it in the fork() child process. Crashing instead. Set a breakpoint 
> on objc_initializeAfterForkError to debug.
> ERROR
> ==
> ERROR: test_streaming_foreach_with_simple_function 
> (pyspark.sql.tests.SQLTests)
> --
> Traceback (most recent call last):
>   File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o54.processAllAvailable.
> : org.apache.spark.sql.streaming.StreamingQueryException: Writing job aborted.
> === Streaming Query ===
> Identifier: [id = f508d634-407c-4232-806b-70e54b055c42, runId = 
> 08d1435b-5358-4fb6-b167-811584a3163e]
> Current Committed Offsets: {}
> Current Available Offsets: 
> {FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hrgp/T/tmpolebys1s]:
>  {"logOffset":0}}
> Current State: ACTIVE
> Thread State: RUNNABLE
> Logical Plan:
> FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hrgp/T/tmpolebys1s]
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> Caused by: org.apache.spark.SparkException: Writing job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:91)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783)
>   at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
>   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2783)
>   at 
> 

[jira] [Resolved] (SPARK-25473) PySpark ForeachWriter test fails on Python 3.6 and macOS High Sierra

2018-09-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25473.
--
   Resolution: Fixed
Fix Version/s: 2.5.0

Issue resolved by pull request 22480
[https://github.com/apache/spark/pull/22480]

> PySpark ForeachWriter test fails on Python 3.6 and macOS High Sierra
> 
>
> Key: SPARK-25473
> URL: https://issues.apache.org/jira/browse/SPARK-25473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.5.0
>
>
> {code}
> PYSPARK_PYTHON=python3.6 SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests 
> SQLTests
> {code}
> {code}
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> /usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py:766:
>  ResourceWarning: subprocess 27563 is still running
>   ResourceWarning, source=self)
> [Stage 0:>  (0 + 1) / 
> 1]objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in 
> progress in another thread when fork() was called.
> objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in 
> progress in another thread when fork() was called. We cannot safely call it 
> or ignore it in the fork() child process. Crashing instead. Set a breakpoint 
> on objc_initializeAfterForkError to debug.
> ERROR
> ==
> ERROR: test_streaming_foreach_with_simple_function 
> (pyspark.sql.tests.SQLTests)
> --
> Traceback (most recent call last):
>   File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o54.processAllAvailable.
> : org.apache.spark.sql.streaming.StreamingQueryException: Writing job aborted.
> === Streaming Query ===
> Identifier: [id = f508d634-407c-4232-806b-70e54b055c42, runId = 
> 08d1435b-5358-4fb6-b167-811584a3163e]
> Current Committed Offsets: {}
> Current Available Offsets: 
> {FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hrgp/T/tmpolebys1s]:
>  {"logOffset":0}}
> Current State: ACTIVE
> Thread State: RUNNABLE
> Logical Plan:
> FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hrgp/T/tmpolebys1s]
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> Caused by: org.apache.spark.SparkException: Writing job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:91)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783)
>   at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
>   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2783)
>   at 
> 

[jira] [Comment Edited] (SPARK-25480) Dynamic partitioning + saveAsTable with multiple partition columns create empty directory

2018-09-22 Thread Christopher Burns (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624913#comment-16624913
 ] 

Christopher Burns edited comment on SPARK-25480 at 9/23/18 2:54 AM:


I can confirm this happens with Spark 2.3 / HDFS 2.7.4 + write.parquet()


was (Author: chris-topher):
I can confirm this happens with Spark 2.3 / HDFS 2.7.4

> Dynamic partitioning + saveAsTable with multiple partition columns create 
> empty directory
> -
>
> Key: SPARK-25480
> URL: https://issues.apache.org/jira/browse/SPARK-25480
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Daniel Mateus Pires
>Priority: Minor
> Attachments: dynamic_partitioning.json
>
>
> We use .saveAsTable and dynamic partitioning as our only way to write data to 
> S3 from Spark.
> When only 1 partition column is defined for a table, .saveAsTable behaves as 
> expected:
>  - with Overwrite mode it will create a table if it doesn't exist and write 
> the data
>  - with Append mode it will append to a given partition
>  - with Overwrite mode if the table exists it will overwrite the partition
> If 2 partition columns are used, however, the directory is created on S3 with 
> the SUCCESS file, but no data is actually written.
> Our solution is to check whether the table exists, and if it doesn't, set 
> the partitioning mode back to static before running saveAsTable (a sketch 
> follows the snippet below):
> {code}
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
> df.write.mode("overwrite").partitionBy("year", "month").option("path", 
> "s3://hbc-data-warehouse/integration/users_test").saveAsTable("users_test")
> {code}
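> A rough sketch of the workaround described above (the table name and path come 
> from the snippet; it assumes spark.catalog.tableExists is available in the Spark 
> version in use):
> {code}
> val tableName = "users_test"
> // Fall back to static partition overwrite for the very first write so the data
> // is actually materialized; keep dynamic mode for subsequent overwrites.
> val overwriteMode = if (spark.catalog.tableExists(tableName)) "dynamic" else "static"
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", overwriteMode)
> df.write.mode("overwrite").partitionBy("year", "month")
>   .option("path", "s3://hbc-data-warehouse/integration/users_test")
>   .saveAsTable(tableName)
> {code}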
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25480) Dynamic partitioning + saveAsTable with multiple partition columns create empty directory

2018-09-22 Thread Christopher Burns (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624913#comment-16624913
 ] 

Christopher Burns commented on SPARK-25480:
---

I can confirm this happens with Spark 2.3 / HDFS 2.7.4

> Dynamic partitioning + saveAsTable with multiple partition columns create 
> empty directory
> -
>
> Key: SPARK-25480
> URL: https://issues.apache.org/jira/browse/SPARK-25480
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Daniel Mateus Pires
>Priority: Minor
> Attachments: dynamic_partitioning.json
>
>
> We use .saveAsTable and dynamic partitioning as our only way to write data to 
> S3 from Spark.
> When only 1 partition column is defined for a table, .saveAsTable behaves as 
> expected:
>  - with Overwrite mode it will create a table if it doesn't exist and write 
> the data
>  - with Append mode it will append to a given partition
>  - with Overwrite mode if the table exists it will overwrite the partition
> If 2 partition columns are used, however, the directory is created on S3 with 
> the SUCCESS file, but no data is actually written.
> Our solution is to check whether the table exists, and if it doesn't, set 
> the partitioning mode back to static before running saveAsTable:
> {code}
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
> df.write.mode("overwrite").partitionBy("year", "month").option("path", 
> "s3://hbc-data-warehouse/integration/users_test").saveAsTable("users_test")
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25460) DataSourceV2: Structured Streaming does not respect SessionConfigSupport

2018-09-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624909#comment-16624909
 ] 

Apache Spark commented on SPARK-25460:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/22462

> DataSourceV2: Structured Streaming does not respect SessionConfigSupport
> 
>
> Key: SPARK-25460
> URL: https://issues.apache.org/jira/browse/SPARK-25460
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.5.0
>
>
> {{SessionConfigSupport}} allows session configurations to be passed to a data 
> source as options:
> {code}
> `spark.datasource.$keyPrefix.xxx` into `xxx`
> {code}
> Currently, Structured Streaming does not seem to support this.
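> As a rough illustration (the source name "mysource" and the option key are made 
> up, and the sketch assumes the 2.3/2.4-era interface 
> org.apache.spark.sql.sources.v2.SessionConfigSupport with its single keyPrefix 
> method), a source opts in like this:
> {code}
> import org.apache.spark.sql.sources.v2.{DataSourceV2, SessionConfigSupport}
> 
> class MySource extends DataSourceV2 with SessionConfigSupport {
>   // session confs under spark.datasource.mysource.* should be handed to the
>   // source as plain options
>   override def keyPrefix(): String = "mysource"
> }
> 
> // so that, for example,
> spark.conf.set("spark.datasource.mysource.token", "abc")
> // is expected to reach the source as the option  token -> abc  for
> // readStream/writeStream as well as for batch read/write.
> {code}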



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25460) DataSourceV2: Structured Streaming does not respect SessionConfigSupport

2018-09-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624908#comment-16624908
 ] 

Apache Spark commented on SPARK-25460:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/22529

> DataSourceV2: Structured Streaming does not respect SessionConfigSupport
> 
>
> Key: SPARK-25460
> URL: https://issues.apache.org/jira/browse/SPARK-25460
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.5.0
>
>
> {{SessionConfigSupport}} allows session configurations to be passed to a data 
> source as options:
> {code}
> `spark.datasource.$keyPrefix.xxx` into `xxx`
> {code}
> Currently, Structured Streaming does not seem to support this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25460) DataSourceV2: Structured Streaming does not respect SessionConfigSupport

2018-09-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624907#comment-16624907
 ] 

Apache Spark commented on SPARK-25460:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/22462

> DataSourceV2: Structured Streaming does not respect SessionConfigSupport
> 
>
> Key: SPARK-25460
> URL: https://issues.apache.org/jira/browse/SPARK-25460
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.5.0
>
>
> {{SessionConfigSupport}} allows session configurations to be passed to a data 
> source as options:
> {code}
> `spark.datasource.$keyPrefix.xxx` into `xxx`
> {code}
> Currently, Structured Streaming does not seem to support this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25513) Read zipped CSV and JSON files

2018-09-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25513:


Assignee: (was: Apache Spark)

> Read zipped CSV and JSON files
> --
>
> Key: SPARK-25513
> URL: https://issues.apache.org/jira/browse/SPARK-25513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Spark can read compressed files if there is a compression codec for them. By 
> default, Hadoop provides compressors/decompressors for bzip2, deflate, gzip, 
> lz4 and snappy, but they cannot be used directly for reading zip archives. 
> In general, zip archives can contain multiple entries, but in practice users 
> use zip archives to store only one file. This use case is pretty common in 
> the wild. 
> The ticket aims to support reading zipped CSV and JSON files in multi-line 
> mode when each zip archive contains only one file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25513) Read zipped CSV and JSON files

2018-09-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624818#comment-16624818
 ] 

Apache Spark commented on SPARK-25513:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22528

> Read zipped CSV and JSON files
> --
>
> Key: SPARK-25513
> URL: https://issues.apache.org/jira/browse/SPARK-25513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Spark can read compressed files if there is a compression codec for them. By 
> default, Hadoop provides compressors/decompressors for bzip2, deflate, gzip, 
> lz4 and snappy, but they cannot be used directly for reading zip archives. 
> In general, zip archives can contain multiple entries, but in practice users 
> use zip archives to store only one file. This use case is pretty common in 
> the wild. 
> The ticket aims to support reading zipped CSV and JSON files in multi-line 
> mode when each zip archive contains only one file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25513) Read zipped CSV and JSON files

2018-09-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25513:


Assignee: Apache Spark

> Read zipped CSV and JSON files
> --
>
> Key: SPARK-25513
> URL: https://issues.apache.org/jira/browse/SPARK-25513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Spark can read compressed files if there is a compression codec for them. By 
> default, Hadoop provides compressors/decompressors for bzip2, deflate, gzip, 
> lz4 and snappy, but they cannot be used directly for reading zip archives. 
> In general, zip archives can contain multiple entries, but in practice users 
> use zip archives to store only one file. This use case is pretty common in 
> the wild. 
> The ticket aims to support reading zipped CSV and JSON files in multi-line 
> mode when each zip archive contains only one file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25513) Read zipped CSV and JSON files

2018-09-22 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25513:
--

 Summary: Read zipped CSV and JSON files
 Key: SPARK-25513
 URL: https://issues.apache.org/jira/browse/SPARK-25513
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Spark can read compressed files if there is a compression codec for them. By 
default, Hadoop provides compressors/decompressors for bzip2, deflate, gzip, 
lz4 and snappy, but they cannot be used directly for reading zip archives. 

In general, zip archives can contain multiple entries, but in practice users use 
zip archives to store only one file. This use case is pretty common in the wild. 

The ticket aims to support reading zipped CSV and JSON files in multi-line 
mode when each zip archive contains only one file.
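
Until such support lands, a minimal Scala sketch of a common workaround for the 
single-entry case (the path is illustrative, {{spark}} is assumed to be an active 
SparkSession, and the entry is assumed to be line-delimited UTF-8 text small 
enough to buffer per file; proper multi-line support is exactly what this ticket 
proposes):

{code:java}
import java.util.zip.ZipInputStream
import scala.io.Source

import spark.implicits._

// Read each *.zip as a whole file, unzip its single entry inside the task, and
// hand the resulting lines to the JSON (or CSV) reader as a Dataset[String].
val lines = spark.sparkContext
  .binaryFiles("/path/to/archives/*.zip")
  .flatMap { case (_, stream) =>
    val zis = new ZipInputStream(stream.open())
    Option(zis.getNextEntry) match {
      case Some(_) => Source.fromInputStream(zis, "UTF-8").getLines().toList
      case None    => Nil
    }
  }

val df = spark.read.json(lines.toDS())  // or spark.read.csv(lines.toDS()) for CSV
{code}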



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17952) SparkSession createDataFrame method throws exception for nested JavaBeans

2018-09-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17952:


Assignee: Apache Spark

> SparkSession createDataFrame method throws exception for nested JavaBeans
> -
>
> Key: SPARK-17952
> URL: https://issues.apache.org/jira/browse/SPARK-17952
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.3.0
>Reporter: Amit Baghel
>Assignee: Apache Spark
>Priority: Major
>
> As per latest spark documentation for Java at 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
>  
> {quote}
> Nested JavaBeans and List or Array fields are supported though.
> {quote}
> However nested JavaBean is not working. Please see the below code.
> SubCategory class
> {code}
> public class SubCategory implements Serializable{
>   private String id;
>   private String name;
>   
>   public String getId() {
>   return id;
>   }
>   public void setId(String id) {
>   this.id = id;
>   }
>   public String getName() {
>   return name;
>   }
>   public void setName(String name) {
>   this.name = name;
>   }   
> }
> {code}
> Category class
> {code}
> public class Category implements Serializable{
>   private String id;
>   private SubCategory subCategory;
>   
>   public String getId() {
>   return id;
>   }
>   public void setId(String id) {
>   this.id = id;
>   }
>   public SubCategory getSubCategory() {
>   return subCategory;
>   }
>   public void setSubCategory(SubCategory subCategory) {
>   this.subCategory = subCategory;
>   }
> }
> {code}
> SparkSample class
> {code}
> public class SparkSample {
>   public static void main(String[] args) throws IOException { 
> 
>   SparkSession spark = SparkSession
>   .builder()
>   .appName("SparkSample")
>   .master("local")
>   .getOrCreate();
>   //SubCategory
>   SubCategory sub = new SubCategory();
>   sub.setId("sc-111");
>   sub.setName("Sub-1");
>   //Category
>   Category category = new Category();
>   category.setId("s-111");
>   category.setSubCategory(sub);
>   //categoryList
>   List<Category> categoryList = new ArrayList<>();
>   categoryList.add(category);
>   //DF
>   Dataset<Row> dframe = spark.createDataFrame(categoryList, 
> Category.class);  
>   dframe.show();  
>   }
> }
> {code}
> Above code throws below error.
> {code}
> Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d 
> (of class com.sample.SubCategory)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
>   at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
> 

[jira] [Assigned] (SPARK-17952) SparkSession createDataFrame method throws exception for nested JavaBeans

2018-09-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17952:


Assignee: (was: Apache Spark)

> SparkSession createDataFrame method throws exception for nested JavaBeans
> -
>
> Key: SPARK-17952
> URL: https://issues.apache.org/jira/browse/SPARK-17952
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.3.0
>Reporter: Amit Baghel
>Priority: Major
>
> As per latest spark documentation for Java at 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
>  
> {quote}
> Nested JavaBeans and List or Array fields are supported though.
> {quote}
> However nested JavaBean is not working. Please see the below code.
> SubCategory class
> {code}
> public class SubCategory implements Serializable{
>   private String id;
>   private String name;
>   
>   public String getId() {
>   return id;
>   }
>   public void setId(String id) {
>   this.id = id;
>   }
>   public String getName() {
>   return name;
>   }
>   public void setName(String name) {
>   this.name = name;
>   }   
> }
> {code}
> Category class
> {code}
> public class Category implements Serializable{
>   private String id;
>   private SubCategory subCategory;
>   
>   public String getId() {
>   return id;
>   }
>   public void setId(String id) {
>   this.id = id;
>   }
>   public SubCategory getSubCategory() {
>   return subCategory;
>   }
>   public void setSubCategory(SubCategory subCategory) {
>   this.subCategory = subCategory;
>   }
> }
> {code}
> SparkSample class
> {code}
> public class SparkSample {
>   public static void main(String[] args) throws IOException { 
> 
>   SparkSession spark = SparkSession
>   .builder()
>   .appName("SparkSample")
>   .master("local")
>   .getOrCreate();
>   //SubCategory
>   SubCategory sub = new SubCategory();
>   sub.setId("sc-111");
>   sub.setName("Sub-1");
>   //Category
>   Category category = new Category();
>   category.setId("s-111");
>   category.setSubCategory(sub);
>   //categoryList
>   List<Category> categoryList = new ArrayList<>();
>   categoryList.add(category);
>   //DF
>   Dataset<Row> dframe = spark.createDataFrame(categoryList, 
> Category.class);  
>   dframe.show();  
>   }
> }
> {code}
> Above code throws below error.
> {code}
> Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d 
> (of class com.sample.SubCategory)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
>   at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
>   at 

[jira] [Commented] (SPARK-17952) SparkSession createDataFrame method throws exception for nested JavaBeans

2018-09-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624791#comment-16624791
 ] 

Apache Spark commented on SPARK-17952:
--

User 'michalsenkyr' has created a pull request for this issue:
https://github.com/apache/spark/pull/22527

> SparkSession createDataFrame method throws exception for nested JavaBeans
> -
>
> Key: SPARK-17952
> URL: https://issues.apache.org/jira/browse/SPARK-17952
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.3.0
>Reporter: Amit Baghel
>Priority: Major
>
> As per latest spark documentation for Java at 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
>  
> {quote}
> Nested JavaBeans and List or Array fields are supported though.
> {quote}
> However nested JavaBean is not working. Please see the below code.
> SubCategory class
> {code}
> public class SubCategory implements Serializable{
>   private String id;
>   private String name;
>   
>   public String getId() {
>   return id;
>   }
>   public void setId(String id) {
>   this.id = id;
>   }
>   public String getName() {
>   return name;
>   }
>   public void setName(String name) {
>   this.name = name;
>   }   
> }
> {code}
> Category class
> {code}
> public class Category implements Serializable{
>   private String id;
>   private SubCategory subCategory;
>   
>   public String getId() {
>   return id;
>   }
>   public void setId(String id) {
>   this.id = id;
>   }
>   public SubCategory getSubCategory() {
>   return subCategory;
>   }
>   public void setSubCategory(SubCategory subCategory) {
>   this.subCategory = subCategory;
>   }
> }
> {code}
> SparkSample class
> {code}
> public class SparkSample {
>   public static void main(String[] args) throws IOException { 
> 
>   SparkSession spark = SparkSession
>   .builder()
>   .appName("SparkSample")
>   .master("local")
>   .getOrCreate();
>   //SubCategory
>   SubCategory sub = new SubCategory();
>   sub.setId("sc-111");
>   sub.setName("Sub-1");
>   //Category
>   Category category = new Category();
>   category.setId("s-111");
>   category.setSubCategory(sub);
>   //categoryList
>   List<Category> categoryList = new ArrayList<>();
>   categoryList.add(category);
>   //DF
>   Dataset<Row> dframe = spark.createDataFrame(categoryList, 
> Category.class);  
>   dframe.show();  
>   }
> }
> {code}
> Above code throws below error.
> {code}
> Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d 
> (of class com.sample.SubCategory)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
>   at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
>   at 
> 

[jira] [Commented] (SPARK-24794) DriverWrapper should have both master addresses in -Dspark.master

2018-09-22 Thread Behroz Sikander (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624790#comment-16624790
 ] 

Behroz Sikander commented on SPARK-24794:
-

Can someone please have a look at this PR?

> DriverWrapper should have both master addresses in -Dspark.master
> -
>
> Key: SPARK-24794
> URL: https://issues.apache.org/jira/browse/SPARK-24794
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.2.1
>Reporter: Behroz Sikander
>Priority: Major
>
> In standalone cluster mode, one could launch a Driver with supervise mode 
> enabled. Spark launches the driver with a JVM argument -Dspark.master which 
> is set to [host and port of current 
> master|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L149].
>  
> During the life of the context, the Spark masters can switch for any reason. 
> After that, if the driver dies unexpectedly and comes back up, it tries to 
> connect to the master which was set initially with -Dspark.master, but that 
> master is in STANDBY mode. The context tries multiple times to connect to the 
> standby and then just kills itself.
>  
> *Suggestion:*
> While launching the driver process, the Spark master should use the [spark.master 
> passed as 
> input|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L124]
>  instead of the host and port of the current master. A sketch follows the log 
> excerpt below.
> Log messages that we observe:
>  
> {code:java}
> 2018-07-11 13:03:21,801 INFO appclient-register-master-threadpool-0 
> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: 
> Connecting to master spark://10.100.100.22:7077..
> .
> 2018-07-11 13:03:21,806 INFO netty-rpc-connection-0 
> org.apache.spark.network.client.TransportClientFactory []: Successfully 
> created connection to /10.100.100.22:7077 after 1 ms (0 ms spent in 
> bootstraps)
> .
> 2018-07-11 13:03:41,802 INFO appclient-register-master-threadpool-0 
> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: 
> Connecting to master spark://10.100.100.22:7077...
> .
> 2018-07-11 13:04:01,802 INFO appclient-register-master-threadpool-0 
> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: 
> Connecting to master spark://10.100.100.22:7077...
> .
> 2018-07-11 13:04:21,806 ERROR appclient-registration-retry-thread 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend []: Application 
> has been killed. Reason: All masters are unresponsive! Giving up.{code}
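> For illustration only (the second host is made up): the suggestion amounts to 
> launching the driver with the full HA master list rather than the single 
> resolved master, e.g.
> {code}
> -Dspark.master=spark://10.100.100.22:7077,10.100.100.23:7077
> {code}
> so that a supervised driver restarted later can still reach whichever master is 
> currently ACTIVE.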



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25465) Refactor Parquet test suites in project Hive

2018-09-22 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25465.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 2.5.0

> Refactor Parquet test suites in project Hive
> 
>
> Key: SPARK-25465
> URL: https://issues.apache.org/jira/browse/SPARK-25465
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.5.0
>
>
> Currently the file 
> parquetSuites.scala (https://github.com/apache/spark/blob/f29c2b5287563c0d6f55f936bd5a75707d7b2b1f/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala)
>  is not easily recognizable. 
> When I tried to find the test suites for built-in Parquet conversions for Hive 
> serde, I could only find 
> HiveParquetSuite (https://github.com/apache/spark/blob/f29c2b5287563c0d6f55f936bd5a75707d7b2b1f/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala)
>  within the first few minutes.
> The file name and test suite naming can be revised.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24486) Slow performance reading ArrayType columns

2018-09-22 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624723#comment-16624723
 ] 

Yuming Wang commented on SPARK-24486:
-

I can reproduce this issue. I'm working on it.

> Slow performance reading ArrayType columns
> --
>
> Key: SPARK-24486
> URL: https://issues.apache.org/jira/browse/SPARK-24486
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Luca Canali
>Priority: Minor
>
> We have found an issue of slow performance in one of our applications when 
> running on Spark 2.3.0 (the same workload does not have a performance issue 
> on Spark 2.2.1). We suspect a regression in the area of handling columns of 
> ArrayType. I have built a simplified test case showing a manifestation of the 
> issue to help with troubleshooting:
>  
>  
> {code:java}
> // prepare test data
> val stringListValues=Range(1,3).mkString(",")
> sql(s"select 1 as myid, Array($stringListValues) as myarray from 
> range(2)").repartition(1).write.parquet("file:///tmp/deleteme1")
> // run test
> spark.read.parquet("file:///tmp/deleteme1").limit(1).show(){code}
> Performance measurements:
>  
> On a desktop-size test system, the test runs in about 2 sec using Spark 2.2.1 
> (the runtime goes down to subsecond in subsequent runs) and takes close to 20 
> sec on Spark 2.3.0.
>  
> Additional drill-down using Spark task metrics data shows that in Spark 2.2.1 
> only 2 records are read by this workload, while on Spark 2.3.0 all rows in 
> the file are read, which appears anomalous.
> Example:
> {code:java}
> bin/spark-shell --master local[*] --driver-memory 2g --packages 
> ch.cern.sparkmeasure:spark-measure_2.11:0.11
> val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark) 
> stageMetrics.runAndMeasure(spark.read.parquet("file:///tmp/deleteme1").limit(1).show())
> {code}
>  
>  
> Selected metrics from Spark 2.3.0 run:
>  
> {noformat}
> elapsedTime => 17849 (18 s)
> sum(numTasks) => 11
> sum(recordsRead) => 2
> sum(bytesRead) => 1136448171 (1083.0 MB){noformat}
>  
>  
> From Spark 2.2.1 run:
>  
> {noformat}
> elapsedTime => 1329 (1 s)
> sum(numTasks) => 2
> sum(recordsRead) => 2
> sum(bytesRead) => 269162610 (256.0 MB)
> {noformat}
>  
> Note: Spark built from master (as I write this, June 7th 2018) shows the same 
> behavior as found in Spark 2.3.0.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25512) Using RowNumbers in SparkR Dataframe

2018-09-22 Thread Asif Khan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif Khan updated SPARK-25512:
--
Issue Type: Question  (was: Bug)

> Using RowNumbers in SparkR Dataframe
> 
>
> Key: SPARK-25512
> URL: https://issues.apache.org/jira/browse/SPARK-25512
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Asif Khan
>Priority: Critical
>
> Hi,
> I have a use case where I have a SparkR dataframe and I want to iterate over 
> the dataframe in a for loop using the row numbers of the dataframe. Is it 
> possible?
> The only solution I have now is to collect() the SparkR dataframe into an R 
> dataframe, which brings the entire dataframe onto the driver node, and then 
> iterate over it using row numbers. But as the for loop executes only on the 
> driver node, I don't get the advantage of parallel processing in Spark, which 
> was the whole purpose of using Spark. Please help.
> Thank you,
> Asif Khan



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25512) Using RowNumbers in SparkR Dataframe

2018-09-22 Thread Asif Khan (JIRA)
Asif Khan created SPARK-25512:
-

 Summary: Using RowNumbers in SparkR Dataframe
 Key: SPARK-25512
 URL: https://issues.apache.org/jira/browse/SPARK-25512
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.1
Reporter: Asif Khan


Hi,

I have a use case where I have a SparkR dataframe and I want to iterate over 
the dataframe in a for loop using the row numbers of the dataframe. Is it 
possible?

The only solution I have now is to collect() the SparkR dataframe into an R 
dataframe, which brings the entire dataframe onto the driver node, and then 
iterate over it using row numbers. But as the for loop executes only on the 
driver node, I don't get the advantage of parallel processing in Spark, which 
was the whole purpose of using Spark. Please help.

Thank you,

Asif Khan
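
For reference, a minimal sketch of the usual alternative, shown in the Scala API 
purely for illustration (SparkR exposes the same row_number/window functions, plus 
dapply/gapply for partition- and group-wise work); the data and column names below 
are made up:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder().master("local[*]").appName("row-number-sketch").getOrCreate()
import spark.implicits._

// Toy data standing in for the real dataframe
val df = Seq(("a", 10), ("b", 20), ("c", 30)).toDF("key", "value")

// Attach an explicit row number; an un-partitioned window funnels all rows through
// one partition to assign a global ordering, so this step itself is a bottleneck
val numbered = df.withColumn("rn", row_number().over(Window.orderBy("key")))

// The per-row work then stays distributed instead of running in a driver-side loop
val processed = numbered.map(r => (r.getAs[Int]("rn"), r.getAs[Int]("value") * 2))
processed.show()
{code}
Because of the single-partition window above, genuinely parallel per-group work is 
usually better expressed in SparkR with gapply over a grouping key.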



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25502) [Spark Job History] Empty Page when page number exceeds the retainedTask size

2018-09-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624696#comment-16624696
 ] 

Apache Spark commented on SPARK-25502:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/22526

> [Spark Job History] Empty Page when page number exceeds the retainedTask size 
> --
>
> Key: SPARK-25502
> URL: https://issues.apache.org/jira/browse/SPARK-25502
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> *Steps:*
> 1. Spark is installed and running properly.
> 2. spark.ui.retainedTask=10 ( it is the default value )
> 3. Launch the Spark shell: ./spark-shell --master yarn
> 4. Create a spark-shell application with a single job and 50 tasks:
> val rdd = sc.parallelize(1 to 50, 50)
> rdd.count
> 5. Launch the Job History page and go to the spark-shell application created 
> above under Incomplete Task
> 6. Right click, go to the Job page of the application, and from there click 
> and launch the Stage page
> 7. Launch the Stage Id page for the specific Stage Id of the job created above
> 8. Scroll down and check the task completion summary
> It displays a pagination panel showing *5000 Pages Jump to 1 Show 100 items in 
> a page* and a Go button
> 9. Replace 1 with page number 2333
> *Actual Result:*
> 2 Pagination Panels displayed
> *Expected Result:*
> The Pagination Panel should not display 5000 pages; as the retainedTask value 
> is 10, it should display only 1000 pages because each page holds 100 tasks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25502) [Spark Job History] Empty Page when page number exceeds the retainedTask size

2018-09-22 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624695#comment-16624695
 ] 

shahid commented on SPARK-25502:


I have raised the PR https://github.com/apache/spark/pull/22502

> [Spark Job History] Empty Page when page number exceeds the retainedTask size 
> --
>
> Key: SPARK-25502
> URL: https://issues.apache.org/jira/browse/SPARK-25502
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> *Steps:*
> 1. Spark is installed and running properly.
> 2. spark.ui.retainedTask=10 ( it is the default value )
> 3. Launch the Spark shell: ./spark-shell --master yarn
> 4. Create a spark-shell application with a single job and 50 tasks:
> val rdd = sc.parallelize(1 to 50, 50)
> rdd.count
> 5. Launch the Job History page and go to the spark-shell application created 
> above under Incomplete Task
> 6. Right click, go to the Job page of the application, and from there click 
> and launch the Stage page
> 7. Launch the Stage Id page for the specific Stage Id of the job created above
> 8. Scroll down and check the task completion summary
> It displays a pagination panel showing *5000 Pages Jump to 1 Show 100 items in 
> a page* and a Go button
> 9. Replace 1 with page number 2333
> *Actual Result:*
> 2 Pagination Panels displayed
> *Expected Result:*
> The Pagination Panel should not display 5000 pages; as the retainedTask value 
> is 10, it should display only 1000 pages because each page holds 100 tasks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25502) [Spark Job History] Empty Page when page number exceeds the retainedTask size

2018-09-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25502:


Assignee: Apache Spark

> [Spark Job History] Empty Page when page number exceeds the retainedTask size 
> --
>
> Key: SPARK-25502
> URL: https://issues.apache.org/jira/browse/SPARK-25502
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Apache Spark
>Priority: Major
>
> *Steps:*
> 1. Spark is installed and running properly.
> 2. spark.ui.retainedTask=10 ( it is the default value )
> 3. Launch the Spark shell: ./spark-shell --master yarn
> 4. Create a spark-shell application with a single job and 50 tasks:
> val rdd = sc.parallelize(1 to 50, 50)
> rdd.count
> 5. Launch the Job History page and go to the spark-shell application created 
> above under Incomplete Task
> 6. Right click, go to the Job page of the application, and from there click 
> and launch the Stage page
> 7. Launch the Stage Id page for the specific Stage Id of the job created above
> 8. Scroll down and check the task completion summary
> It displays a pagination panel showing *5000 Pages Jump to 1 Show 100 items in 
> a page* and a Go button
> 9. Replace 1 with page number 2333
> *Actual Result:*
> 2 Pagination Panels displayed
> *Expected Result:*
> The Pagination Panel should not display 5000 pages; as the retainedTask value 
> is 10, it should display only 1000 pages because each page holds 100 tasks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25502) [Spark Job History] Empty Page when page number exceeds the retainedTask size

2018-09-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25502:


Assignee: (was: Apache Spark)

> [Spark Job History] Empty Page when page number exceeds the retainedTask size 
> --
>
> Key: SPARK-25502
> URL: https://issues.apache.org/jira/browse/SPARK-25502
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> *Steps:*
> 1. Spark is installed and running properly.
> 2. spark.ui.retainedTask=10 ( it is the default value )
> 3. Launch the Spark shell: ./spark-shell --master yarn
> 4. Create a spark-shell application with a single job and 50 tasks:
> val rdd = sc.parallelize(1 to 50, 50)
> rdd.count
> 5. Launch the Job History page and go to the spark-shell application created 
> above under Incomplete Task
> 6. Right click, go to the Job page of the application, and from there click 
> and launch the Stage page
> 7. Launch the Stage Id page for the specific Stage Id of the job created above
> 8. Scroll down and check the task completion summary
> It displays a pagination panel showing *5000 Pages Jump to 1 Show 100 items in 
> a page* and a Go button
> 9. Replace 1 with page number 2333
> *Actual Result:*
> 2 Pagination Panels displayed
> *Expected Result:*
> The Pagination Panel should not display 5000 pages; as the retainedTask value 
> is 10, it should display only 1000 pages because each page holds 100 tasks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25503) [Spark Job History] Total task message in stage page is ambiguous

2018-09-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624685#comment-16624685
 ] 

Apache Spark commented on SPARK-25503:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/22525

> [Spark Job History] Total task message in stage page is ambiguous
> -
>
> Key: SPARK-25503
> URL: https://issues.apache.org/jira/browse/SPARK-25503
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> *Steps:*
>  1. Spark is installed and running properly.
>  2. spark.ui.retainedTask=10 ( it is the default value )
>  3. Launch the Spark shell: ./spark-shell --master yarn
>  4. Create a spark-shell application with a single job and 50 tasks:
>  val rdd = sc.parallelize(1 to 50, 50)
>  rdd.count
>  5. Launch the Job History page and go to the spark-shell application created 
> above under Incomplete Task
>  6. Right click, go to the Job page of the application, and from there click 
> and launch the Stage page
>  7. Launch the Stage Id page for the specific Stage Id of the job created above
>  8. Scroll down and check the task message above the Pagination Panel
>  It displays *Task( 10, Showing 50)*
> *Actual Result:*
>  It displayed Task( 10, Showing 50)
> *Expected Result:*
>  Since retainedTask=10, it should show 10 tasks, so the message should be 
> Task( 50, Showing 10)
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25503) [Spark Job History] Total task message in stage page is ambiguous

2018-09-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624684#comment-16624684
 ] 

Apache Spark commented on SPARK-25503:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/22525

> [Spark Job History] Total task message in stage page is ambiguous
> -
>
> Key: SPARK-25503
> URL: https://issues.apache.org/jira/browse/SPARK-25503
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> *Steps:*
>  1. Spark is installed and running properly.
>  2. spark.ui.retainedTask=10 ( it is the default value )
>  3. Launch the Spark shell: ./spark-shell --master yarn
>  4. Create a spark-shell application with a single job and 50 tasks:
>  val rdd = sc.parallelize(1 to 50, 50)
>  rdd.count
>  5. Launch the Job History page and go to the spark-shell application created 
> above under Incomplete Task
>  6. Right click, go to the Job page of the application, and from there click 
> and launch the Stage page
>  7. Launch the Stage Id page for the specific Stage Id of the job created above
>  8. Scroll down and check the task message above the Pagination Panel
>  It displays *Task( 10, Showing 50)*
> *Actual Result:*
>  It displayed Task( 10, Showing 50)
> *Expected Result:*
>  Since retainedTask=10, it should show 10 tasks, so the message should be 
> Task( 50, Showing 10)
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25503) [Spark Job History] Total task message in stage page is ambiguous

2018-09-22 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624683#comment-16624683
 ] 

shahid commented on SPARK-25503:


I have raised the PR https://github.com/apache/spark/pull/22525

> [Spark Job History] Total task message in stage page is ambiguous
> -
>
> Key: SPARK-25503
> URL: https://issues.apache.org/jira/browse/SPARK-25503
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> *Steps:*
>  1. Spark is installed and running properly.
>  2. spark.ui.retainedTask=10 ( it is the default value )
>  3. Launch the Spark shell: ./spark-shell --master yarn
>  4. Create a spark-shell application with a single job and 50 tasks:
>  val rdd = sc.parallelize(1 to 50, 50)
>  rdd.count
>  5. Launch the Job History page and go to the spark-shell application created 
> above under Incomplete Task
>  6. Right click, go to the Job page of the application, and from there click 
> and launch the Stage page
>  7. Launch the Stage Id page for the specific Stage Id of the job created above
>  8. Scroll down and check the task message above the Pagination Panel
>  It displays *Task( 10, Showing 50)*
> *Actual Result:*
>  It displayed Task( 10, Showing 50)
> *Expected Result:*
>  Since retainedTask=10, it should show 10 tasks, so the message should be 
> Task( 50, Showing 10)
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25503) [Spark Job History] Total task message in stage page is ambiguous

2018-09-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25503:


Assignee: Apache Spark

> [Spark Job History] Total task message in stage page is ambiguous
> -
>
> Key: SPARK-25503
> URL: https://issues.apache.org/jira/browse/SPARK-25503
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Apache Spark
>Priority: Major
>
> *Steps:*
>  1. Spark is installed and running properly.
>  2. spark.ui.retainedTask=10 ( it is the default value )
>  3. Launch the Spark shell: ./spark-shell --master yarn
>  4. Create a spark-shell application with a single job and 50 tasks:
>  val rdd = sc.parallelize(1 to 50, 50)
>  rdd.count
>  5. Launch the Job History page and go to the spark-shell application created 
> above under Incomplete Task
>  6. Right click, go to the Job page of the application, and from there click 
> and launch the Stage page
>  7. Launch the Stage Id page for the specific Stage Id of the job created above
>  8. Scroll down and check the task message above the Pagination Panel
>  It displays *Task( 10, Showing 50)*
> *Actual Result:*
>  It displayed Task( 10, Showing 50)
> *Expected Result:*
>  Since retainedTask=10, it should show 10 tasks, so the message should be 
> Task( 50, Showing 10)
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25503) [Spark Job History] Total task message in stage page is ambiguous

2018-09-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25503:


Assignee: (was: Apache Spark)

> [Spark Job History] Total task message in stage page is ambiguous
> -
>
> Key: SPARK-25503
> URL: https://issues.apache.org/jira/browse/SPARK-25503
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> *Steps:*
>  1. Spark is installed and running properly.
>  2. spark.ui.retainedTask=10 ( it is the default value )
>  3. Launch the Spark shell: ./spark-shell --master yarn
>  4. Create a spark-shell application with a single job and 50 tasks:
>  val rdd = sc.parallelize(1 to 50, 50)
>  rdd.count
>  5. Launch the Job History page and go to the spark-shell application created 
> above under Incomplete Task
>  6. Right click, go to the Job page of the application, and from there click 
> and launch the Stage page
>  7. Launch the Stage Id page for the specific Stage Id of the job created above
>  8. Scroll down and check the task message above the Pagination Panel
>  It displays *Task( 10, Showing 50)*
> *Actual Result:*
>  It displayed Task( 10, Showing 50)
> *Expected Result:*
>  Since retainedTask=10, it should show 10 tasks, so the message should be 
> Task( 50, Showing 10)
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25511) Map with "null" key not working in spark 2.3

2018-09-22 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-25511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624642#comment-16624642
 ] 

Michal Šenkýř commented on SPARK-25511:
---

Null keys in MapType should have been disallowed in 2.1.x as per 
[Scaladoc|https://github.com/apache/spark/blob/v2.1.0/sql/catalyst/src/main/scala/org/apache/spark/sql/types/MapType.scala#L26],
 so I believe that behavior was a bug.
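
A minimal sketch of a common workaround under that constraint (not from the ticket; 
the "__null__" sentinel is an arbitrary placeholder): substitute a non-null sentinel 
key before the map reaches a MapType column.
{code:java}
// MapType keys must be non-null, so replace null keys with a sentinel before
// building the map column.
val myList = List(("a", "1"), ("b", "2"), ("a", "3"), (null, "4"))
val nullSafeMap = myList
  .map { case (k, v) => (Option(k).getOrElse("__null__"), v) }
  .toMap
// Map(a -> 3, b -> 2, __null__ -> 4): safe to store in a
// MapType(StringType, StringType, true) column
{code}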

> Map with "null" key not working in spark 2.3
> 
>
> Key: SPARK-25511
> URL: https://issues.apache.org/jira/browse/SPARK-25511
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.3.1
>Reporter: Ravi Shankar
>Priority: Major
>
> I had a use case where I was creating a histogram of column values through a 
> UDAF with a Map data type. It is basically just a group-by count on a column's 
> values that is returned as a Map. I needed to plug in 
> all invalid values for the column as a "null -> count" entry in the map that 
> was returned. In 2.1.x, this was working fine and I could create a Map with 
> "null" as a key. This is not working in 2.3, and I am wondering if this is 
> expected and if I have to change my application code: 
>  
> {code:java}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{MapType, StringType, StructField, StructType}
> val myList = List(("a", "1"), ("b", "2"), ("a", "3"), (null, "4"))
> val map = myList.toMap
> val data = List(List("sublime", map))
> val rdd = sc.parallelize(data).map(l ⇒ Row.fromSeq(l.toSeq))
> val datasetSchema = StructType(List(StructField("name", StringType, true), 
> StructField("songs", MapType(StringType, StringType, true), true)))
> val df = spark.createDataFrame(rdd, datasetSchema)
> df.take(5).foreach(println)
> {code}
> Output in spark 2.1.x:
> {code:java}
> scala> df.take(5).foreach(println)
> [sublime,Map(a -> 3, b -> 2, null -> 4)]
> {code}
> Output in spark 2.3.x:
> {code:java}
> 2018-09-21 15:35:25 ERROR Executor:91 - Exception in task 2.0 in stage 14.0 
> (TID 39)
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, name), StringType), true, false) AS 
> name#38
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else newInstance(class org.apache.spark.sql.catalyst.util.ArrayBasedMapData) 
> AS songs#39
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:291)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:589)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:589)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 

[jira] [Commented] (SPARK-24018) Spark-without-hadoop package fails to create or read parquet files with snappy compression

2018-09-22 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624628#comment-16624628
 ] 

Yuming Wang commented on SPARK-24018:
-

It may be fixed by 
[SPARK-24927|https://issues.apache.org/jira/browse/SPARK-24927].


> Spark-without-hadoop package fails to create or read parquet files with 
> snappy compression
> --
>
> Key: SPARK-24018
> URL: https://issues.apache.org/jira/browse/SPARK-24018
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.0
>Reporter: Jean-Francis Roy
>Priority: Minor
>
> On a brand-new installation of Spark 2.3.0 with a user-provided hadoop-2.8.3, 
> Spark fails to read or write dataframes in parquet format with snappy 
> compression.
> This is due to an incompatibility between the snappy-java version that is 
> required by parquet (parquet is provided in Spark jars but snappy isn't) and 
> the version that is available from hadoop-2.8.3.
>  
> Steps to reproduce:
>  * Download and extract hadoop-2.8.3
>  * Download and extract spark-2.3.0-without-hadoop
>  * export JAVA_HOME, HADOOP_HOME, SPARK_HOME, PATH
>  * Following instructions from 
> [https://spark.apache.org/docs/latest/hadoop-provided.html], set 
> SPARK_DIST_CLASSPATH=$(hadoop classpath) in spark-env.sh
>  * Start a spark-shell, enter the following:
>  
> {code:java}
> import spark.implicits._
> val df = List(1, 2, 3, 4).toDF
> df.write
>   .format("parquet")
>   .option("compression", "snappy")
>   .mode("overwrite")
>   .save("test.parquet")
> {code}
>  
>  
> This fails with the following:
> {noformat}
> java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
> at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
> at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
> at 
> org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
> at 
> org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
> at 
> org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
> at 
> org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
> at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
> at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
> at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
> at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
> at 
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
> org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){noformat}
>  
>   Downloading snappy-java-1.1.2.6.jar and placing it in 

[jira] [Commented] (SPARK-25497) limit operation within whole stage codegen should not consume all the inputs

2018-09-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624579#comment-16624579
 ] 

Apache Spark commented on SPARK-25497:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/22524

> limit operation within whole stage codegen should not consume all the inputs
> 
>
> Key: SPARK-25497
> URL: https://issues.apache.org/jira/browse/SPARK-25497
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> This issue was discovered during https://github.com/apache/spark/pull/21738 . 
> It turns out that limit is not whole-stage-codegened correctly and always 
> consumes all the inputs.
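
For context, a minimal sketch (assuming a local session; not taken from the PR) of a 
query shape where the limit is planned inside whole-stage codegen rather than as a 
final CollectLimit:
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("limit-codegen-sketch").getOrCreate()

// A limit followed by a projection keeps LocalLimit/GlobalLimit inside the
// generated pipelines instead of ending the query with a driver-side CollectLimit.
val q = spark.range(0L, 1000000L).limit(5).selectExpr("id * 2 AS doubled")
q.explain()  // the limit operators appear inside WholeStageCodegen stages
q.collect()  // per this ticket, the generated loop kept consuming input rows
             // instead of stopping once the limit was satisfied
{code}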



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25497) limit operation within whole stage codegen should not consume all the inputs

2018-09-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25497:


Assignee: (was: Apache Spark)

> limit operation within whole stage codegen should not consume all the inputs
> 
>
> Key: SPARK-25497
> URL: https://issues.apache.org/jira/browse/SPARK-25497
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> This issue was discovered during https://github.com/apache/spark/pull/21738 . 
> It turns out that limit is not whole-stage-codegened correctly and always 
> consumes all the inputs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25497) limit operation within whole stage codegen should not consume all the inputs

2018-09-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25497:


Assignee: Apache Spark

> limit operation within whole stage codegen should not consume all the inputs
> 
>
> Key: SPARK-25497
> URL: https://issues.apache.org/jira/browse/SPARK-25497
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>
> This issue was discovered during https://github.com/apache/spark/pull/21738 . 
> It turns out that limit is not whole-stage-codegened correctly and always 
> consumes all the inputs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25501) Kafka delegation token support

2018-09-22 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624511#comment-16624511
 ] 

Jungtaek Lim commented on SPARK-25501:
--

+1 for this. This should help remedy the need to pass the keytab file to the 
executor side.

It might be better to have a design doc to make clear how it works and how we 
will integrate it into Spark. I'm also curious whether we have a secure way to 
transmit the token to tasks/executors.

> Kafka delegation token support
> --
>
> Key: SPARK-25501
> URL: https://issues.apache.org/jira/browse/SPARK-25501
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Delegation token support was released in Kafka version 1.1. As Spark has 
> updated its Kafka client to 2.0.0, it is now possible to implement delegation 
> token support. Please see the description: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10816) EventTime based sessionization

2018-09-22 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623749#comment-16623749
 ] 

Jungtaek Lim edited comment on SPARK-10816 at 9/22/18 6:00 AM:
---

[~msukmanowsky]

Thanks for showing your interest in this! If you are interested in my proposal 
you can even pull my patch and build it (though it is marked as WIP, it already 
works... handling state is just a bit suboptimal) and play with the custom build. 
(If you read my proposal you may have noticed that it addresses the missing part 
you're seeing in map/flatMapGroupsWithState.)

Your scenario looks like a fit for a simple gap window, with event time & 
watermark. I guess with my patch it would be represented as a SQL statement like:
{code:java}
SELECT userId, session.start AS startTimestampMs, session.end AS 
endTimestampMs, session.end - session.start AS durationMs FROM tbl GROUP BY 
userId, session(event_time, 30 minutes){code}
or as the DSL shown in the links below.

(Here session.end is defined as where the gap ends. If you would like to have the 
last event timestamp in the session, max(event_time) would work.)

In append mode you can only see the sessions which are evicted, and in update 
mode you can see all updated sessions for every batch.

I also added UTs which convert the structured sessionization example into the 
session window function. Please check them out and let me know if something 
doesn't work as you expect.

Append mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L475-L573]

Update mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L618-L715]

The query will work for both batch and streaming without modification of the SQL 
statement or DSL. For batch it doesn't leverage state. 


was (Author: kabhwan):
[~msukmanowsky]

Thanks for showing your interest on this! If you are interested on my proposal 
you can even pull my patch and build (though it is marked as WIP even now it 
works... handling state is just a bit suboptimal), and play with custom build. 
(If you read my proposal you may be noticed that the proposal addresses the 
missing part you're seeing from map/flatMapGroupsWithState.)

Your scenario looks like fit to simple gap window, with event time & watermark. 
I guess with my patch it would be represented as SQL statement like:
{code:java}
SELECT userId, session.start AS startTimestampMs, session.end AS 
endTimestampMs, session.end - session.start AS durationMs FROM tbl GROUP BY 
session(event_time, 30 minutes){code}
or DSL in below links.

(Here session.end is defined as gap ends. If you would like to have last event 
timestamp in session, max(event_time) would work.)

In append mode you can only see the sessions which are evicted, and in update 
mode you can see all updated sessions for every batch.

I also added UTs which converts structured sessionization example into session 
window function. Please check it out and let me know if something doesn't work 
as you expect.

Append mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L475-L573]

Update mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L618-L715]

The query will work for both batch and streaming without modification of SQL 
statement or DSL. For batch it doesn't leverage state. 

> EventTime based sessionization
> --
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-10816 Support session window natively.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org