[jira] [Updated] (SPARK-24976) Allow None for Decimal type conversion (specific to PyArrow 0.9.0)

2018-07-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24976:
-
Summary: Allow None for Decimal type conversion (specific to PyArrow 0.9.0) 
 (was: Allow None for Decimal type conversion (specific to Arrow 0.9.0))

> Allow None for Decimal type conversion (specific to PyArrow 0.9.0)
> --
>
> Key: SPARK-24976
> URL: https://issues.apache.org/jira/browse/SPARK-24976
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See https://jira.apache.org/jira/browse/ARROW-2432
> If we use Arrow 0.9.0, the test case (None as decimal) fails as below:
> {code}
> Traceback (most recent call last):
>   File "/.../spark/python/pyspark/sql/tests.py", line 4672, in 
> test_vectorized_udf_null_decimal
> self.assertEquals(df.collect(), res.collect())
>   File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect
> sock_info = self._jdf.collectToPython()
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
> 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o51.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 
> in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 
> (TID 7, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/.../spark/python/pyspark/worker.py", line 320, in main
> process()
>   File "/.../spark/python/pyspark/worker.py", line 315, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream
> batch = _create_batch(series, self._timezone)
>   File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch
> arrs = [create_array(s, t) for s, t in series]
>   File "/.../spark/python/pyspark/serializers.py", line 241, in create_array
> return pa.Array.from_pandas(s, mask=mask, type=t)
>   File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code} 






[jira] [Assigned] (SPARK-24976) Allow None for Decimal type conversion (specific to Arrow 0.9.0)

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24976:


Assignee: Apache Spark

> Allow None for Decimal type conversion (specific to Arrow 0.9.0)
> 
>
> Key: SPARK-24976
> URL: https://issues.apache.org/jira/browse/SPARK-24976
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> See https://jira.apache.org/jira/browse/ARROW-2432
> If we use Arrow 0.9.0, the test case (None as decimal) fails as below:
> {code}
> Traceback (most recent call last):
>   File "/.../spark/python/pyspark/sql/tests.py", line 4672, in 
> test_vectorized_udf_null_decimal
> self.assertEquals(df.collect(), res.collect())
>   File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect
> sock_info = self._jdf.collectToPython()
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
> 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o51.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 
> in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 
> (TID 7, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/.../spark/python/pyspark/worker.py", line 320, in main
> process()
>   File "/.../spark/python/pyspark/worker.py", line 315, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream
> batch = _create_batch(series, self._timezone)
>   File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch
> arrs = [create_array(s, t) for s, t in series]
>   File "/.../spark/python/pyspark/serializers.py", line 241, in create_array
> return pa.Array.from_pandas(s, mask=mask, type=t)
>   File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code} 






[jira] [Assigned] (SPARK-24976) Allow None for Decimal type conversion (specific to Arrow 0.9.0)

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24976:


Assignee: (was: Apache Spark)

> Allow None for Decimal type conversion (specific to Arrow 0.9.0)
> 
>
> Key: SPARK-24976
> URL: https://issues.apache.org/jira/browse/SPARK-24976
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See https://jira.apache.org/jira/browse/ARROW-2432
> If we use Arrow 0.9.0, the test case (None as decimal) fails as below:
> {code}
> Traceback (most recent call last):
>   File "/.../spark/python/pyspark/sql/tests.py", line 4672, in 
> test_vectorized_udf_null_decimal
> self.assertEquals(df.collect(), res.collect())
>   File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect
> sock_info = self._jdf.collectToPython()
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
> 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o51.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 
> in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 
> (TID 7, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/.../spark/python/pyspark/worker.py", line 320, in main
> process()
>   File "/.../spark/python/pyspark/worker.py", line 315, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream
> batch = _create_batch(series, self._timezone)
>   File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch
> arrs = [create_array(s, t) for s, t in series]
>   File "/.../spark/python/pyspark/serializers.py", line 241, in create_array
> return pa.Array.from_pandas(s, mask=mask, type=t)
>   File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code} 






[jira] [Commented] (SPARK-24976) Allow None for Decimal type conversion (specific to Arrow 0.9.0)

2018-07-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563117#comment-16563117
 ] 

Apache Spark commented on SPARK-24976:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/21928

> Allow None for Decimal type conversion (specific to Arrow 0.9.0)
> 
>
> Key: SPARK-24976
> URL: https://issues.apache.org/jira/browse/SPARK-24976
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See https://jira.apache.org/jira/browse/ARROW-2432
> If we use Arrow 0.9.0, the test case (None as decimal) fails as below:
> {code}
> Traceback (most recent call last):
>   File "/.../spark/python/pyspark/sql/tests.py", line 4672, in 
> test_vectorized_udf_null_decimal
> self.assertEquals(df.collect(), res.collect())
>   File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect
> sock_info = self._jdf.collectToPython()
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
> 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o51.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 
> in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 
> (TID 7, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/.../spark/python/pyspark/worker.py", line 320, in main
> process()
>   File "/.../spark/python/pyspark/worker.py", line 315, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream
> batch = _create_batch(series, self._timezone)
>   File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch
> arrs = [create_array(s, t) for s, t in series]
>   File "/.../spark/python/pyspark/serializers.py", line 241, in create_array
> return pa.Array.from_pandas(s, mask=mask, type=t)
>   File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code} 






[jira] [Updated] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`

2018-07-30 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-24930:
--
 Due Date: (was: 26/Jul/18)
Affects Version/s: (was: 2.2.1)
   (was: 2.3.0)
 Priority: Minor  (was: Major)

>  Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
> --
>
> Key: SPARK-24930
> URL: https://issues.apache.org/jira/browse/SPARK-24930
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Xiaochen Ouyang
>Priority: Minor
>
> # As the root user, create a test.txt file containing a record '123' in the 
> /root/ directory
>  # Switch to the mr user and execute spark-shell --master local
> {code:java}
> scala> spark.version
> res2: String = 2.2.1
> scala> spark.sql("create table t1(id int) partitioned by(area string)");
> 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: 
> Location: hdfs://nameservice/spark/t1 specified for non-external table:t1
> res4: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("load data local inpath '/root/test.txt' into table t1 
> partition(area ='025')")
> org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: 
> /root/test.txt;
>  at 
> org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639)
>  ... 48 elided
> scala>
> {code}
> In fact, the input path exists, but the mr user does not have permission to 
> access the directory `/root/`, so the message thrown by `AnalysisException` 
> can confuse users.
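A quick illustration of why the existence check is misleading (the path below is the one from the report; the snippet itself is only a sketch, not the Spark code):

{code:python}
import errno
import os

# For a non-root user, stat("/root/test.txt") fails with EACCES because /root is
# not traversable, so a plain existence check reports "missing" even though the
# file is there. Distinguishing the errno would allow a clearer error message.
path = "/root/test.txt"
try:
    os.stat(path)
    print("path exists and is accessible")
except OSError as e:
    if e.errno == errno.EACCES:
        print("permission denied: the path may exist but is not accessible")
    elif e.errno == errno.ENOENT:
        print("path really does not exist")
    else:
        raise
{code}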






[jira] [Created] (SPARK-24976) Allow None for Decimal type conversion (specific to Arrow 0.9.0)

2018-07-30 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-24976:


 Summary: Allow None for Decimal type conversion (specific to Arrow 
0.9.0)
 Key: SPARK-24976
 URL: https://issues.apache.org/jira/browse/SPARK-24976
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Hyukjin Kwon


See https://jira.apache.org/jira/browse/ARROW-2432

If we use Arrow 0.9.0, the test case (None as decimal) fails as below:

{code}
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/tests.py", line 4672, in 
test_vectorized_udf_null_decimal
self.assertEquals(df.collect(), res.collect())
  File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect
sock_info = self._jdf.collectToPython()
  File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
  File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, 
in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o51.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in 
stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 (TID 
7, localhost, executor driver): org.apache.spark.api.python.PythonException: 
Traceback (most recent call last):
  File "/.../spark/python/pyspark/worker.py", line 320, in main
process()
  File "/.../spark/python/pyspark/worker.py", line 315, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream
batch = _create_batch(series, self._timezone)
  File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch
arrs = [create_array(s, t) for s, t in series]
  File "/.../spark/python/pyspark/serializers.py", line 241, in create_array
return pa.Array.from_pandas(s, mask=mask, type=t)
  File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
  File "array.pxi", line 177, in pyarrow.lib.array
  File "error.pxi", line 77, in pyarrow.lib.check_status
  File "error.pxi", line 77, in pyarrow.lib.check_status
ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
object of type NoneType but can only handle these types: decimal.Decimal
{code} 
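For reference, here is a minimal sketch of the failing scenario, assuming an active SparkSession named {{spark}} (the column name and decimal precision are illustrative, not taken from the original test):

{code:python}
from decimal import Decimal

import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import DecimalType

df = spark.range(4)

# A Pandas UDF that returns None for some rows of a decimal column.
to_decimal = pandas_udf(
    lambda s: pd.Series([Decimal(int(i)) if i % 2 == 0 else None for i in s]),
    DecimalType(38, 18))

# With PyArrow 0.9.0 this fails inside pa.Array.from_pandas() with the
# ArrowInvalid error shown in the traceback above; the goal of this issue is to
# make the conversion tolerate None (null) values for decimal columns.
df.select(to_decimal(col("id"))).collect()
{code}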






[jira] [Updated] (SPARK-23334) Fix pandas_udf with return type StringType() to handle str type properly in Python 2.

2018-07-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23334:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22216

> Fix pandas_udf with return type StringType() to handle str type properly in 
> Python 2.
> -
>
> Key: SPARK-23334
> URL: https://issues.apache.org/jira/browse/SPARK-23334
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Blocker
> Fix For: 2.3.0
>
>
> In Python 2, when a pandas_udf tries to return a string-type value created in the 
> udf with {{".."}}, the execution fails. E.g.,
> {code:java}
> from pyspark.sql.functions import pandas_udf, col
> import pandas as pd
> df = spark.range(10)
> str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
> df.select(str_f(col('id'))).show()
> {code}
> raises the following exception:
> {code}
> ...
> java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: 
> expected StringType, got BinaryType
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.(ArrowEvalPythonExec.scala:93)
> ...
> {code}
> It seems pyarrow ignores the {{type}} parameter of {{pa.Array.from_pandas()}} 
> and treats the values as binary type when the requested type is string type and the 
> values are {{str}} instead of {{unicode}} in Python 2.
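A small sketch of the suspected pyarrow behaviour (Python 2 semantics; whether the requested type is honoured depends on the pyarrow version, so treat this as illustrative only):

{code:python}
import pandas as pd
import pyarrow as pa

# In Python 2, "%s" % i produces str (bytes). With the affected pyarrow
# versions, the resulting Arrow array may come back as binary even though
# string() was requested, which is what triggers the assertion above.
s = pd.Series([b"0", b"1", b"2"])
arr = pa.Array.from_pandas(s, type=pa.string())
print(arr.type)
{code}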






[jira] [Assigned] (SPARK-24820) Fail fast when submitted job contains PartitionPruningRDD in a barrier stage

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24820:


Assignee: (was: Apache Spark)

> Fail fast when submitted job contains PartitionPruningRDD in a barrier stage
> 
>
> Key: SPARK-24820
> URL: https://issues.apache.org/jira/browse/SPARK-24820
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Detect when SparkContext.runJob() launches a barrier stage that includes a 
> PartitionPruningRDD.






[jira] [Commented] (SPARK-24820) Fail fast when submitted job contains PartitionPruningRDD in a barrier stage

2018-07-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563109#comment-16563109
 ] 

Apache Spark commented on SPARK-24820:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/21927

> Fail fast when submitted job contains PartitionPruningRDD in a barrier stage
> 
>
> Key: SPARK-24820
> URL: https://issues.apache.org/jira/browse/SPARK-24820
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Detect when SparkContext.runJob() launches a barrier stage that includes a 
> PartitionPruningRDD.






[jira] [Assigned] (SPARK-24820) Fail fast when submitted job contains PartitionPruningRDD in a barrier stage

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24820:


Assignee: Apache Spark

> Fail fast when submitted job contains PartitionPruningRDD in a barrier stage
> 
>
> Key: SPARK-24820
> URL: https://issues.apache.org/jira/browse/SPARK-24820
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Major
>
> Detect when SparkContext.runJob() launches a barrier stage that includes a 
> PartitionPruningRDD.






[jira] [Commented] (SPARK-24946) PySpark - Allow np.Arrays and pd.Series in df.approxQuantile

2018-07-30 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563052#comment-16563052
 ] 

Hyukjin Kwon commented on SPARK-24946:
--

Hmmm, mind if I ask for a discussion on the dev mailing list after Spark 2.4.0 is 
released (since committers are busy with the release at the moment)? This one 
specific case may be trivial, but I am worried about whether we should consider 
allowing all the other cases as well.

> PySpark - Allow np.Arrays and pd.Series in df.approxQuantile
> 
>
> Key: SPARK-24946
> URL: https://issues.apache.org/jira/browse/SPARK-24946
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Paul Westenthanner
>Priority: Minor
>  Labels: DataFrame, beginner, pyspark
>
> As a Python user, it is convenient to pass a numpy array or pandas Series to 
> `{{approxQuantile}}(_col_, _probabilities_, _relativeError_)` for the 
> probabilities parameter. 
>  
> Especially for creating cumulative plots (say in 1% steps) it is handy to use 
> `approxQuantile(col, np.arange(0, 1.0, 0.01), relativeError)`.
>  
>  
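Today the probabilities argument has to be a plain list of Python floats, so the usual workaround is an explicit conversion; the request is to accept the numpy array or pandas Series directly. A sketch, assuming a DataFrame {{df}} with a numeric column named {{value}} (both names are illustrative):

{code:python}
import numpy as np

probs = np.arange(0.0, 1.0, 0.01)

# Works today: convert the numpy array into a list of Python floats.
quantiles = df.approxQuantile("value", [float(p) for p in probs], 0.01)

# Proposed convenience: pass the numpy array (or a pandas Series) as-is.
# quantiles = df.approxQuantile("value", probs, 0.01)
{code}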






[jira] [Created] (SPARK-24975) Spark history server REST API /api/v1/version returns error 404

2018-07-30 Thread shanyu zhao (JIRA)
shanyu zhao created SPARK-24975:
---

 Summary: Spark history server REST API /api/v1/version returns 
error 404
 Key: SPARK-24975
 URL: https://issues.apache.org/jira/browse/SPARK-24975
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1, 2.3.0
Reporter: shanyu zhao


The Spark history server REST API provides /api/v1/version, according to the docs:

[https://spark.apache.org/docs/latest/monitoring.html]

However, for Spark 2.3, we see:
{code:java}
curl http://localhost:18080/api/v1/version

Error 404 Not Found

HTTP ERROR 404
Problem accessing /api/v1/version. Reason:
    Not Found

Powered by Jetty:// 9.3.z-SNAPSHOT
{code}
On a Spark 2.2 cluster, we see:
{code:java}
curl http://localhost:18080/api/v1/version
{
"spark" : "2.2.0"
}{code}
 






[jira] [Resolved] (SPARK-23633) Update Pandas UDFs section in sql-programming-guide

2018-07-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23633.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21887
[https://github.com/apache/spark/pull/21887]

>  Update Pandas UDFs section in sql-programming-guide
> 
>
> Key: SPARK-23633
> URL: https://issues.apache.org/jira/browse/SPARK-23633
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> Let's make sure sql-programming-guide is up-to-date before 2.4 release.
> https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#pandas-udfs-aka-vectorized-udfs






[jira] [Assigned] (SPARK-23633) Update Pandas UDFs section in sql-programming-guide

2018-07-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23633:


Assignee: Li Jin

>  Update Pandas UDFs section in sql-programming-guide
> 
>
> Key: SPARK-23633
> URL: https://issues.apache.org/jira/browse/SPARK-23633
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> Let's make sure sql-programming-guide is up-to-date before 2.4 release.
> https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#pandas-udfs-aka-vectorized-udfs






[jira] [Updated] (SPARK-24937) Datasource partition table should load empty static partitions

2018-07-30 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24937:

Summary: Datasource partition table should load empty static partitions  
(was: Datasource partition table should load empty partitions)

> Datasource partition table should load empty static partitions
> --
>
> Key: SPARK-24937
> URL: https://issues.apache.org/jira/browse/SPARK-24937
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> spark-sql> CREATE TABLE tbl AS SELECT 1;
> spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING)
>  > USING parquet
>  > PARTITIONED BY (day, hour);
> spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') 
> SELECT * FROM tbl where 1=0;
> spark-sql> SHOW PARTITIONS tbl1;
> spark-sql> CREATE TABLE tbl2 (c1 BIGINT)
>  > PARTITIONED BY (day STRING, hour STRING);
> 18/07/26 22:49:20 WARN HiveMetaStore: Location: 
> file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external 
> table:tbl2
> spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') 
> SELECT * FROM tbl where 1=0;
> 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2
> 18/07/26 22:49:36 WARN log: Updated size to 0
> spark-sql> SHOW PARTITIONS tbl2;
> day=2018-07-25/hour=01
> spark-sql> 
> {code}






[jira] [Updated] (SPARK-24882) data source v2 API improvement

2018-07-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24882:

Description: 
Data source V2 is out for a while, see the SPIP 
[here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
 We have already migrated most of the built-in streaming data sources to the V2 
API, and the file source migration is in progress. During the migration, we 
found several problems and want to address them before we stabilize the V2 API.

To solve these problems, we need to separate responsibilities in the data 
source v2 API, isolate the stateful part of the API, and think of better naming 
for some of the interfaces. For details, please see the attached Google doc: 
https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing

  was:
Data source V2 is out for a while, see the SPIP 
[here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
 We have already migrated most of the built-in streaming data sources to the V2 
API, and the file source migration is in progress. During the migration, we 
found several problems and want to address them before we stabilize the V2 API.

To solve these problems, we need to separate responsibilities in the data 
source v2 API and isolate the stateful parts. For details, please see the attached 
Google doc: 
https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing


> data source v2 API improvement
> --
>
> Key: SPARK-24882
> URL: https://issues.apache.org/jira/browse/SPARK-24882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> Data source V2 is out for a while, see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 API, isolate the stateful part of the API, and think of better naming 
> for some of the interfaces. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing






[jira] [Updated] (SPARK-24882) data source v2 API improvement

2018-07-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24882:

Description: 
Data source V2 is out for a while, see the SPIP 
[here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
 We have already migrated most of the built-in streaming data sources to the V2 
API, and the file source migration is in progress. During the migration, we 
found several problems and want to address them before we stabilize the V2 API.

To solve these problems, we need to separate responsibilities in the data 
source v2 API and isolate the stateful parts. For details, please see the attached 
Google doc: 
https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing

  was:
Data source V2 is out for a while, see the SPIP 
[here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
 We have already migrated most of the built-in streaming data sources to the V2 
API, and the file source migration is in progress. During the migration, we 
found several problems and want to address them before we stabilize the V2 API.

To solve these problems, we need to separate responsibilities in the data 
source v2 read API. For details, please see the attached Google doc: 
https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing


> data source v2 API improvement
> --
>
> Key: SPARK-24882
> URL: https://issues.apache.org/jira/browse/SPARK-24882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> Data source V2 is out for a while, see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 API and isolate the stateful parts. For details, please see the attached 
> Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing






[jira] [Updated] (SPARK-24882) data source v2 API improvement

2018-07-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24882:

Summary: data source v2 API improvement  (was: separate responsibilities of 
the data source v2 read API)

> data source v2 API improvement
> --
>
> Key: SPARK-24882
> URL: https://issues.apache.org/jira/browse/SPARK-24882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> Data source V2 is out for a while, see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 read API. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing






[jira] [Assigned] (SPARK-24952) Support LZMA2 compression by Avro datasource

2018-07-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24952:


Assignee: Maxim Gekk

> Support LZMA2 compression by Avro datasource
> 
>
> Key: SPARK-24952
> URL: https://issues.apache.org/jira/browse/SPARK-24952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> LZMA2 (XZ) has a much better compression ratio compared to the currently 
> supported snappy and deflate codecs. The underlying Avro library already 
> supports this compression codec. We need to set parameters for the codec and allow 
> users to specify "xz" compression via AvroOptions.
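A sketch of how this might look from PySpark once wired up (the option and configuration names below follow the description but are assumptions, since the exact surface is defined by the pull request):

{code:python}
# Assumes Spark 2.4's built-in Avro data source is on the classpath.
spark.conf.set("spark.sql.avro.compression.codec", "xz")

df.write.format("avro").mode("overwrite").save("/tmp/events_xz.avro")

# Or per write, via the data source option:
df.write.format("avro").option("compression", "xz").save("/tmp/events_xz_opt.avro")
{code}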






[jira] [Resolved] (SPARK-24952) Support LZMA2 compression by Avro datasource

2018-07-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24952.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21902
[https://github.com/apache/spark/pull/21902]

> Support LZMA2 compression by Avro datasource
> 
>
> Key: SPARK-24952
> URL: https://issues.apache.org/jira/browse/SPARK-24952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> LZMA2 (XZ) has a much better compression ratio compared to the currently 
> supported snappy and deflate codecs. The underlying Avro library already 
> supports this compression codec. We need to set parameters for the codec and allow 
> users to specify "xz" compression via AvroOptions.






[jira] [Assigned] (SPARK-24972) PivotFirst could not handle pivot columns of complex types

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24972:


Assignee: Apache Spark

> PivotFirst could not handle pivot columns of complex types
> --
>
> Key: SPARK-24972
> URL: https://issues.apache.org/jira/browse/SPARK-24972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Assignee: Apache Spark
>Priority: Minor
>
> {{PivotFirst}} did not handle complex types for pivot columns properly. As 
> a result, the pivot column could not be matched with any pivot value, and 
> the query always returned an empty result.






[jira] [Commented] (SPARK-24972) PivotFirst could not handle pivot columns of complex types

2018-07-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562659#comment-16562659
 ] 

Apache Spark commented on SPARK-24972:
--

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/21926

> PivotFirst could not handle pivot columns of complex types
> --
>
> Key: SPARK-24972
> URL: https://issues.apache.org/jira/browse/SPARK-24972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Priority: Minor
>
> {{PivotFirst}} did not handle complex types for pivot columns properly. As 
> a result, the pivot column could not be matched with any pivot value, and 
> the query always returned an empty result.






[jira] [Assigned] (SPARK-24972) PivotFirst could not handle pivot columns of complex types

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24972:


Assignee: (was: Apache Spark)

> PivotFirst could not handle pivot columns of complex types
> --
>
> Key: SPARK-24972
> URL: https://issues.apache.org/jira/browse/SPARK-24972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Priority: Minor
>
> {{PivotFirst}} did not handle complex types for pivot columns properly. As 
> a result, the pivot column could not be matched with any pivot value, and 
> the query always returned an empty result.






[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2018-07-30 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562638#comment-16562638
 ] 

holdenk commented on SPARK-24579:
-

[~mengxr] How about you just open comments up in general and then turn it off if 
spam becomes a problem?

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> 
>
> Key: SPARK-24579
> URL: https://issues.apache.org/jira/browse/SPARK-24579
> Project: Spark
>  Issue Type: Epic
>  Components: ML, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen
> Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange 
> between Apache Spark and DL%2FAI Frameworks .pdf
>
>
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark 
> as a unified
> analytics engine and the rise of AI frameworks like TensorFlow and Apache 
> MXNet (incubating).
> Both big data and AI are indispensable components to drive business 
> innovation and there have
> been multiple attempts from both communities to bring them together.
> We saw efforts from AI community to implement data solutions for AI 
> frameworks like tf.data and tf.Transform. However, with 50+ data sources and 
> built-in SQL, DataFrames, and Streaming features, Spark remains the community 
> choice for big data. This is why we saw many efforts to integrate DL/AI 
> frameworks with Spark to leverage its power, for example, TFRecords data 
> source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project 
> Hydrogen, this SPIP takes a different angle at Spark + AI unification.
> None of the integrations are possible without exchanging data between Spark 
> and external DL/AI frameworks. And the performance matters. However, there 
> doesn’t exist a standard way to exchange data and hence implementation and 
> performance optimization fall into pieces. For example, TensorFlowOnSpark 
> uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and 
> save data and pass the RDD records to TensorFlow in Python. And TensorFrames 
> converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s 
> Java API. How can we reduce the complexity?
> The proposal here is to standardize the data exchange interface (or format) 
> between Spark and DL/AI frameworks and optimize data conversion from/to this 
> interface.  So DL/AI frameworks can leverage Spark to load data virtually 
> from anywhere without spending extra effort building complex data solutions, 
> like reading features from a production data warehouse or streaming model 
> inference. Spark users can use DL/AI frameworks without learning specific 
> data APIs implemented there. And developers from both sides can work on 
> performance optimizations independently given the interface itself doesn’t 
> introduce big overhead.






[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2018-07-30 Thread koert kuipers (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562632#comment-16562632
 ] 

koert kuipers commented on SPARK-15516:
---

We also ran into this on columns that are not key columns, so I am not so sure 
this has anything to do with key columns.

It seems parquet uses StructType.merge, which supports type conversion 
for array types, map types, struct types, and decimal types; for any other 
types it simply requires them to be the same. Does anyone know why?
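A minimal way to reproduce the merge failure described above (paths and column names are illustrative; with the affected versions the read fails with the "Failed to merge incompatible data types" error from the stack trace below):

{code:python}
# Two parquet directories whose shared column "c" is written as int in one
# part and as long in the other.
spark.range(5).selectExpr("cast(id as int) as c") \
    .write.mode("overwrite").parquet("/tmp/merge_demo/part=1")
spark.range(5).selectExpr("cast(id as long) as c") \
    .write.mode("overwrite").parquet("/tmp/merge_demo/part=2")

# Schema merging then has to reconcile IntegerType with LongType via
# StructType.merge, which does not widen numeric types.
spark.read.option("mergeSchema", "true").parquet("/tmp/merge_demo").printSchema()
{code}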

> Schema merging in driver fails for parquet when merging LongType and 
> IntegerType
> 
>
> Key: SPARK-15516
> URL: https://issues.apache.org/jira/browse/SPARK-15516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Databricks
>Reporter: Hossein Falaki
>Priority: Major
>
> I tried to create a table from partitioned parquet directories that requires 
> schema merging. I get following error:
> {code}
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to merge incompatible data 
> types LongType and IntegerType
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418)
> at scala.Option.map(Option.scala:145)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415)
> at org.apache.spark.sql.types.StructType.merge(StructType.scala:333)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829)
> {code}
> cc @rxin and [~mengxr]






[jira] [Assigned] (SPARK-24973) Add numIter to Python ClusteringSummary

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24973:


Assignee: Apache Spark

> Add numIter to Python ClusteringSummary 
> 
>
> Key: SPARK-24973
> URL: https://issues.apache.org/jira/browse/SPARK-24973
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> -SPARK-23528- added numIter to ClusteringSummary. Will add numIter to Python 
> version of ClusteringSummary.
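A sketch of the intended usage once the property is exposed (the attribute name mirrors the Scala ClusteringSummary.numIter from SPARK-23528; {{training_df}} is an assumed DataFrame with a {{features}} vector column):

{code:python}
from pyspark.ml.clustering import KMeans

model = KMeans(k=2, seed=1).fit(training_df)

# Hypothetical once this issue is resolved: number of iterations the algorithm ran.
print(model.summary.numIter)
{code}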






[jira] [Assigned] (SPARK-24973) Add numIter to Python ClusteringSummary

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24973:


Assignee: (was: Apache Spark)

> Add numIter to Python ClusteringSummary 
> 
>
> Key: SPARK-24973
> URL: https://issues.apache.org/jira/browse/SPARK-24973
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> -SPARK-23528- added numIter to ClusteringSummary. Will add numIter to Python 
> version of ClusteringSummary.






[jira] [Commented] (SPARK-24973) Add numIter to Python ClusteringSummary

2018-07-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562624#comment-16562624
 ] 

Apache Spark commented on SPARK-24973:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21925

> Add numIter to Python ClusteringSummary 
> 
>
> Key: SPARK-24973
> URL: https://issues.apache.org/jira/browse/SPARK-24973
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> -SPARK-23528- added numIter to ClusteringSummary. Will add numIter to Python 
> version of ClusteringSummary.






[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-30 Thread andrzej.stankev...@gmail.com (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrzej.stankev...@gmail.com updated SPARK-24974:
-
Description: 
SharedInMemoryCache holds all file statuses no matter whether you specify partition 
columns or not. This causes long load times for queries that use only a couple of 
partitions, because Spark loads file paths for files from all partitions.

I partitioned files by *report_date* and *type* and I have a directory structure 
like 
{code:java}
/custom_path/report_date=2018-07-24/type=A/file_1.parquet
{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count
{code}
 

In my query I need to load only files of type *A*, and that is just a couple of 
files. But Spark loads all 19K files from all partitions into the 
SharedInMemoryCache, which takes about 60 seconds, and only after that does it 
throw away the unused partitions. 

  was:
SharedInMemoryCache holds all file statuses no matter whether you specify partition 
columns or not. This causes long load times for queries that use only a couple of 
partitions, because Spark loads file paths for files from all partitions.

I partitioned files by *report_date* and *type* and I have a directory structure 
like 
{code:java}
/custom_path/report_date=2018-07-24/type=A/file_1
{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count
{code}
 

In my query I need to load only files of type *A*, and that is just a couple of 
files. But Spark loads all 19K files from all partitions into the 
SharedInMemoryCache, which takes about 60 seconds, and only after that does it 
throw away the unused partitions. 


> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: andrzej.stankev...@gmail.com
>Priority: Major
>
> SharedInMemoryCache holds all file statuses no matter whether you specify 
> partition columns or not. This causes long load times for queries that use only 
> a couple of partitions, because Spark loads file paths for files from all 
> partitions.
> I partitioned files by *report_date* and *type* and I have a directory 
> structure like 
> {code:java}
> /custom_path/report_date=2018-07-24/type=A/file_1.parquet
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> In my query I need to load only files of type *A*, and that is just a couple of 
> files. But Spark loads all 19K files from all partitions into the 
> SharedInMemoryCache, which takes about 60 seconds, and only after that does it 
> throw away the unused partitions. 
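A possible workaround sketch while the listing behaves as described (paths are the ones from the report; the {{basePath}} option keeps the partition columns in the schema while letting the reader list only the narrower directory):

{code:python}
count = (spark.read
         .option("basePath", "/custom_path")
         .parquet("/custom_path/report_date=2018-07-24/type=A")
         .count())
{code}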






[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-30 Thread andrzej.stankev...@gmail.com (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrzej.stankev...@gmail.com updated SPARK-24974:
-
Description: 
SharedInMemoryCache holds all file statuses no matter whether you specify partition 
columns or not. This causes long load times for queries that use only a couple of 
partitions, because Spark loads file paths for files from all partitions.

I partitioned files by *type* and I have a directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1
{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count
{code}
 

In my query I need to load only files of type *A*, and that is just a couple of 
files. But Spark loads all 19K files from all partitions into the 
SharedInMemoryCache, which takes about 60 seconds, and only after that does it 
throw away the unused partitions. 

  was:
SharedInMemoryCache holds all file statuses no matter whether you specify partition 
columns or not. This causes long load times for queries that use only a couple of 
partitions, because Spark loads file paths for files from all partitions.

I partitioned files by *type* and I have a directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1
{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count
{code}
 

In my query I need to load only files of type *A*, and that is just a couple of 
files. But Spark loads all 19K files from all partitions into the 
SharedInMemoryCache, which takes about 60 seconds, and only after that does it 
throw away the unused partitions. 


> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: andrzej.stankev...@gmail.com
>Priority: Major
>
> SharedInMemoryCache holds all file statuses no matter whether you specify 
> partition columns or not. This causes long load times for queries that use only 
> a couple of partitions, because Spark loads file paths for files from all 
> partitions.
> I partitioned files by *type* and I have a directory structure like 
> {code:java}
> report_date=2018-07-24/type=A/file_1
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> In my query I need to load only files of type *A*, and that is just a couple of 
> files. But Spark loads all 19K files from all partitions into the 
> SharedInMemoryCache, which takes about 60 seconds, and only after that does it 
> throw away the unused partitions. 






[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-30 Thread andrzej.stankev...@gmail.com (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrzej.stankev...@gmail.com updated SPARK-24974:
-
Description: 
SharedInMemoryCache holds the FileStatus of every file, no matter whether you 
specify partition columns or not. This causes long load times for queries that 
touch only a couple of partitions, because Spark loads the file paths for files 
from all partitions.

I partitioned files by *report_date* and *type* and I have a directory structure 
like 
{code:java}
/custom_path/report_date=2018-07-24/type=A/file_1
{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count
{code}
 

In my query I need to load only files of type *A*, which is just a couple of 
files. But Spark loads all 19K files from all partitions into 
SharedInMemoryCache, which takes about 60 seconds, and only after that does it 
throw away the unused partitions. 

  was:
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by *type* and i has directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1
{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count
{code}
 

In my query i need to load only files of type *A* and it is just a couple of 
files. But spark load all 19K of files from all partitions into 
SharedInMemoryCache which takes about 60 secs and only after that throws unused 
partitions. 


> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: andrzej.stankev...@gmail.com
>Priority: Major
>
> SharedInMemoryCache holds the FileStatus of every file, no matter whether you 
> specify partition columns or not. This causes long load times for queries that 
> touch only a couple of partitions, because Spark loads the file paths for files 
> from all partitions.
> I partitioned files by *report_date* and *type* and I have a directory 
> structure like 
> {code:java}
> /custom_path/report_date=2018-07-24/type=A/file_1
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> In my query I need to load only files of type *A*, which is just a couple of 
> files. But Spark loads all 19K files from all partitions into 
> SharedInMemoryCache, which takes about 60 seconds, and only after that does it 
> throw away the unused partitions. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-30 Thread andrzej.stankev...@gmail.com (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrzej.stankev...@gmail.com updated SPARK-24974:
-
Description: 
SharedInMemoryCache holds the FileStatus of every file, no matter whether you 
specify partition columns or not. This causes long load times for queries that 
touch only a couple of partitions, because Spark loads the file paths for files 
from all partitions.

I partitioned files by *type* and I have a directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1

{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count

{code}
 

In my query I need to load only files of type *A*, which is just a couple of 
files. But Spark loads all 19K files from all partitions into 
SharedInMemoryCache, which takes about 60 seconds, and only after that does it 
throw away the unused partitions. 

  was:
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by *type* and i has directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1

{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count

{code}
 

In my query i need to load only files of type *A* and it is just a couple of 
files. But spark load all 19K of files from all partitions into 
SharedInMemoryCache which takes about 60 secs and only after that throws unused 
partitions.

 

 


> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: andrzej.stankev...@gmail.com
>Priority: Major
>
> SharedInMemoryCache holds the FileStatus of every file, no matter whether you 
> specify partition columns or not. This causes long load times for queries that 
> touch only a couple of partitions, because Spark loads the file paths for files 
> from all partitions.
> I partitioned files by *type* and I have a directory structure like 
> {code:java}
> report_date=2018-07-24/type=A/file_1
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> In my query I need to load only files of type *A*, which is just a couple of 
> files. But Spark loads all 19K files from all partitions into 
> SharedInMemoryCache, which takes about 60 seconds, and only after that does it 
> throw away the unused partitions. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-30 Thread andrzej.stankev...@gmail.com (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrzej.stankev...@gmail.com updated SPARK-24974:
-
Description: 
SharedInMemoryCache holds the FileStatus of every file, no matter whether you 
specify partition columns or not. This causes long load times for queries that 
touch only a couple of partitions, because Spark loads the file paths for files 
from all partitions.

I partitioned files by *type* and I have a directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1

{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count

{code}
 

In my query I need to load only files of type *A*, which is just a couple of 
files. But Spark loads all 19K files from all partitions into 
SharedInMemoryCache, which takes about 60 seconds, and only after that does it 
throw away the unused partitions.

 

 

  was:
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by *type* and i has directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1

{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count

{code}
 

In my query i need to load only files of type A and it is just couple of files. 
But spark load all 19K of files into SharedInMemoryCache which takes about 60 
secs and only after that throws unused partitions.

 

 


> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: andrzej.stankev...@gmail.com
>Priority: Major
>
> SharedInMemoryCache holds the FileStatus of every file, no matter whether you 
> specify partition columns or not. This causes long load times for queries that 
> touch only a couple of partitions, because Spark loads the file paths for files 
> from all partitions.
> I partitioned files by *type* and I have a directory structure like 
> {code:java}
> report_date=2018-07-24/type=A/file_1
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> In my query I need to load only files of type *A*, which is just a couple of 
> files. But Spark loads all 19K files from all partitions into 
> SharedInMemoryCache, which takes about 60 seconds, and only after that does it 
> throw away the unused partitions.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-30 Thread andrzej.stankev...@gmail.com (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrzej.stankev...@gmail.com updated SPARK-24974:
-
Description: 
SharedInMemoryCache holds the FileStatus of every file, no matter whether you 
specify partition columns or not. This causes long load times for queries that 
touch only a couple of partitions, because Spark loads the file paths for files 
from all partitions.

I partitioned files by type and I have a directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1

{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count

{code}
 

In my query I need to load only files of type A, which is just a couple of files. 
But Spark loads all 19K files into SharedInMemoryCache, which takes about 60 
seconds, and only after that does it throw away the unused partitions.

 

 

  was:
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by type and i has directory structure like 

{code}

{{report_date=2018-07-24/type=A/file_1}}

{code}

 

I am trying to execute 

{code}

{{val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count}}

{code}

 

In my query i need to load only files of type A and it is just couple of files. 
But spark load all 19K of files into SharedInMemoryCache which takes about 60 
secs and only after that throws unused partitions.

 

 


> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: andrzej.stankev...@gmail.com
>Priority: Major
>
> SharedInMemoryCache holds the FileStatus of every file, no matter whether you 
> specify partition columns or not. This causes long load times for queries that 
> touch only a couple of partitions, because Spark loads the file paths for files 
> from all partitions.
> I partitioned files by *type* and I have a directory structure like 
> {code:java}
> report_date=2018-07-24/type=A/file_1
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> In my query I need to load only files of type A, which is just a couple of 
> files. But Spark loads all 19K files into SharedInMemoryCache, which takes 
> about 60 seconds, and only after that does it throw away the unused partitions.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-30 Thread andrzej.stankev...@gmail.com (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrzej.stankev...@gmail.com updated SPARK-24974:
-
Description: 
SharedInMemoryCache holds the FileStatus of every file, no matter whether you 
specify partition columns or not. This causes long load times for queries that 
touch only a couple of partitions, because Spark loads the file paths for files 
from all partitions.

I partitioned files by *type* and I have a directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1

{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count

{code}
 

In my query I need to load only files of type A, which is just a couple of files. 
But Spark loads all 19K files into SharedInMemoryCache, which takes about 60 
seconds, and only after that does it throw away the unused partitions.

 

 

  was:
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by type and i has directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1

{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count

{code}
 

In my query i need to load only files of type A and it is just couple of files. 
But spark load all 19K of files into SharedInMemoryCache which takes about 60 
secs and only after that throws unused partitions.

 

 


> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: andrzej.stankev...@gmail.com
>Priority: Major
>
> SharedInMemoryCache holds the FileStatus of every file, no matter whether you 
> specify partition columns or not. This causes long load times for queries that 
> touch only a couple of partitions, because Spark loads the file paths for files 
> from all partitions.
> I partitioned files by *type* and I have a directory structure like 
> {code:java}
> report_date=2018-07-24/type=A/file_1
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> In my query I need to load only files of type A, which is just a couple of 
> files. But Spark loads all 19K files into SharedInMemoryCache, which takes 
> about 60 seconds, and only after that does it throw away the unused partitions.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-30 Thread andrzej.stankev...@gmail.com (JIRA)
andrzej.stankev...@gmail.com created SPARK-24974:


 Summary: Spark put all file's paths into SharedInMemoryCache even 
for unused partitions.
 Key: SPARK-24974
 URL: https://issues.apache.org/jira/browse/SPARK-24974
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: andrzej.stankev...@gmail.com


SharedInMemoryCache holds the FileStatus of every file, no matter whether you 
specify partition columns or not. This causes long load times for queries that 
touch only a couple of partitions, because Spark loads the file paths for files 
from all partitions.

I partitioned files by type and I have a directory structure like 

{code}
report_date=2018-07-24/type=A/file_1
{code}

 

I am trying to execute 

{code}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count
{code}

 

In my query I need to load only files of type A, which is just a couple of files. 
But Spark loads all 19K files into SharedInMemoryCache, which takes about 60 
seconds, and only after that does it throw away the unused partitions.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24973) Add numIter to Python ClusteringSummary

2018-07-30 Thread Huaxin Gao (JIRA)
Huaxin Gao created SPARK-24973:
--

 Summary: Add numIter to Python ClusteringSummary 
 Key: SPARK-24973
 URL: https://issues.apache.org/jira/browse/SPARK-24973
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.4.0
Reporter: Huaxin Gao


-SPARK-23528- added numIter to ClusteringSummary. This issue will add numIter to 
the Python version of ClusteringSummary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24972) PivotFirst could not handle pivot columns of complex types

2018-07-30 Thread Maryann Xue (JIRA)
Maryann Xue created SPARK-24972:
---

 Summary: PivotFirst could not handle pivot columns of complex types
 Key: SPARK-24972
 URL: https://issues.apache.org/jira/browse/SPARK-24972
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maryann Xue


{{PivotFirst}} did not handle complex types for pivot columns properly. As a 
result, the pivot column could not be matched with any pivot value, and the query 
always returned an empty result.
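A minimal sketch of the kind of query that exercises this path (the data, column names, and whether this exact shape triggers the bug are assumptions for illustration, not taken from the report):

{code:java}
// Pivot on a complex-typed (array<string>) column. With the bug described above,
// the pivot column never matches any collected pivot value, so the aggregated
// columns come back empty instead of containing the expected sums.
// In spark-shell, `spark` and its implicits are already in scope.
import spark.implicits._

val df = Seq((Seq("a"), 1), (Seq("b"), 2), (Seq("a"), 3)).toDF("key", "value")
df.groupBy().pivot("key").sum("value").show()
{code}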



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24963) Integration tests will fail if they run in a namespace not being the default

2018-07-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562592#comment-16562592
 ] 

Apache Spark commented on SPARK-24963:
--

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/21924

> Integration tests will fail if they run in a namespace not being the default
> 
>
> Key: SPARK-24963
> URL: https://issues.apache.org/jira/browse/SPARK-24963
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
> Fix For: 2.4.0
>
>
> Related discussion is here: 
> [https://github.com/apache/spark/pull/21748#pullrequestreview-141048893]
> If spark-rbac.yaml is used when tests are run locally, client-mode tests 
> will fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-07-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562521#comment-16562521
 ] 

Apache Spark commented on SPARK-24918:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/21923

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so you don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the API, and how it's versioned. 
>  Does it just get created when the executor starts, and that's it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, it's still good enough).
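To make the shape of the discussion concrete, here is a purely illustrative sketch of what such a developer API could look like; the trait name, methods, and no-op defaults are assumptions for discussion, not Spark's actual interface:

{code:java}
import org.apache.spark.annotation.DeveloperApi

// Hypothetical plugin contract: instantiated once per executor from a conf-listed
// class name, with no-op defaults so adding more events later stays source-compatible.
@DeveloperApi
trait ExecutorPlugin {
  /** Called once when the executor starts, before the first task runs. */
  def init(): Unit = {}

  /** Called when the executor shuts down. */
  def shutdown(): Unit = {}
}

// Example user plugin that just logs lifecycle events.
class MemoryMonitorPlugin extends ExecutorPlugin {
  override def init(): Unit = println("executor started: begin monitoring")
  override def shutdown(): Unit = println("executor stopping: flush metrics")
}
{code}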



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24918) Executor Plugin API

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24918:


Assignee: (was: Apache Spark)

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so you don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the API, and how it's versioned. 
>  Does it just get created when the executor starts, and that's it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, it's still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24918) Executor Plugin API

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24918:


Assignee: Apache Spark

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Major
>  Labels: memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so you don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the API, and how it's versioned. 
>  Does it just get created when the executor starts, and that's it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, it's still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24720) kafka transaction creates Non-consecutive Offsets (due to transaction offset) making streaming fail when failOnDataLoss=true

2018-07-30 Thread Quentin Ambard (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562464#comment-16562464
 ] 

Quentin Ambard commented on SPARK-24720:


Something is wrong with the way the consumers are closed. I'll submit a 
correction using the existing consumer cache instead of creating a new one. 

> kafka transaction creates Non-consecutive Offsets (due to transaction offset) 
> making streaming fail when failOnDataLoss=true
> 
>
> Key: SPARK-24720
> URL: https://issues.apache.org/jira/browse/SPARK-24720
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Quentin Ambard
>Priority: Major
>
> When Kafka transactions are used, sending 1 message to Kafka results in 1 
> offset for the data + 1 offset to mark the transaction.
> When the Kafka connector for Spark streaming reads a topic with non-consecutive 
> offsets, it leads to a failure. SPARK-17147 fixed this issue for compacted 
> topics.
>  However, SPARK-17147 doesn't fix this issue for Kafka transactions: if 1 
> message + 1 transaction commit are in a partition, Spark will try to read 
> offsets [0, 2). Offset 0 (containing the message) will be read, but offset 1 
> won't return a value, and buffer.hasNext() will be false even after a poll, 
> since no data is present for offset 1 (it's the transaction commit).
>  
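A minimal self-contained sketch of the failing pattern (offsets are simulated, there is no Kafka dependency, and the exact connector logic is an assumption for illustration):

{code:java}
// Spark asks for offsets [0, 2), but a transactional producer wrote one record at
// offset 0 and a commit marker at offset 1, so polling only ever returns offset 0.
object OffsetGapSketch {
  def main(args: Array[String]): Unit = {
    val requestedEnd = 2L
    val fetchedOffsets = Seq(0L)      // offset 1 is the transaction marker: no record

    var expected = 0L
    fetchedOffsets.foreach { offset =>
      if (offset != expected) {
        // with failOnDataLoss=true a gap like this is treated as data loss
        throw new IllegalStateException(s"missing offset $expected")
      }
      expected = offset + 1
    }
    // expected == 1 here, but the requested end is 2: the reader keeps polling for
    // offset 1, never gets a record, and eventually fails the batch.
    println(s"consumed up to offset ${expected - 1}, requested end $requestedEnd")
  }
}
{code}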



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24961) sort operation causes out of memory

2018-07-30 Thread Markus Breuer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562359#comment-16562359
 ] 

Markus Breuer commented on SPARK-24961:
---

I think it is an issue, and I explained why. I also posted to Stack Overflow 
(some days ago) and to the mailing list, but got no feedback on this issue. 
There may be no easy answer, but my example should help 
to reproduce it.

> sort operation causes out of memory 
> 
>
> Key: SPARK-24961
> URL: https://issues.apache.org/jira/browse/SPARK-24961
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.1
> Environment: Java 1.8u144+
> Windows 10
> Spark 2.3.1 in local mode
> -Xms4g -Xmx4g
> optional: -XX:+UseParallelOldGC 
>Reporter: Markus Breuer
>Priority: Major
>
> A sort operation on a large RDD - one that does not fit in memory - causes an 
> out-of-memory exception. I made the effect reproducible with a sample that 
> creates large objects of about 2 MB each. The OOM occurs when saving the result. I 
> tried several StorageLevels, but if memory is involved (MEMORY_AND_DISK, 
> MEMORY_AND_DISK_SER, none) the application runs out of memory. Only DISK_ONLY 
> seems to work.
> When replacing sort() with sortWithinPartitions(), no StorageLevel is required 
> and the application succeeds.
> {code:java}
> package de.bytefusion.examples;
> import breeze.storage.Storage;
> import de.bytefusion.utils.Options;
> import org.apache.hadoop.io.MapFile;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.SequenceFileOutputFormat;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructType;
> import org.apache.spark.storage.StorageLevel;
> import scala.Tuple2;
> import static org.apache.spark.sql.functions.*;
> import java.util.ArrayList;
> import java.util.List;
> import java.util.UUID;
> import java.util.stream.Collectors;
> import java.util.stream.IntStream;
> public class Example3 {
> public static void main(String... args) {
> // create spark session
> SparkSession spark = SparkSession.builder()
> .appName("example1")
> .master("local[4]")
> .config("spark.driver.maxResultSize","1g")
> .config("spark.driver.memory","512m")
> .config("spark.executor.memory","512m")
> .config("spark.local.dir","d:/temp/spark-tmp")
> .getOrCreate();
> JavaSparkContext sc = 
> JavaSparkContext.fromSparkContext(spark.sparkContext());
> // base to generate huge data
> List<Integer> list = new ArrayList<>();
> for (int val = 1; val < 1; val++) {
> int valueOf = Integer.valueOf(val);
> list.add(valueOf);
> }
> // create simple rdd of int
> JavaRDD<Integer> rdd = sc.parallelize(list, 200);
> // use map to create large object per row
> JavaRDD<Row> rowRDD =
> rdd
> .map(value -> 
> RowFactory.create(String.valueOf(value), 
> createLongText(UUID.randomUUID().toString(), 2 * 1024 * 1024)))
> // no persist => out of memory exception on write()
> // persist MEMORY_AND_DISK => out of memory exception 
> on write()
> // persist MEMORY_AND_DISK_SER => out of memory 
> exception on write()
> // persist(StorageLevel.DISK_ONLY())
> ;
> StructType type = new StructType();
> type = type
> .add("c1", DataTypes.StringType)
> .add( "c2", DataTypes.StringType );
> Dataset<Row> df = spark.createDataFrame(rowRDD, type);
> // works
> df.show();
> df = df
> .sort(col("c1").asc() )
> ;
> df.explain();
> // takes a lot of time but works
> df.show();
> // OutOfMemoryError: java heap space
> df
> .write()
> .mode("overwrite")
> .csv("d:/temp/my.csv");
> // OutOfMemoryError: java heap space
> df
> .toJavaRDD()
> .mapToPair(row -> new Tuple2<>(new Text(row.getString(0)), new 
> Text(row.getString(1))))
> .saveAsHadoopFile("d:\\temp\\foo", Text.class, Text.class, 
> SequenceFileOutputFormat.class );
> }
> private static String createLongText( String text, int minLength ) {
> StringBuffer sb = new 

[jira] [Resolved] (SPARK-24963) Integration tests will fail if they run in a namespace not being the default

2018-07-30 Thread Matt Cheah (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah resolved SPARK-24963.

   Resolution: Fixed
Fix Version/s: 2.4.0

> Integration tests will fail if they run in a namespace not being the default
> 
>
> Key: SPARK-24963
> URL: https://issues.apache.org/jira/browse/SPARK-24963
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
> Fix For: 2.4.0
>
>
> Related discussion is here: 
> [https://github.com/apache/spark/pull/21748#pullrequestreview-141048893]
> If spark-rbac.yaml is used when tests are run locally, client-mode tests 
> will fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24963) Integration tests will fail if they run in a namespace not being the default

2018-07-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562319#comment-16562319
 ] 

Apache Spark commented on SPARK-24963:
--

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/21900

> Integration tests will fail if they run in a namespace not being the default
> 
>
> Key: SPARK-24963
> URL: https://issues.apache.org/jira/browse/SPARK-24963
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> Related discussion is here: 
> [https://github.com/apache/spark/pull/21748#pullrequestreview-141048893]
> If spark-rbac.yaml is used when tests are run locally, client-mode tests 
> will fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24963) Integration tests will fail if they run in a namespace not being the default

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24963:


Assignee: (was: Apache Spark)

> Integration tests will fail if they run in a namespace not being the default
> 
>
> Key: SPARK-24963
> URL: https://issues.apache.org/jira/browse/SPARK-24963
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> Related discussion is here: 
> [https://github.com/apache/spark/pull/21748#pullrequestreview-141048893]
> If spark-rbac.yaml is used when tests are run locally, client-mode tests 
> will fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24963) Integration tests will fail if they run in a namespace not being the default

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24963:


Assignee: Apache Spark

> Integration tests will fail if they run in a namespace not being the default
> 
>
> Key: SPARK-24963
> URL: https://issues.apache.org/jira/browse/SPARK-24963
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Assignee: Apache Spark
>Priority: Minor
>
> Related discussion is here: 
> [https://github.com/apache/spark/pull/21748#pullrequestreview-141048893]
> If spark-rbac.yaml is used when tests are run locally, client-mode tests 
> will fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24971) remove SupportsDeprecatedScanRow

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24971:


Assignee: Wenchen Fan  (was: Apache Spark)

> remove SupportsDeprecatedScanRow
> 
>
> Key: SPARK-24971
> URL: https://issues.apache.org/jira/browse/SPARK-24971
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24971) remove SupportsDeprecatedScanRow

2018-07-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562149#comment-16562149
 ] 

Apache Spark commented on SPARK-24971:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21921

> remove SupportsDeprecatedScanRow
> 
>
> Key: SPARK-24971
> URL: https://issues.apache.org/jira/browse/SPARK-24971
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24971) remove SupportsDeprecatedScanRow

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24971:


Assignee: Apache Spark  (was: Wenchen Fan)

> remove SupportsDeprecatedScanRow
> 
>
> Key: SPARK-24971
> URL: https://issues.apache.org/jira/browse/SPARK-24971
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24971) remove SupportsDeprecatedScanRow

2018-07-30 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-24971:
---

 Summary: remove SupportsDeprecatedScanRow
 Key: SPARK-24971
 URL: https://issues.apache.org/jira/browse/SPARK-24971
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24956) Upgrade maven from 3.3.9 to 3.5.4

2018-07-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562141#comment-16562141
 ] 

Apache Spark commented on SPARK-24956:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/21920

> Upgrade maven from 3.3.9 to 3.5.4
> -
>
> Key: SPARK-24956
> URL: https://issues.apache.org/jira/browse/SPARK-24956
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.4.0
>
>
> Maven 3.3.9 looks pretty old. It would be good to upgrade this to the latest.
> As suggested in SPARK-24895, the current Maven will hit a problem with some 
> plugins.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24934) Complex type and binary type in in-memory partition pruning does not work due to missing upper/lower bounds cases

2018-07-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24934:
-
Affects Version/s: 2.3.1

> Complex type and binary type in in-memory partition pruning does not work due 
> to missing upper/lower bounds cases
> -
>
> Key: SPARK-24934
> URL: https://issues.apache.org/jira/browse/SPARK-24934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>  Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
>
> For example, if an array column is used (where the lower and upper bounds for its 
> column batch are {{null}}), it ends up wrongly filtering all data out:
> {code}
> scala> import org.apache.spark.sql.functions
> import org.apache.spark.sql.functions
> scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol")
> df: org.apache.spark.sql.DataFrame = [arrayCol: array<string>]
> scala> 
> df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), 
> functions.lit("b")))).show()
> ++
> |arrayCol|
> ++
> |  [a, b]|
> ++
> scala> 
> df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"),
>  functions.lit("b")))).show()
> ++
> |arrayCol|
> ++
> ++
> {code}
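An illustrative sketch (not Spark's actual pruning code) of why missing bounds must be treated as "the batch might contain the value" rather than as a reason to discard it:

{code:java}
// Stats-based pruning keeps a batch only if the value can fall between the batch's
// min/max. When the bounds were never populated (as for complex and binary columns),
// the safe answer is "might contain"; otherwise every batch is wrongly discarded.
def batchMightContain(lower: Option[String], upper: Option[String], v: String): Boolean =
  (lower, upper) match {
    case (Some(lo), Some(hi)) => lo <= v && v <= hi   // normal min/max pruning
    case _                    => true                 // no stats: must keep the batch
  }
{code}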



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24934) Complex type and binary type in in-memory partition pruning does not work due to missing upper/lower bounds cases

2018-07-30 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562126#comment-16562126
 ] 

Hyukjin Kwon commented on SPARK-24934:
--

I think this has been a bug from the beginning. It at least affects 2.3.1. I 
manually tested:

{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql.functions
import org.apache.spark.sql.functions

scala>

scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol")
df: org.apache.spark.sql.DataFrame = [arrayCol: array<string>]

scala> 
df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), 
functions.lit("b")))).show()
++
|arrayCol|
++
|  [a, b]|
++


scala> 
df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"),
 functions.lit("b")))).show()
++
|arrayCol|
++
++
{code}

> Complex type and binary type in in-memory partition pruning does not work due 
> to missing upper/lower bounds cases
> -
>
> Key: SPARK-24934
> URL: https://issues.apache.org/jira/browse/SPARK-24934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>  Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
>
> For example, if an array column is used (where the lower and upper bounds for its 
> column batch are {{null}}), it ends up wrongly filtering all data out:
> {code}
> scala> import org.apache.spark.sql.functions
> import org.apache.spark.sql.functions
> scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol")
> df: org.apache.spark.sql.DataFrame = [arrayCol: array<string>]
> scala> 
> df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), 
> functions.lit("b")))).show()
> ++
> |arrayCol|
> ++
> |  [a, b]|
> ++
> scala> 
> df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"),
>  functions.lit("b")))).show()
> ++
> |arrayCol|
> ++
> ++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24882) separate responsibilities of the data source v2 read API

2018-07-30 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562114#comment-16562114
 ] 

Wenchen Fan commented on SPARK-24882:
-

[~rdblue] I do agree that creating a catalog via reflection and then using 
catalog to create a `ReadSupport` instance is cleaner. But the problem is then 
we need to make `CatalogSupport` a must-have for data sources instead of an 
optional plugin. How about we rename the old `ReadSupport` to 
`ReadSupportProvider` for data sources that don't have a catalog? It works like 
a dynamic constructor of `ReadSupport` so that Spark can create `ReadSupport` 
by reflection.

For the builder issue, I'm ok with adding a `ScanConfigBuilder` that is mutable 
and can mix in the `SupportsPushdownXYZ` traits, to make `ScanConfig` 
immutable. I think this model is simpler: The `ScanConfigBuilder` tracks all 
the pushed operators, checks the current status and gives feedback to Spark 
about the next operator pushdown. We can design a pure builder-like pushdown 
API for `ScanConfigBuilder`  later. We need to support more operators pushdown 
to evaluate the design, so it seems safer to keep the pushdown API unchanged 
for now. What do you think?
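For reference, a rough sketch of the "mutable builder, immutable ScanConfig" shape being discussed; the interface names follow the comment above, but the exact signatures are assumptions, not the actual proposal:

{code:java}
// The builder mixes in the pushdown traits and mutates internal state; build()
// snapshots that state into an immutable ScanConfig, so nothing can change between
// estimateStatistics and planInputPartitions.
trait Filter

trait ScanConfig {
  def pushedFilters: Seq[Filter]
}

trait SupportsPushDownFilters {
  def pushFilters(filters: Seq[Filter]): Seq[Filter]   // returns filters it cannot handle
}

class ScanConfigBuilder extends SupportsPushDownFilters {
  private var pushed: Seq[Filter] = Nil

  override def pushFilters(filters: Seq[Filter]): Seq[Filter] = {
    pushed = filters        // pretend the source can push everything
    Nil
  }

  def build(): ScanConfig = {
    val snapshot = pushed
    new ScanConfig { val pushedFilters: Seq[Filter] = snapshot }
  }
}
{code}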

> separate responsibilities of the data source v2 read API
> 
>
> Key: SPARK-24882
> URL: https://issues.apache.org/jira/browse/SPARK-24882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> Data source V2 has been out for a while; see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 read API. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24882) separate responsibilities of the data source v2 read API

2018-07-30 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562058#comment-16562058
 ] 

Ryan Blue edited comment on SPARK-24882 at 7/30/18 4:02 PM:


[~cloud_fan], when you say that "ReadSupport is created via reflection and 
should be very light-weight", that raises a red flag for me. As I've mentioned 
in the discussions on table catalog support, I think that we should instantiate 
the catalog through reflection, not the data source. Starting with the data 
source is backward and only works now because we have just one global catalog. 
In my catalog support PR, the catalog is what gets instantiated via reflection. 
In that model, the initialization you need would be in the Table instance's 
life-cycle and the catalog would return tables that implement ReadSupport. So, 
I think we should consider the impact of multiple catalog support.

I think that if we go with the catalog proposal, it would make this change 
cleaner because we would be able to remove the extra DataSourceReader level. 
This would also end up working like a lot of other Table and Scan interfaces. 
Iceberg, for example, has a Table that you call newScan on to get a 
configurable TableScan. That's very similar to how this API would end up: Table 
implements ReadSupport, ReadSupport exposes newScanConfig. Getting a Reader 
that doesn't actually read data just doesn't make sense.

For the builder discussion: Does the current proposal work for the 
{{Filter(Limit(Filter(Scan)))}} case? I don't think it does because 
implementations expect predicates to be pushed just once. I think that these 
cases should probably be handled by a more generic push-down that passes parts 
of the plan tree. If that's the case, then the builder is simpler and immutable.

I'd rather go with the builder for now to set the expectation that ScanConfig 
is immutable. We *could* add an intermediate class, like 
{{ConfigurableScanConfig}}, that exposes the {{SupportsXYZ}} pushdown 
interfaces. Then, we could add a method to {{ConfigurableScanConfig}} to get a 
final, immutable {{ScanConfig}}. But, that's really just a builder with a more 
difficult to implement interface.

In the event that we do find pushdown operations that are incompatible with a 
builder – and not incompatible with both a builder and the current proposal 
like your example – then we can always add a new way to build or configure a 
{{ScanConfig}} later. The important thing is that the ScanConfig is immutable 
to provide the guarantees that you want in this API: that the ScanConfig won't 
change between calls to {{estimateStatistics}} and {{planInputPartitions}}.


was (Author: rdblue):
[~cloud_fan], when you say that "ReadSupport is created via reflection and 
should be very light-weight", that raises a red flag for me. As I've mentioned 
in the discussions on table catalog support, I think that we should instantiate 
the catalog through reflection, not the data source. Starting with the data 
source is backward and only works now because we have just one global catalog. 
In my catalog support PR, the catalog is what gets instantiated via reflection. 
In that model, the initialization you need would be in the Table instance's 
life-cycle and the catalog would return tables that implement ReadSupport. So, 
I think we should consider the impact of multiple catalog support.

I think that if we go with the catalog proposal, it would make this change 
cleaner because we would be able to remove the extra DataSourceReader level. 
This would also end up working like a lot of other Table and Scan interfaces. 
Iceberg, for example, has a Table that you call newScan on to get a 
configurable TableScan. That's very similar to how this API would end up: Table 
implements ReadSupport, ReadSupport exposes newScanConfig. Getting a Reader 
that doesn't really read data just doesn't make sense.

For the builder discussion: Does the current proposal work for the 
{{Filter(Limit(Filter(Scan)))}} case? I don't think it does because 
implementations expect predicates to be pushed just once. I think that these 
cases should probably be handled by a more generic push-down that passes parts 
of the plan tree. If that's the case, then the builder is simpler and immutable.

I'd rather go with the builder for now to set the expectation that ScanConfig 
is immutable. We *could* add an intermediate class, like 
{{ConfigurableScanConfig}}, that exposes the {{SupportsXYZ}} pushdown 
interfaces. Then, we could add a method to {{ConfigurableScanConfig}} to get a 
final, immutable {{ScanConfig}}. But, that's really just a builder with a more 
difficult to implement interface.

In the event that we do find pushdown operations that are incompatible with a 
builder – and not incompatible with both a builder and the current proposal 
like your example – then we can always add a new way to build or 

[jira] [Commented] (SPARK-24882) separate responsibilities of the data source v2 read API

2018-07-30 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562058#comment-16562058
 ] 

Ryan Blue commented on SPARK-24882:
---

[~cloud_fan], when you say that "ReadSupport is created via reflection and 
should be very light-weight", that raises a red flag for me. As I've mentioned 
in the discussions on table catalog support, I think that we should instantiate 
the catalog through reflection, not the data source. Starting with the data 
source is backward and only works now because we have just one global catalog. 
In my catalog support PR, the catalog is what gets instantiated via reflection. 
In that model, the initialization you need would be in the Table instance's 
life-cycle and the catalog would return tables that implement ReadSupport. So, 
I think we should consider the impact of multiple catalog support.

I think that if we go with the catalog proposal, it would make this change 
cleaner because we would be able to remove the extra DataSourceReader level. 
This would also end up working like a lot of other Table and Scan interfaces. 
Iceberg, for example, has a Table that you call newScan on to get a 
configurable TableScan. That's very similar to how this API would end up: Table 
implements ReadSupport, ReadSupport exposes newScanConfig. Getting a Reader 
that doesn't really read data just doesn't make sense.

For the builder discussion: Does the current proposal work for the 
{{Filter(Limit(Filter(Scan)))}} case? I don't think it does because 
implementations expect predicates to be pushed just once. I think that these 
cases should probably be handled by a more generic push-down that passes parts 
of the plan tree. If that's the case, then the builder is simpler and immutable.

I'd rather go with the builder for now to set the expectation that ScanConfig 
is immutable. We *could* add an intermediate class, like 
{{ConfigurableScanConfig}}, that exposes the {{SupportsXYZ}} pushdown 
interfaces. Then, we could add a method to {{ConfigurableScanConfig}} to get a 
final, immutable {{ScanConfig}}. But, that's really just a builder with a more 
difficult to implement interface.

In the event that we do find pushdown operations that are incompatible with a 
builder – and not incompatible with both a builder and the current proposal 
like your example – then we can always add a new way to build or configure a 
{{ScanConfig}} later. The important thing is that the ScanConfig is immutable 
to provide the guarantees that you want in this API: that the ScanConfig won't 
change between calls to {{estimateStatistics}} and {{planInputPartitions}}.

> separate responsibilities of the data source v2 read API
> 
>
> Key: SPARK-24882
> URL: https://issues.apache.org/jira/browse/SPARK-24882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> Data source V2 has been out for a while; see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 read API. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24933) SinkProgress should report written rows

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24933:


Assignee: (was: Apache Spark)

> SinkProgress should report written rows
> ---
>
> Key: SPARK-24933
> URL: https://issues.apache.org/jira/browse/SPARK-24933
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Vaclav Kosar
>Priority: Major
>
> SinkProgress should report similar properties to SourceProgress, as long as 
> they are available for the given Sink. The count of written rows is a metric 
> available for all Sinks. Since the relevant progress information concerns 
> committed rows, the ideal object to carry this info is WriterCommitMessage. For 
> brevity, the implementation will focus only on Sinks with API V2 and on Micro 
> Batch mode. The implementation for Continuous mode will be provided at a later date.
> h4. Before
> {code}
> {"description":"org.apache.spark.sql.kafka010.KafkaSourceProvider@3c0bd317"}
> {code}
> h4. After
> {code}
> {"description":"org.apache.spark.sql.kafka010.KafkaSourceProvider@3c0bd317","numOutputRows":5000}
> {code}
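
For context, a minimal sketch of how an application could observe this metric once it is reported, assuming it lands in the sink section of the progress JSON as shown above. The listener API below already exists in Spark; {{numOutputRows}} is only what this ticket proposes to add:

{code}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class SinkRowsListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // The "sink" section of this JSON currently only carries "description";
    // after this change it would also carry "numOutputRows".
    println(event.progress.json)
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}

// spark.streams.addListener(new SinkRowsListener())
{code}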



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24933) SinkProgress should report written rows

2018-07-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562044#comment-16562044
 ] 

Apache Spark commented on SPARK-24933:
--

User 'vackosar' has created a pull request for this issue:
https://github.com/apache/spark/pull/21919

> SinkProgress should report written rows
> ---
>
> Key: SPARK-24933
> URL: https://issues.apache.org/jira/browse/SPARK-24933
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Vaclav Kosar
>Priority: Major
>
> SinkProgress should report similar properties to SourceProgress, as long as 
> they are available for the given Sink. The count of written rows is a metric 
> available for all Sinks. Since the relevant progress information concerns 
> committed rows, the ideal object to carry this info is WriterCommitMessage. For 
> brevity, the implementation will focus only on Sinks with API V2 and on Micro 
> Batch mode. The implementation for Continuous mode will be provided at a later date.
> h4. Before
> {code}
> {"description":"org.apache.spark.sql.kafka010.KafkaSourceProvider@3c0bd317"}
> {code}
> h4. After
> {code}
> {"description":"org.apache.spark.sql.kafka010.KafkaSourceProvider@3c0bd317","numOutputRows":5000}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24933) SinkProgress should report written rows

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24933:


Assignee: Apache Spark

> SinkProgress should report written rows
> ---
>
> Key: SPARK-24933
> URL: https://issues.apache.org/jira/browse/SPARK-24933
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Vaclav Kosar
>Assignee: Apache Spark
>Priority: Major
>
> SinkProgress should report similar properties to SourceProgress, as long as 
> they are available for the given Sink. The count of written rows is a metric 
> available for all Sinks. Since the relevant progress information concerns 
> committed rows, the ideal object to carry this info is WriterCommitMessage. For 
> brevity, the implementation will focus only on Sinks with API V2 and on Micro 
> Batch mode. The implementation for Continuous mode will be provided at a later date.
> h4. Before
> {code}
> {"description":"org.apache.spark.sql.kafka010.KafkaSourceProvider@3c0bd317"}
> {code}
> h4. After
> {code}
> {"description":"org.apache.spark.sql.kafka010.KafkaSourceProvider@3c0bd317","numOutputRows":5000}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24821) Fail fast when submitted job compute on a subset of all the partitions for a barrier stage

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24821:


Assignee: (was: Apache Spark)

> Fail fast when submitted job compute on a subset of all the partitions for a 
> barrier stage
> --
>
> Key: SPARK-24821
> URL: https://issues.apache.org/jira/browse/SPARK-24821
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Detect when SparkContext.runJob() launches a barrier stage with a subset of all 
> the partitions; one example is the `first()` operation.
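
For illustration, a small sketch (assuming a live SparkContext {{sc}} and the 2.4 barrier API) of the kind of job this check should reject:

{code}
// first() submits a job over only a subset of partitions, which a barrier
// stage cannot support, so this should now fail fast at job submission.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
  .barrier()
  .mapPartitions(iter => iter.map(_ * 2))

rdd.first()
{code}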



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24821) Fail fast when submitted job compute on a subset of all the partitions for a barrier stage

2018-07-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562031#comment-16562031
 ] 

Apache Spark commented on SPARK-24821:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/21918

> Fail fast when submitted job compute on a subset of all the partitions for a 
> barrier stage
> --
>
> Key: SPARK-24821
> URL: https://issues.apache.org/jira/browse/SPARK-24821
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Detect when SparkContext.runJob() launches a barrier stage with a subset of all 
> the partitions; one example is the `first()` operation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24821) Fail fast when submitted job compute on a subset of all the partitions for a barrier stage

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24821:


Assignee: Apache Spark

> Fail fast when submitted job compute on a subset of all the partitions for a 
> barrier stage
> --
>
> Key: SPARK-24821
> URL: https://issues.apache.org/jira/browse/SPARK-24821
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Major
>
> Detect when SparkContext.runJob() launches a barrier stage with a subset of all 
> the partitions; one example is the `first()` operation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24720) kafka transaction creates Non-consecutive Offsets (due to transaction offset) making streaming fail when failOnDataLoss=true

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24720:


Assignee: Apache Spark

> kafka transaction creates Non-consecutive Offsets (due to transaction offset) 
> making streaming fail when failOnDataLoss=true
> 
>
> Key: SPARK-24720
> URL: https://issues.apache.org/jira/browse/SPARK-24720
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Quentin Ambard
>Assignee: Apache Spark
>Priority: Major
>
> When kafka transactions are used, sending 1 message to kafka results in 1 
> offset for the data + 1 offset to mark the transaction.
> When the kafka connector for spark streaming reads a topic with non-consecutive 
> offsets, it leads to a failure. SPARK-17147 fixed this issue for compacted 
> topics.
>  However, SPARK-17147 doesn't fix this issue for kafka transactions: if 1 
> message + 1 transaction commit are in a partition, spark will try to read 
> offsets [0, 2). Offset 0 (containing the message) will be read, but offset 1 
> won't return a value and buffer.hasNext() will be false even after a poll, 
> since no data is present for offset 1 (it's the transaction commit).
>  
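
To illustrate the gap, a minimal producer sketch using the plain kafka-clients API rather than Spark (assumes a local broker and an existing topic "t"):

{code}
// Committing a transaction writes a control marker, so one record consumes two
// offsets and a consumer sees an offset it can never read as data.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("transactional.id", "demo-tx")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.initTransactions()
producer.beginTransaction()
producer.send(new ProducerRecord[String, String]("t", "k", "v"))  // takes offset 0
producer.commitTransaction()                                      // marker takes offset 1
producer.close()
// The next readable data offset is 2, so a [0, 2) fetch finds nothing at offset 1.
{code}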



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24720) kafka transaction creates Non-consecutive Offsets (due to transaction offset) making streaming fail when failOnDataLoss=true

2018-07-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562000#comment-16562000
 ] 

Apache Spark commented on SPARK-24720:
--

User 'QuentinAmbard' has created a pull request for this issue:
https://github.com/apache/spark/pull/21917

> kafka transaction creates Non-consecutive Offsets (due to transaction offset) 
> making streaming fail when failOnDataLoss=true
> 
>
> Key: SPARK-24720
> URL: https://issues.apache.org/jira/browse/SPARK-24720
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Quentin Ambard
>Priority: Major
>
> When kafka transactions are used, sending 1 message to kafka results in 1 
> offset for the data + 1 offset to mark the transaction.
> When the kafka connector for spark streaming reads a topic with non-consecutive 
> offsets, it leads to a failure. SPARK-17147 fixed this issue for compacted 
> topics.
>  However, SPARK-17147 doesn't fix this issue for kafka transactions: if 1 
> message + 1 transaction commit are in a partition, spark will try to read 
> offsets [0, 2). Offset 0 (containing the message) will be read, but offset 1 
> won't return a value and buffer.hasNext() will be false even after a poll, 
> since no data is present for offset 1 (it's the transaction commit).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24720) kafka transaction creates Non-consecutive Offsets (due to transaction offset) making streaming fail when failOnDataLoss=true

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24720:


Assignee: (was: Apache Spark)

> kafka transaction creates Non-consecutive Offsets (due to transaction offset) 
> making streaming fail when failOnDataLoss=true
> 
>
> Key: SPARK-24720
> URL: https://issues.apache.org/jira/browse/SPARK-24720
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Quentin Ambard
>Priority: Major
>
> When kafka transactions are used, sending 1 message to kafka results in 1 
> offset for the data + 1 offset to mark the transaction.
> When the kafka connector for spark streaming reads a topic with non-consecutive 
> offsets, it leads to a failure. SPARK-17147 fixed this issue for compacted 
> topics.
>  However, SPARK-17147 doesn't fix this issue for kafka transactions: if 1 
> message + 1 transaction commit are in a partition, spark will try to read 
> offsets [0, 2). Offset 0 (containing the message) will be read, but offset 1 
> won't return a value and buffer.hasNext() will be false even after a poll, 
> since no data is present for offset 1 (it's the transaction commit).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24958) Report executors' process tree total memory information to heartbeat signals

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24958:


Assignee: Apache Spark

> Report executors' process tree total memory information to heartbeat signals
> 
>
> Key: SPARK-24958
> URL: https://issues.apache.org/jira/browse/SPARK-24958
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reza Safi
>Assignee: Apache Spark
>Priority: Major
>
> Spark executors' process tree total memory information can be really useful. 
> Currently such information is not available. The goal of this task is to 
> compute this information for each executor, add it to the 
> heartbeat signals, and compute the peaks at the driver. 
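
As one possible illustration (an assumption, not necessarily the approach this task will take), a Linux-only sketch that sums resident memory of the current JVM and its direct children by reading /proc:

{code}
import java.lang.management.ManagementFactory
import scala.io.Source
import scala.util.Try

// VmRSS of a single process, in kB (0 if unreadable).
def rssKb(pid: Long): Long = Try {
  Source.fromFile(s"/proc/$pid/status").getLines()
    .find(_.startsWith("VmRSS:"))
    .map(_.split("\\s+")(1).toLong)
    .getOrElse(0L)
}.getOrElse(0L)

// Direct children of a process (requires /proc/<pid>/task/<pid>/children).
def childPids(pid: Long): Seq[Long] = Try {
  Source.fromFile(s"/proc/$pid/task/$pid/children").mkString.trim
    .split("\\s+").filter(_.nonEmpty).map(_.toLong).toSeq
}.getOrElse(Seq.empty)

// "pid@host" is JVM-specific but works on HotSpot / Java 8.
val selfPid = ManagementFactory.getRuntimeMXBean.getName.split("@")(0).toLong
val treeRssKb = rssKb(selfPid) + childPids(selfPid).map(rssKb).sum
println(s"process tree RSS: $treeRssKb kB")  // a value like this would ride on the heartbeat
{code}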



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24958) Report executors' process tree total memory information to heartbeat signals

2018-07-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561990#comment-16561990
 ] 

Apache Spark commented on SPARK-24958:
--

User 'rezasafi' has created a pull request for this issue:
https://github.com/apache/spark/pull/21916

> Report executors' process tree total memory information to heartbeat signals
> 
>
> Key: SPARK-24958
> URL: https://issues.apache.org/jira/browse/SPARK-24958
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reza Safi
>Priority: Major
>
> Spark executors' process tree total memory information can be really useful. 
> Currently such information is not available. The goal of this task is to 
> compute this information for each executor, add it to the 
> heartbeat signals, and compute the peaks at the driver. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24958) Report executors' process tree total memory information to heartbeat signals

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24958:


Assignee: (was: Apache Spark)

> Report executors' process tree total memory information to heartbeat signals
> 
>
> Key: SPARK-24958
> URL: https://issues.apache.org/jira/browse/SPARK-24958
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reza Safi
>Priority: Major
>
> Spark executors' process tree total memory information can be really useful. 
> Currently such information is not available. The goal of this task is to 
> compute this information for each executor, add it to the 
> heartbeat signals, and compute the peaks at the driver. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24582) Design: Barrier execution mode

2018-07-30 Thread Jiang Xingbo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiang Xingbo resolved SPARK-24582.
--
Resolution: Fixed

> Design: Barrier execution mode
> --
>
> Key: SPARK-24582
> URL: https://issues.apache.org/jira/browse/SPARK-24582
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> [~jiangxb1987] and [~cloud_fan] outlined a design sketch in SPARK-24375, 
> which covers some basic scenarios. This story is for a formal design of the 
> barrier execution mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24581) Design: BarrierTaskContext.barrier()

2018-07-30 Thread Jiang Xingbo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiang Xingbo resolved SPARK-24581.
--
Resolution: Fixed

> Design: BarrierTaskContext.barrier()
> 
>
> Key: SPARK-24581
> URL: https://issues.apache.org/jira/browse/SPARK-24581
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> We need to provide a communication barrier function to users to help 
> coordinate tasks within a barrier stage. This is very similar to the MPI_Barrier 
> function in MPI. This story is for its design.
>  
> Requirements:
>  * Low-latency. The tasks should be unblocked soon after all tasks have 
> reached this barrier. The latency is more important than CPU cycles here.
>  * Support unlimited timeout with proper logging. DL tasks might take 
> very long to converge, so we should support an unlimited timeout with proper 
> logging, so users know why a task is waiting.
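
A minimal usage sketch of what this would look like from the user side (assumes a live SparkContext {{sc}}; {{BarrierTaskContext}} and {{RDD.barrier()}} are the APIs from the barrier execution mode work):

{code}
import org.apache.spark.BarrierTaskContext

sc.parallelize(1 to 100, 4)
  .barrier()
  .mapPartitions { iter =>
    val ctx = BarrierTaskContext.get()
    // ... per-partition work, e.g. launching a DL worker ...
    ctx.barrier()   // blocks until every task in the stage reaches this point
    iter
  }
  .collect()
{code}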



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2018-07-30 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22814.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.4.0

> JDBC support date/timestamp type as partitionColumn
> ---
>
> Key: SPARK-22814
> URL: https://issues.apache.org/jira/browse/SPARK-22814
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.1
>Reporter: Yuechen Chen
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 2.4.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In spark, you can partition MySQL queries by partitionColumn.
> val df = (spark.read.jdbc(url=jdbcUrl,
> table="employees",
> columnName="emp_no",
> lowerBound=1L,
> upperBound=10L,
> numPartitions=100,
> connectionProperties=connectionProperties))
> display(df)
> But partitionColumn must be a numeric column from the table.
> However, there are lots of tables which have no primary key but do have 
> date/timestamp indexes.
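
For reference, a sketch of the option-based reader this change enables; the table, column, and bounds below are made-up examples (with this change, partitionColumn may be a date/timestamp column and the bounds are date/timestamp literals):

{code}
val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)                     // same jdbcUrl as above
  .option("dbtable", "employees")
  .option("partitionColumn", "hire_date")     // hypothetical date column
  .option("lowerBound", "1985-01-01")
  .option("upperBound", "2018-01-01")
  .option("numPartitions", "12")
  .load()
{code}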



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24954:


Assignee: (was: Apache Spark)

> Fail fast on job submit if run a barrier stage with dynamic resource 
> allocation enabled
> ---
>
> Key: SPARK-24954
> URL: https://issues.apache.org/jira/browse/SPARK-24954
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Blocker
>
> Since we explicitly listed "Support running barrier stage with dynamic 
> resource allocation" as a Non-Goal in the design doc, we shall fail fast on job 
> submit if a barrier stage is run with dynamic resource allocation enabled, 
> to avoid some confusing behaviors (refer to SPARK-24942 for some 
> examples).
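
A small sketch of the user-facing consequence ({{spark.dynamicAllocation.enabled}} is the standard config key; the rest is illustrative): a barrier job would need dynamic allocation switched off, otherwise job submission should now fail fast.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("barrier-job")
  .config("spark.dynamicAllocation.enabled", "false")  // required for barrier stages
  .getOrCreate()

spark.sparkContext
  .parallelize(1 to 100, 4)
  .barrier()
  .mapPartitions(iter => iter)
  .collect()
{code}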



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled

2018-07-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561985#comment-16561985
 ] 

Apache Spark commented on SPARK-24954:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/21915

> Fail fast on job submit if run a barrier stage with dynamic resource 
> allocation enabled
> ---
>
> Key: SPARK-24954
> URL: https://issues.apache.org/jira/browse/SPARK-24954
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Blocker
>
> Since we explicitly listed "Support running barrier stage with dynamic 
> resource allocation" as a Non-Goal in the design doc, we shall fail fast on job 
> submit if a barrier stage is run with dynamic resource allocation enabled, 
> to avoid some confusing behaviors (refer to SPARK-24942 for some 
> examples).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24954:


Assignee: Apache Spark

> Fail fast on job submit if run a barrier stage with dynamic resource 
> allocation enabled
> ---
>
> Key: SPARK-24954
> URL: https://issues.apache.org/jira/browse/SPARK-24954
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Blocker
>
> Since we explicitly listed "Support running barrier stage with dynamic 
> resource allocation" as a Non-Goal in the design doc, we shall fail fast on job 
> submit if a barrier stage is run with dynamic resource allocation enabled, 
> to avoid some confusing behaviors (refer to SPARK-24942 for some 
> examples).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-07-30 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24771.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 2.4.0

> Upgrade AVRO version from 1.7.7 to 1.8
> --
>
> Key: SPARK-24771
> URL: https://issues.apache.org/jira/browse/SPARK-24771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24965) Spark SQL fails when reading a partitioned hive table with different formats per partition

2018-07-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561976#comment-16561976
 ] 

Apache Spark commented on SPARK-24965:
--

User 'krisgeus' has created a pull request for this issue:
https://github.com/apache/spark/pull/21893

> Spark SQL fails when reading a partitioned hive table with different formats 
> per partition
> --
>
> Key: SPARK-24965
> URL: https://issues.apache.org/jira/browse/SPARK-24965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Kris Geusebroek
>Priority: Major
>  Labels: pull-request-available
>
> When a Hive Parquet partitioned table contains a partition with a different 
> format (Avro, for example), a select * fails with a read exception (the Avro 
> file is not a Parquet file).
> Selecting in Hive works as expected.
> To support this, new SQL syntax also needs to be supported:
>  * ALTER TABLE <table> SET FILEFORMAT <format>
> This is included in the same PR since the unit test needs this to set up the 
> test data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24965) Spark SQL fails when reading a partitioned hive table with different formats per partition

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24965:


Assignee: Apache Spark

> Spark SQL fails when reading a partitioned hive table with different formats 
> per partition
> --
>
> Key: SPARK-24965
> URL: https://issues.apache.org/jira/browse/SPARK-24965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Kris Geusebroek
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> When a Hive Parquet partitioned table contains a partition with a different 
> format (Avro, for example), a select * fails with a read exception (the Avro 
> file is not a Parquet file).
> Selecting in Hive works as expected.
> To support this, new SQL syntax also needs to be supported:
>  * ALTER TABLE <table> SET FILEFORMAT <format>
> This is included in the same PR since the unit test needs this to set up the 
> test data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24965) Spark SQL fails when reading a partitioned hive table with different formats per partition

2018-07-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24965:


Assignee: (was: Apache Spark)

> Spark SQL fails when reading a partitioned hive table with different formats 
> per partition
> --
>
> Key: SPARK-24965
> URL: https://issues.apache.org/jira/browse/SPARK-24965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Kris Geusebroek
>Priority: Major
>  Labels: pull-request-available
>
> When a Hive Parquet partitioned table contains a partition with a different 
> format (Avro, for example), a select * fails with a read exception (the Avro 
> file is not a Parquet file).
> Selecting in Hive works as expected.
> To support this, new SQL syntax also needs to be supported:
>  * ALTER TABLE <table> SET FILEFORMAT <format>
> This is included in the same PR since the unit test needs this to set up the 
> test data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24934) Complex type and binary type in in-memory partition pruning does not work due to missing upper/lower bounds cases

2018-07-30 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561944#comment-16561944
 ] 

Thomas Graves commented on SPARK-24934:
---

What are the real affected versions here? Since it went into Spark 2.3.2, does 
it affect 2.3.1?

> Complex type and binary type in in-memory partition pruning does not work due 
> to missing upper/lower bounds cases
> -
>
> Key: SPARK-24934
> URL: https://issues.apache.org/jira/browse/SPARK-24934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>  Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
>
> For example, if an array is used (where the lower and upper bounds for its 
> column batch are {{null}}), it wrongly filters all data out:
> {code}
> scala> import org.apache.spark.sql.functions
> import org.apache.spark.sql.functions
> scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol")
> df: org.apache.spark.sql.DataFrame = [arrayCol: array<string>]
> scala> 
> df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), 
> functions.lit("b")))).show()
> +--------+
> |arrayCol|
> +--------+
> |  [a, b]|
> +--------+
> scala> 
> df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"),
>  functions.lit("b")))).show()
> +--------+
> |arrayCol|
> +--------+
> +--------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24970) Spark Kinesis streaming application fails to recover from streaming checkpoint due to ProvisionedThroughputExceededException

2018-07-30 Thread bruce_zhao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bruce_zhao updated SPARK-24970:
---
Description: 
 

We're using Spark streaming to consume Kinesis data, and found that it reads 
more data from Kinesis and easily hits 
ProvisionedThroughputExceededException *when it recovers from a streaming 
checkpoint*. 

 

Normally, it's a WARN in the spark log. But when we have multiple streaming 
applications (e.g., 5 applications) consuming the same Kinesis stream, the 
situation becomes serious. *The application will fail to recover due to the 
following exception in the driver.*  And one application failure will also affect 
the other running applications. 

  
{panel:title=Exception}
org.apache.spark.SparkException: Job aborted due to stage failure: 
{color:#ff}*Task 5 in stage 7.0 failed 4 times, most recent 
failure*:{color} Lost task 5.3 in stage 7.0 (TID 128, 
ip-172-31-14-36.ap-northeast-1.compute.internal, executor 1): 
org.apache.spark.SparkException: Gave up after 3 retries while getting records 
using shard iterator, last exception: at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288)
 at scala.Option.getOrElse(Option.scala:121) at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecordsAndNextKinesisIterator(KinesisBackedBlockRDD.scala:223)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:207)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133)
 at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at 
scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
 at 
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954) at 
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)

Caused by: 
*{color:#ff}com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException:
 Rate exceeded for shard shardId- in stream rellfsstream-an under 
account 1{color}.* (Service: AmazonKinesis; Status Code: 400; Error 
Code: ProvisionedThroughputExceededException; Request ID: 
d3520677-060e-14c4-8014-2886b6b75f03) at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1257)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1029)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647)
 at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511) at 
com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2219)
 at 
com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2195)
 at 
com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1004)
 at 
com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:980)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$2.apply(KinesisBackedBlockRDD.scala:224)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$2.apply(KinesisBackedBlockRDD.scala:224)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:269)
{panel}
 

After checking the source code, we found it calls getBlockFromKinesis() to recover 
data, and in this function it accesses Kinesis directly to read data. Since all 
partitions in the BlockRDD will access Kinesis, and AWS Kinesis only supports 5 
concurrent reads per shard per second, it hits 
ProvisionedThroughputExceededException easily. Even though the code does some retries, 
it's still easy to fail when 

[jira] [Assigned] (SPARK-24967) Use internal.Logging instead for logging

2018-07-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24967:
---

Assignee: Hyukjin Kwon

> Use internal.Logging instead for logging
> 
>
> Key: SPARK-24967
> URL: https://issues.apache.org/jira/browse/SPARK-24967
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.4.0
>
>
> Looks like AvroFileFormat directly uses {{getLogger}}. It should use 
> {{internal.Logging}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24967) Use internal.Logging instead for logging

2018-07-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24967.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21914
[https://github.com/apache/spark/pull/21914]

> Use internal.Logging instead for logging
> 
>
> Key: SPARK-24967
> URL: https://issues.apache.org/jira/browse/SPARK-24967
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.4.0
>
>
> Looks like AvroFileFormat directly uses {{getLogger}}. It should use 
> {{internal.Logging}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24957.
-
   Resolution: Fixed
Fix Version/s: 2.3.2
   2.4.0

Issue resolved by pull request 21910
[https://github.com/apache/spark/pull/21910]

> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Assignee: Marco Gaido
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.0, 2.3.2
>
>
> I noticed a bug when doing arithmetic on a dataframe containing decimal 
> values with codegen enabled.
> I tried to narrow it down to a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])
> scala> val df_grouped_2 = 
> df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_2.collect()
> res1: Array[org.apache.spark.sql.Row] = 
> Array([a,11948571.4285714285714285714286])
> scala> val df_total_sum = 
> df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]
> scala> df_total_sum.collect()
> res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
> {noformat}
> The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
> result of {{df_grouped_2}} is clearly incorrect (it is the value of the 
> correct result times {{10^14}}).
> When codegen is disabled all results are correct. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24957:
---

Assignee: Marco Gaido

> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Assignee: Marco Gaido
>Priority: Major
>  Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
>
> I noticed a bug when doing arithmetic on a dataframe containing decimal 
> values with codegen enabled.
> I tried to narrow it down to a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])
> scala> val df_grouped_2 = 
> df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_2.collect()
> res1: Array[org.apache.spark.sql.Row] = 
> Array([a,11948571.4285714285714285714286])
> scala> val df_total_sum = 
> df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]
> scala> df_total_sum.collect()
> res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
> {noformat}
> The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
> result of {{df_grouped_2}} is clearly incorrect (it is the value of the 
> correct result times {{10^14}}).
> When codegen is disabled all results are correct. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24957:

Labels: correctness  (was: )

> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>  Labels: correctness
>
> I noticed a bug when doing arithmetic on a dataframe containing decimal 
> values with codegen enabled.
> I tried to narrow it down to a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])
> scala> val df_grouped_2 = 
> df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_2.collect()
> res1: Array[org.apache.spark.sql.Row] = 
> Array([a,11948571.4285714285714285714286])
> scala> val df_total_sum = 
> df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]
> scala> df_total_sum.collect()
> res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
> {noformat}
> The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
> result of {{df_grouped_2}} is clearly incorrect (it is the value of the 
> correct result times {{10^14}}).
> When codegen is disabled all results are correct. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-07-30 Thread Jackey Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561798#comment-16561798
 ] 

Jackey Lee edited comment on SPARK-24630 at 7/30/18 11:38 AM:
--

For the Stream Table DDL, we have a better way to handle it (such as the following 
example).

Firstly, we use a more compatible way to handle Stream Tables, so users do not 
need to learn any extra syntax.

Secondly, the Stream Table kafka_sql_test can be used not only as a Source but 
also as a Sink. For example, user A can write stream data into kafka_sql_test, and user 
B can read from kafka_sql_test. This is a closed loop in data processing, and 
users can run the stream processing of the entire business smoothly through 
SQLStreaming.
Thirdly, the Stream Table created by this SQL behaves just like a Hive table. Users 
can even run batch queries on this Stream Table.
 Finally, handling STREAM tables this way means less code will be added or 
changed in the spark source.
{code:sql}
create table kafka_sql_test using kafka 
options(
table.isstreaming = 'true',
subscribe = 'topic', 
kafka.bootstrap.servers = 'localhost:9092')
{code}


was (Author: jackey lee):
For the Stream Table DDL, we have a better way to handle it (such as the following 
example).

Firstly, we use a more compatible way to handle Stream Tables, so users do not 
need to learn any extra syntax.
 Secondly, the Stream Table kafka_sql_test can be used not only as a Source but 
also as a Sink. For example, user A can write stream data into kafka_sql_test, and user 
B can read from kafka_sql_test. This is a closed loop in data processing, and 
users can run the stream processing of the entire business smoothly through 
SQLStreaming.
 Finally, handling STREAM tables this way means less code will be added or 
changed in the spark source.
{code:sql}
create table kafka_sql_test using kafka 
options(
table.isstreaming = 'true',
subscribe = 'topic', 
kafka.bootstrap.servers = 'localhost:9092')
{code}

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface, with which users 
> with little knowledge about streaming can easily develop a stream 
> processing model. In Spark, we can also support a SQL API based on 
> Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1, Analysis should be able to parse streaming-type SQL. 
> 2, Analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-07-30 Thread Jackey Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561798#comment-16561798
 ] 

Jackey Lee commented on SPARK-24630:


For the Stream Table DDL, we have a better way to handle it (such as the following 
example).

Firstly, we use a more compatible way to handle Stream Tables, so users do not 
need to learn any extra syntax.
 Secondly, the Stream Table kafka_sql_test can be used not only as a Source but 
also as a Sink. For example, user A can write stream data into kafka_sql_test, and user 
B can read from kafka_sql_test. This is a closed loop in data processing, and 
users can run the stream processing of the entire business smoothly through 
SQLStreaming.
 Finally, handling STREAM tables this way means less code will be added or 
changed in the spark source.
{code:sql}
create table kafka_sql_test using kafka 
options(
table.isstreaming = 'true',
subscribe = 'topic', 
kafka.bootstrap.servers = 'localhost:9092')
{code}

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface, with which users 
> with little knowledge about streaming can easily develop a stream 
> processing model. In Spark, we can also support a SQL API based on 
> Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1, Analysis should be able to parse streaming-type SQL. 
> 2, Analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-07-30 Thread Genmao Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561762#comment-16561762
 ] 

Genmao Yu edited comment on SPARK-24630 at 7/30/18 11:27 AM:
-

A practical attempt to add the StreamSQL DDL looks like this:
{code:java}
spark-sql> create source stream service_logs with 
("type"="kafka","bootstrap.servers"="x.x.x.x:9092", "topic"="service_log"); 
Time taken: 1.833 seconds 
spark-sql> desc service_logs;
key binary NULL 
value binary NULL 
topic string NULL 
partition int NULL 
offset bigint NULL 
timestamp timestamp NULL 
timestampType int NULL 
Time taken: 0.394 seconds, Fetched 7 row(s) 
spark-sql> create sink stream analysis with ("type"="kafka", 
"outputMode"="complete", "bootstrap.servers"="x.x.x.x:9092", 
"topic"="analysis", "checkpointLocation"="hdfs:///tmp/cp0"); 
Time taken: 0.027 seconds 
spark-sql> insert into analysis select key, value, count(*) from service_logs 
group by key, value; 
Time taken: 0.355 seconds


{code}


was (Author: unclegen):
Trying to add the StreamSQL DDL, like this:
{code:java}
spark-sql> create source stream service_logs with 
("type"="kafka","bootstrap.servers"="x.x.x.x:9092", "topic"="service_log"); 
Time taken: 1.833 seconds 
spark-sql> desc service_logs;
key binary NULL 
value binary NULL 
topic string NULL 
partition int NULL 
offset bigint NULL 
timestamp timestamp NULL 
timestampType int NULL 
Time taken: 0.394 seconds, Fetched 7 row(s) 
spark-sql> create sink stream analysis with ("type"="kafka", 
"outputMode"="complete", "bootstrap.servers"="x.x.x.x:9092", 
"topic"="analysis", "checkpointLocation"="hdfs:///tmp/cp0"); 
Time taken: 0.027 seconds 
spark-sql> insert into analysis select key, value, count(*) from service_logs 
group by key, value; 
Time taken: 0.355 seconds


{code}

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface, with which users 
> with little knowledge about streaming can easily develop a stream 
> processing model. In Spark, we can also support a SQL API based on 
> Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1, Analysis should be able to parse streaming-type SQL. 
> 2, Analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24946) PySpark - Allow np.Arrays and pd.Series in df.approxQuantile

2018-07-30 Thread Paul Westenthanner (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561784#comment-16561784
 ] 

Paul Westenthanner commented on SPARK-24946:


Yes, I agree that it's more sugar than necessity; however, I'd be happy to 
implement it.

I agree that we should just check whether it's iterable, and since a list 
is passed to py4j, we might just do something like

{{percentiles_list = [is_valid(v) for v in input_iterable]}}, where {{is_valid}} 
checks that all values are floats between 0 and 1.

 

If you agree that this could be useful, I'll be happy to create a pull request.

> PySpark - Allow np.Arrays and pd.Series in df.approxQuantile
> 
>
> Key: SPARK-24946
> URL: https://issues.apache.org/jira/browse/SPARK-24946
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Paul Westenthanner
>Priority: Minor
>  Labels: DataFrame, beginner, pyspark
>
> As a Python user it is convenient to pass a numpy array or pandas series to 
> `{{approxQuantile}}(_col_, _probabilities_, _relativeError_)` for the 
> probabilities parameter. 
>  
> Especially for creating cumulative plots (say, in 1% steps) it is handy to use 
> `approxQuantile(col, np.arange(0, 1.0, 0.01), relativeError)`.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24970) Spark Kinesis streaming application fails to recover from streaming checkpoint due to ProvisionedThroughputExceededException

2018-07-30 Thread bruce_zhao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bruce_zhao updated SPARK-24970:
---
Description: 
 

We're using Spark streaming to consume Kinesis data, and found that it reads 
more data from Kinesis and easily hits 
ProvisionedThroughputExceededException *when it recovers from a streaming 
checkpoint*. 

 

Normally, it's a WARN in the spark log. But when we have multiple streaming 
applications (e.g., 5 applications) consuming the same Kinesis stream, the 
situation becomes serious. *The application will fail to recover due to the 
following exception in the driver.*  And one application failure will also affect 
the other running applications. 

  
{panel:title=Exception}
org.apache.spark.SparkException: Job aborted due to stage failure: 
{color:#ff}*Task 5 in stage 7.0 failed 4 times, most recent 
failure*:{color} Lost task 5.3 in stage 7.0 (TID 128, 
ip-172-31-14-36.ap-northeast-1.compute.internal, executor 1): 
org.apache.spark.SparkException: Gave up after 3 retries while getting records 
using shard iterator, last exception: at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288)
 at scala.Option.getOrElse(Option.scala:121) at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecordsAndNextKinesisIterator(KinesisBackedBlockRDD.scala:223)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:207)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133)
 at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at 
scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
 at 
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954) at 
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269) 
Caused by: 
*{color:#ff}com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException:
 Rate exceeded for shard shardId- in stream rellfsstream-an under 
account 1{color}.* (Service: AmazonKinesis; Status Code: 400; Error 
Code: ProvisionedThroughputExceededException; Request ID: 
d3520677-060e-14c4-8014-2886b6b75f03) at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1257)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1029)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647)
 at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511) at 
com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2219)
 at 
com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2195)
 at 
com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1004)
 at 
com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:980)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$2.apply(KinesisBackedBlockRDD.scala:224)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$2.apply(KinesisBackedBlockRDD.scala:224)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:269)
{panel}
 

After checking the source code, we found that it calls getBlockFromKinesis() to 
recover the data, and in this function it accesses Kinesis directly. Since all 
partitions in the BlockRDD access Kinesis, and AWS Kinesis only supports 5 
concurrent reads per shard per second, it hits 
ProvisionedThroughputExceededException easily. Even though the code does some 
retries, it's still easy to fail when 
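
For illustration only, a minimal sketch of the kind of bounded retry with 
backoff around GetRecords that the retryOrTimeout frames in the stack trace 
refer to (this uses boto3 and hypothetical parameter values; it is not the 
actual Spark or KCL code):

{code:python}
import time

import boto3
from botocore.exceptions import ClientError

def get_records_with_backoff(kinesis, shard_iterator, max_retries=3, base_sleep=0.1):
    # Hypothetical illustration: retry GetRecords a bounded number of times,
    # backing off exponentially when the shard's read quota is exceeded.
    for attempt in range(max_retries + 1):
        try:
            return kinesis.get_records(ShardIterator=shard_iterator, Limit=10000)
        except ClientError as e:
            code = e.response["Error"]["Code"]
            if code != "ProvisionedThroughputExceededException" or attempt == max_retries:
                raise  # a different error, or we gave up after max_retries
            time.sleep(base_sleep * (2 ** attempt))

# Usage (shard_iterator obtained separately via get_shard_iterator):
# kinesis = boto3.client("kinesis", region_name="ap-northeast-1")
# records = get_records_with_backoff(kinesis, shard_iterator)
{code}

With several applications each recovering many BlockRDD partitions from the 
same shard, even this kind of bounded backoff is not enough to stay under the 
per-shard limit.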

[jira] [Created] (SPARK-24970) Spark Kinesis streaming application fails to recover from streaming checkpoint due to ProvisionedThroughputExceededException

2018-07-30 Thread bruce_zhao (JIRA)
bruce_zhao created SPARK-24970:
--

 Summary: Spark Kinesis streaming application fails to recover from 
streaming checkpoint due to ProvisionedThroughputExceededException
 Key: SPARK-24970
 URL: https://issues.apache.org/jira/browse/SPARK-24970
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.2.0
Reporter: bruce_zhao


We're using Spark Streaming to consume Kinesis data, and we found that it reads 
more data from Kinesis and easily hits 
ProvisionedThroughputExceededException *when it recovers from a streaming 
checkpoint*. 

 

Normally this is only a WARN in the Spark log. But when we have multiple 
streaming applications (e.g., 5 applications) consuming the same Kinesis stream, 
the situation becomes serious. *The application will fail to recover due to the 
following exception in the driver.* And one application's failure will also 
affect the other running applications. 

 

 
{panel:title=Exception}


org.apache.spark.SparkException: Job aborted due to stage failure: 
{color:#FF}*Task 5 in stage 7.0 failed 4 times, most recent 
failure*:{color} Lost task 5.3 in stage 7.0 (TID 128, 
ip-172-31-14-36.ap-northeast-1.compute.internal, executor 1): 
org.apache.spark.SparkException: Gave up after 3 retries while getting records 
using shard iterator, last exception: at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288)
 at scala.Option.getOrElse(Option.scala:121) at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecordsAndNextKinesisIterator(KinesisBackedBlockRDD.scala:223)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:207)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133)
 at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at 
scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
 at 
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954) at 
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269) 
Caused by: 
*{color:#FF}com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException:
 Rate exceeded for shard shardId- in stream rellfsstream-an under 
account 1{color}.* (Service: AmazonKinesis; Status Code: 400; Error 
Code: ProvisionedThroughputExceededException; Request ID: 
d3520677-060e-14c4-8014-2886b6b75f03) at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1257)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1029)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647)
 at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511) at 
com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2219)
 at 
com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2195)
 at 
com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1004)
 at 
com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:980)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$2.apply(KinesisBackedBlockRDD.scala:224)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$2.apply(KinesisBackedBlockRDD.scala:224)
 at 
org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:269)
{panel}
 

 

After checking the source code, we found that it calls getBlockFromKinesis() to 
recover the data, and in this function it accesses Kinesis directly to read 

[jira] [Updated] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-07-30 Thread Genmao Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Genmao Yu updated SPARK-24630:
--
Attachment: (was: image-2018-07-30-18-48-38-352.png)

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface with which users 
> with little knowledge about streaming can easily develop a stream processing 
> model. In Spark, we can also support a SQL API based on Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-07-30 Thread Genmao Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Genmao Yu updated SPARK-24630:
--
Attachment: (was: image-2018-07-30-18-06-30-506.png)

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface with which users 
> with little knowledge about streaming can easily develop a stream processing 
> model. In Spark, we can also support a SQL API based on Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-07-30 Thread Genmao Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561762#comment-16561762
 ] 

Genmao Yu edited comment on SPARK-24630 at 7/30/18 10:53 AM:
-

Try to add the StreamSQL DDL, like this:
{code:java}
spark-sql> create source stream service_logs with 
("type"="kafka","bootstrap.servers"="x.x.x.x:9092", "topic"="service_log"); 
Time taken: 1.833 seconds 
spark-sql> desc service_logs;
key binary NULL 
value binary NULL 
topic string NULL 
partition int NULL 
offset bigint NULL 
timestamp timestamp NULL 
timestampType int NULL 
Time taken: 0.394 seconds, Fetched 7 row(s) 
spark-sql> create sink stream analysis with ("type"="kafka", 
"outputMode"="complete", "bootstrap.servers"="x.x.x.x:9092", 
"topic"="analysis", "checkpointLocation"="hdfs:///tmp/cp0"); 
Time taken: 0.027 seconds 
spark-sql> insert into analysis select key, value, count(*) from service_logs 
group by key, value; 
Time taken: 0.355 seconds


{code}
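
For comparison, a rough sketch of what this DDL could map to with the existing 
Structured Streaming DataFrame API (the casts, option values, and JSON 
serialization below are assumptions, not part of the proposal):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct

spark = SparkSession.builder.appName("streamsql-sketch").getOrCreate()

# "create source stream service_logs ..." -> a streaming Kafka source
service_logs = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "x.x.x.x:9092")
    .option("subscribe", "service_log")
    .load()
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value"))

# "insert into analysis select key, value, count(*) ... group by key, value"
analysis = service_logs.groupBy("key", "value").count()

# "create sink stream analysis ..." -> a Kafka sink in complete output mode
query = (analysis
    .select(to_json(struct("key", "value", "count")).alias("value"))
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "x.x.x.x:9092")
    .option("topic", "analysis")
    .option("checkpointLocation", "hdfs:///tmp/cp0")
    .outputMode("complete")
    .start())
{code}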


was (Author: unclegen):
Try to add the StreamSQL DDL, like this:

!image-2018-07-30-18-48-38-352.png!

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf, 
> image-2018-07-30-18-06-30-506.png, image-2018-07-30-18-48-38-352.png
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface with which users 
> with little knowledge about streaming can easily develop a stream processing 
> model. In Spark, we can also support a SQL API based on Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-07-30 Thread Genmao Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561762#comment-16561762
 ] 

Genmao Yu commented on SPARK-24630:
---

Try to add the StreamSQL DDL, like this:

!image-2018-07-30-18-48-38-352.png!

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf, 
> image-2018-07-30-18-06-30-506.png, image-2018-07-30-18-48-38-352.png
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface with which users 
> with little knowledge about streaming can easily develop a stream processing 
> model. In Spark, we can also support a SQL API based on Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-07-30 Thread Genmao Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Genmao Yu updated SPARK-24630:
--
Attachment: image-2018-07-30-18-48-38-352.png

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf, 
> image-2018-07-30-18-06-30-506.png, image-2018-07-30-18-48-38-352.png
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface with which users 
> with little knowledge about streaming can easily develop a stream processing 
> model. In Spark, we can also support a SQL API based on Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-07-30 Thread Genmao Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Genmao Yu updated SPARK-24630:
--
Attachment: image-2018-07-30-18-06-30-506.png

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf, image-2018-07-30-18-06-30-506.png
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface with which users 
> with little knowledge about streaming can easily develop a stream processing 
> model. In Spark, we can also support a SQL API based on Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


