[jira] [Assigned] (SPARK-24966) Fix the precedence rule for set operations.

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24966:


Assignee: Apache Spark

> Fix the precedence rule for set operations.
> ---
>
> Key: SPARK-24966
> URL: https://issues.apache.org/jira/browse/SPARK-24966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Assignee: Apache Spark
>Priority: Major
>
> Currently the set operations INTERSECT, UNION and EXCEPT are all assigned the 
> same precedence. We need to change this so that INTERSECT is given higher 
> precedence than UNION and EXCEPT, while UNION and EXCEPT are evaluated in the 
> order they appear in the query, from left to right.
> Since this will result in a change in behavior, we need to keep it under a 
> config.
> Here is a reference:
> https://docs.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-except-and-intersect-transact-sql?view=sql-server-2017
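
For illustration, a minimal sketch of how the two rules group the same query (assuming a 
SparkSession named spark and three single-column tables t1, t2 and t3 already exist; the 
grouping shown in the comments follows the description above, and no specific config name 
is implied):

{code:scala}
// Under the current rule (all set operators share one precedence, left to right):
//   ((t1 UNION t2) INTERSECT t3)
// Under the proposed rule (INTERSECT binds tighter than UNION/EXCEPT):
//   (t1 UNION (t2 INTERSECT t3))
val ambiguous = spark.sql(
  """SELECT c FROM t1
    |UNION
    |SELECT c FROM t2
    |INTERSECT
    |SELECT c FROM t3""".stripMargin)

// Explicit parentheses give the same result under either rule:
val explicit = spark.sql(
  "SELECT c FROM t1 UNION (SELECT c FROM t2 INTERSECT SELECT c FROM t3)")
{code}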






[jira] [Commented] (SPARK-24966) Fix the precedence rule for set operations.

2018-07-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564772#comment-16564772
 ] 

Apache Spark commented on SPARK-24966:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/21941

> Fix the precedence rule for set operations.
> ---
>
> Key: SPARK-24966
> URL: https://issues.apache.org/jira/browse/SPARK-24966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Priority: Major
>
> Currently the set operations INTERSECT, UNION and EXCEPT are all assigned the 
> same precedence. We need to change this so that INTERSECT is given higher 
> precedence than UNION and EXCEPT, while UNION and EXCEPT are evaluated in the 
> order they appear in the query, from left to right.
> Since this will result in a change in behavior, we need to keep it under a 
> config.
> Here is a reference:
> https://docs.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-except-and-intersect-transact-sql?view=sql-server-2017






[jira] [Assigned] (SPARK-24966) Fix the precedence rule for set operations.

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24966:


Assignee: (was: Apache Spark)

> Fix the precedence rule for set operations.
> ---
>
> Key: SPARK-24966
> URL: https://issues.apache.org/jira/browse/SPARK-24966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Priority: Major
>
> Currently the set operations INTERSECT, UNION and EXCEPT are all assigned the 
> same precedence. We need to change this so that INTERSECT is given higher 
> precedence than UNION and EXCEPT, while UNION and EXCEPT are evaluated in the 
> order they appear in the query, from left to right.
> Since this will result in a change in behavior, we need to keep it under a 
> config.
> Here is a reference:
> https://docs.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-except-and-intersect-transact-sql?view=sql-server-2017






[jira] [Resolved] (SPARK-24951) Table valued functions should throw AnalysisException instead of IllegalArgumentException

2018-07-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24951.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Table valued functions should throw AnalysisException instead of 
> IllegalArgumentException
> -
>
> Key: SPARK-24951
> URL: https://issues.apache.org/jira/browse/SPARK-24951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
> Fix For: 2.4.0
>
>
> When arguments don't match, TVFs currently throw IllegalArgumentException; this is 
> inconsistent with other functions, which throw AnalysisException.
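
For illustration only, a small sketch of the user-visible contract change (assuming a 
SparkSession named spark; whichever malformed call is used, the point is that the failure 
should surface as an AnalysisException rather than an IllegalArgumentException):

{code:scala}
import org.apache.spark.sql.AnalysisException

// A malformed call to the built-in table-valued function `range`. Argument
// mismatches should fail analysis like any other resolution error.
try {
  spark.sql("SELECT * FROM range(CAST(NULL AS INT))").collect()
} catch {
  case e: AnalysisException => println(s"analysis error: ${e.getMessage}")
  case e: IllegalArgumentException => println(s"pre-fix behavior: ${e.getMessage}")
}
{code}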






[jira] [Resolved] (SPARK-24311) Refactor HDFSBackedStateStoreProvider to remove duplicated logic between operations on delta file and snapshot file

2018-07-31 Thread Jungtaek Lim (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-24311.
--
Resolution: Won't Fix

As I got feedback on 
[https://github.com/apache/spark/pull/21357#issuecomment-409427970] the patch 
is unlikely accepted. Closing this for now.

> Refactor HDFSBackedStateStoreProvider to remove duplicated logic between 
> operations on delta file and snapshot file
> ---
>
> Key: SPARK-24311
> URL: https://issues.apache.org/jira/browse/SPARK-24311
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> The structure of the delta file and the snapshot file is the same, but the operations 
> on them are defined as separate methods, which duplicates logic between the delta 
> file and the snapshot file.
> We can refactor to remove the duplicated logic, improving readability and 
> guaranteeing that the delta file and snapshot file structures stay the same.
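
For illustration, a rough sketch of the shape such a refactoring could take (the object and 
method names are made up for the example, not actual HDFSBackedStateStoreProvider members, 
and the record layout shown is only assumed):

{code:scala}
import java.io.DataOutputStream

// Illustrative only: a single record writer shared by the delta-file and
// snapshot-file write paths, so the common layout lives in one place.
object StateFileFormat {
  def writeRecords(out: DataOutputStream,
                   entries: Iterator[(Array[Byte], Array[Byte])]): Unit = {
    entries.foreach { case (key, value) =>
      out.writeInt(key.length)
      out.write(key)
      out.writeInt(value.length)
      out.write(value)
    }
    out.writeInt(-1) // assumed end-of-stream marker shared by both file types
  }
}
{code}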






[jira] [Resolved] (SPARK-24893) Remove the entire CaseWhen if all the outputs are semantic equivalence

2018-07-31 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24893.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21852
[https://github.com/apache/spark/pull/21852]

> Remove the entire CaseWhen if all the outputs are semantic equivalence
> --
>
> Key: SPARK-24893
> URL: https://issues.apache.org/jira/browse/SPARK-24893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Similar to [SPARK-24890], if all the outputs of `CaseWhen` are semantically 
> equivalent, the `CaseWhen` can be removed.
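
A minimal sketch of the kind of expression this targets (Spark shell style; a DataFrame `df` 
with an integer column `x` is assumed):

{code:scala}
import org.apache.spark.sql.functions._

// Both branches evaluate to col("x") + 1, so the conditional contributes
// nothing and the whole CaseWhen can be rewritten to just col("x") + 1.
val redundant  = when(col("x") > 0, col("x") + 1).otherwise(col("x") + 1)
val simplified = col("x") + 1

// df.select(redundant) and df.select(simplified) are semantically equivalent.
{code}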






[jira] [Created] (SPARK-24986) OOM in BufferHolder during writes to a stream

2018-07-31 Thread Sanket Reddy (JIRA)
Sanket Reddy created SPARK-24986:


 Summary: OOM in BufferHolder during writes to a stream
 Key: SPARK-24986
 URL: https://issues.apache.org/jira/browse/SPARK-24986
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0, 2.2.0, 2.1.0
Reporter: Sanket Reddy


We have seen an out-of-memory exception while running one of our prod jobs. We 
expect the memory allocation to be managed by the unified memory manager at run 
time.

The buffer grows during writes roughly as follows: if the row length is 
constant, the buffer does not grow; it keeps resetting and writing values into 
the same buffer. If the rows are variable-length and the data is skewed, with 
very large values to be written, the buffer keeps growing, and the estimator 
that requests the initial execution memory does not appear to account for this. 
Checking the available heap before growing the global buffer might be a viable 
option.

java.lang.OutOfMemoryError: Java heap space
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter.initialize(UnsafeArrayWriter.java:61)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:232)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:221)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:159)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1075)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1091)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1129)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:513)
at 
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:329)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1966)
at 
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:270)
18/06/11 21:18:41 ERROR SparkUncaughtExceptionHandler: [Container in shutdown] 
Uncaught exception in thread Thread[stdout writer for 
Python/bin/python3.6,5,main]
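
As a purely illustrative sketch of the guard suggested above (this is not BufferHolder's 
actual code; the method, growth policy and error handling are assumptions):

{code:scala}
// Illustrative only: check the free heap before doubling a write buffer,
// instead of letting the doubling itself trigger an OutOfMemoryError.
def growBuffer(buffer: Array[Byte], neededBytes: Int): Array[Byte] = {
  val newLength =
    math.min(Int.MaxValue - 15L, (buffer.length.toLong + neededBytes) * 2).toInt
  val runtime = Runtime.getRuntime
  val freeHeap = runtime.maxMemory - (runtime.totalMemory - runtime.freeMemory)
  if (newLength > freeHeap) {
    throw new IllegalStateException(
      s"Cannot grow buffer to $newLength bytes: only $freeHeap bytes of heap remain")
  }
  java.util.Arrays.copyOf(buffer, newLength)
}
{code}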






[jira] [Created] (SPARK-24985) Executing SQL with "Full Outer Join" on top of large tables when there is data skew met OOM

2018-07-31 Thread sheperd huang (JIRA)
sheperd huang created SPARK-24985:
-

 Summary: Executing SQL with "Full Outer Join" on top of large 
tables when there is data skew met OOM
 Key: SPARK-24985
 URL: https://issues.apache.org/jira/browse/SPARK-24985
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: sheperd huang


When we run SQL with a "Full Outer Join" on large tables with data skew, 
we found it quite easy to hit OOM. We initially thought we had hit 
https://issues.apache.org/jira/browse/SPARK-13450, but taking a look at the fix in 
[https://github.com/apache/spark/pull/16909] we found that the PR doesn't handle 
the "Full Outer Join" case.

The root cause of the OOM is that there are a lot of rows with the same key.

See the code below:
{code:java}
private def findMatchingRows(matchingKey: InternalRow): Unit = {
  leftMatches.clear()
  rightMatches.clear()
  leftIndex = 0
  rightIndex = 0
  while (leftRowKey != null && keyOrdering.compare(leftRowKey, matchingKey) == 0) {
    leftMatches += leftRow.copy()
    advancedLeft()
  }
  while (rightRowKey != null && keyOrdering.compare(rightRowKey, matchingKey) == 0) {
    rightMatches += rightRow.copy()
    advancedRight()
  }
}
{code}
It seems we haven't limited the data added to leftMatches and rightMatches.
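
For illustration, a self-contained sketch of one possible guard (hypothetical; a real fix 
would more likely spill the buffered rows to disk rather than fail, but it shows where a 
bound on the per-key buffers could go):

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Illustrative stand-in for the unbounded leftMatches/rightMatches buffers:
// refuse to buffer more than a fixed number of rows for a single join key.
class BoundedMatchBuffer[T](maxRows: Int) {
  private val rows = ArrayBuffer.empty[T]

  def +=(row: T): Unit = {
    if (rows.length >= maxRows) {
      throw new IllegalStateException(
        s"More than $maxRows rows share a single join key; the key is heavily skewed")
    }
    rows += row
  }

  def length: Int = rows.length
  def clear(): Unit = rows.clear()
}
{code}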






[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-07-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564570#comment-16564570
 ] 

Apache Spark commented on SPARK-23874:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/21939

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-31 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564562#comment-16564562
 ] 

Saisai Shao commented on SPARK-24615:
-

Leveraging dynamic allocation to tear down and bring up desired executors is a 
non-goal here; we will address it in the future. Currently we're still focusing 
on using static allocation like spark.executor.gpus.
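
For context, a hypothetical example of what such static allocation might look like 
(spark.executor.gpus is the proposed property mentioned above, not an existing Spark 2.x 
configuration):

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical: reserve a fixed number of GPUs per executor via a static config.
val spark = SparkSession.builder()
  .appName("gpu-aware-job")
  .config("spark.executor.gpus", "2") // proposed property, not yet implemented
  .getOrCreate()
{code}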

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on a node, so using CPU cores to 
> schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details; it just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Comment Edited] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-31 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564562#comment-16564562
 ] 

Saisai Shao edited comment on SPARK-24615 at 8/1/18 12:35 AM:
--

Leveraging dynamic allocation to tear down and bring up desired executors is a 
non-goal here; we will address it in the future. Currently we're still focusing 
on using static allocation like spark.executor.gpus.


was (Author: jerryshao):
Leveraging dynamic allocation to tear down and bring up desired executor is a 
non-goal here, we will address it in feature, currently we're still focusing on 
using static allocation like spark.executor.gpus.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on a node, so using CPU cores to 
> schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details; it just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-31 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564558#comment-16564558
 ] 

Erik Erlandson commented on SPARK-24615:


Am I understanding correctly that this can't assign executors to desired 
resources without resorting to Dynamic Allocation to tear down an Executor and 
reallocate it somewhere else?

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on a node, so using CPU cores to 
> schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details; it just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Resolved] (SPARK-24976) Allow None for Decimal type conversion (specific to PyArrow 0.9.0)

2018-07-31 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-24976.
--
   Resolution: Fixed
Fix Version/s: 2.3.2
   2.4.0

Issue resolved by pull request 21928
[https://github.com/apache/spark/pull/21928]

> Allow None for Decimal type conversion (specific to PyArrow 0.9.0)
> --
>
> Key: SPARK-24976
> URL: https://issues.apache.org/jira/browse/SPARK-24976
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0, 2.3.2
>
>
> See https://jira.apache.org/jira/browse/ARROW-2432
> If we use Arrow 0.9.0, the the test case (None as decimal) failed as below:
> {code}
> Traceback (most recent call last):
>   File "/.../spark/python/pyspark/sql/tests.py", line 4672, in 
> test_vectorized_udf_null_decimal
> self.assertEquals(df.collect(), res.collect())
>   File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect
> sock_info = self._jdf.collectToPython()
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
> 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o51.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 
> in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 
> (TID 7, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/.../spark/python/pyspark/worker.py", line 320, in main
> process()
>   File "/.../spark/python/pyspark/worker.py", line 315, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream
> batch = _create_batch(series, self._timezone)
>   File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch
> arrs = [create_array(s, t) for s, t in series]
>   File "/.../spark/python/pyspark/serializers.py", line 241, in create_array
> return pa.Array.from_pandas(s, mask=mask, type=t)
>   File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code} 






[jira] [Assigned] (SPARK-24976) Allow None for Decimal type conversion (specific to PyArrow 0.9.0)

2018-07-31 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-24976:


Assignee: Hyukjin Kwon

> Allow None for Decimal type conversion (specific to PyArrow 0.9.0)
> --
>
> Key: SPARK-24976
> URL: https://issues.apache.org/jira/browse/SPARK-24976
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> See https://jira.apache.org/jira/browse/ARROW-2432
> If we use Arrow 0.9.0, the the test case (None as decimal) failed as below:
> {code}
> Traceback (most recent call last):
>   File "/.../spark/python/pyspark/sql/tests.py", line 4672, in 
> test_vectorized_udf_null_decimal
> self.assertEquals(df.collect(), res.collect())
>   File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect
> sock_info = self._jdf.collectToPython()
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
> 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o51.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 
> in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 
> (TID 7, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/.../spark/python/pyspark/worker.py", line 320, in main
> process()
>   File "/.../spark/python/pyspark/worker.py", line 315, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream
> batch = _create_batch(series, self._timezone)
>   File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch
> arrs = [create_array(s, t) for s, t in series]
>   File "/.../spark/python/pyspark/serializers.py", line 241, in create_array
> return pa.Array.from_pandas(s, mask=mask, type=t)
>   File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code} 






[jira] [Commented] (SPARK-19394) "assertion failed: Expected hostname" on macOS when self-assigned IP contains a percent sign

2018-07-31 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564532#comment-16564532
 ] 

Yuming Wang commented on SPARK-19394:
-

Try to add {{::1             localhost}} to /etc/hosts.

> "assertion failed: Expected hostname" on macOS when self-assigned IP contains 
> a percent sign
> 
>
> Key: SPARK-19394
> URL: https://issues.apache.org/jira/browse/SPARK-19394
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> See [this question on 
> StackOverflow|http://stackoverflow.com/q/41914586/1305344].
> {quote}
> So when I am not connected to internet, spark shell fails to load in local 
> mode. I am running Apache Spark 2.1.0 downloaded from internet, running on my 
> Mac. So I run ./bin/spark-shell and it gives me the error below.
> So I have read the Spark code and it is using Java's 
> InetAddress.getLocalHost() to find the localhost's IP address. So when I am 
> connected to internet, I get back an IPv4 with my local hostname.
> scala> InetAddress.getLocalHost
> res9: java.net.InetAddress = AliKheyrollahis-MacBook-Pro.local/192.168.1.26
> but the key is, when disconnected, I get an IPv6 with a percentage in the 
> values (it is scoped):
> scala> InetAddress.getLocalHost
> res10: java.net.InetAddress = 
> AliKheyrollahis-MacBook-Pro.local/fe80:0:0:0:2b9a:4521:a301:e9a5%10
> And this IP is the same as the one you see in the error message. I feel my 
> problem is that it throws Spark since it cannot handle %10 in the result.
> ...
> 17/01/28 22:03:28 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
> http://fe80:0:0:0:2b9a:4521:a301:e9a5%10:4040
> 17/01/28 22:03:28 INFO Executor: Starting executor ID driver on host localhost
> 17/01/28 22:03:28 INFO Executor: Using REPL class URI: 
> spark://fe80:0:0:0:2b9a:4521:a301:e9a5%10:56107/classes
> 17/01/28 22:03:28 ERROR SparkContext: Error initializing SparkContext.
> java.lang.AssertionError: assertion failed: Expected hostname
> at scala.Predef$.assert(Predef.scala:170)
> at org.apache.spark.util.Utils$.checkHost(Utils.scala:931)
> at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:31)
> at org.apache.spark.executor.Executor.<init>(Executor.scala:121)
> at 
> org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:59)
> at 
> org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
> at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
> at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
> at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
> {quote}






[jira] [Commented] (SPARK-24977) input_file_name() result can't save and use for partitionBy()

2018-07-31 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564490#comment-16564490
 ] 

kevin yu commented on SPARK-24977:
--

Hello Srinivasarao: Can you show the steps where you encountered the problem? I 
just did a quick test and it seems to work fine, but I'm not sure it is the same 
as yours.

 

scala> spark.read.textFile("file:///etc/passwd")
res3: org.apache.spark.sql.Dataset[String] = [value: string]

scala> res3.select(input_file_name() as "input", expr("10 as col2")).write.partitionBy("input").saveAsTable("passwd3")
18/07/31 16:11:59 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

scala> spark.sql("select * from passwd3").show
+----+------------------+
|col2|             input|
+----+------------------+
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
+----+------------------+
only showing top 20 rows

 

> input_file_name() result can't save and use for partitionBy()
> -
>
> Key: SPARK-24977
> URL: https://issues.apache.org/jira/browse/SPARK-24977
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.3.1
>Reporter: Srinivasarao Padala
>Priority: Major
>







[jira] [Created] (SPARK-24984) Spark Streaming with xml data

2018-07-31 Thread Kavya (JIRA)
Kavya created SPARK-24984:
-

 Summary: Spark Streaming with xml data 
 Key: SPARK-24984
 URL: https://issues.apache.org/jira/browse/SPARK-24984
 Project: Spark
  Issue Type: New Feature
  Components: DStreams
Affects Versions: 2.3.1
Reporter: Kavya
 Fix For: 0.8.2


Hi,

We have a Kinesis producer which reads XML data, but I'm currently unable to 
parse the XML using Spark Streaming. I'm able to read the data, but unable to 
parse it into a schema.

I'm able to read the data via 
spark.option("format","com..xml").load(), which provides the 
schema, but I don't want to read via files.






[jira] [Updated] (SPARK-24983) Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver

2018-07-31 Thread David Vogelbacher (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Vogelbacher updated SPARK-24983:
--
Description: 
I noticed that writing a spark job that includes many sequential 
{{when-otherwise}} statements on the same column can easily OOM the driver 
while generating the optimized plan because the project node will grow 
exponentially in size.

Example:
{noformat}
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = Seq("a", "b", "c", "1").toDF("text")
df: org.apache.spark.sql.DataFrame = [text: string]

scala> var dfCaseWhen = df.filter($"text" =!= lit("0"))
dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: string]

scala> for( a <- 1 to 5) {
     |   dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === lit(a.toString), lit("r" + a.toString)).otherwise($"text"))
     | }

scala> dfCaseWhen.queryExecution.analyzed
res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14]
+- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12]
   +- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10]
      +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8]
         +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6]
            +- Filter NOT (text#3 = 0)
               +- Project [value#1 AS text#3]
                  +- LocalRelation [value#1]

scala> dfCaseWhen.queryExecution.optimizedPlan
res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) 
THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE 
value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 
ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 
END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) 
THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE 
value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 
ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 
END END END END = 5) THEN r5 ELSE CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN 
(value#1 = 1) THEN r1 ELSE va...
{noformat}

As one can see, the optimized plan grows exponentially in the number of 
{{when-otherwise}} statements here.

I can see that this comes from the {{CollapseProject}} optimizer rule.
Maybe we should put a limit on the resulting size of the project node after 
collapsing and only collapse if we stay under the limit.

  was:
Hi,
I noticed that writing a spark job that includes many sequential when-otherwise 
statements on the same column can easily OOM the driver while generating the 
optimized plan because the project node will grow exponentially in size.

Example:
{noformat}
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = Seq("a", "b", "c", "1").toDF("text")
df: org.apache.spark.sql.DataFrame = [text: string]

scala> var dfCaseWhen = df.filter($"text" =!= lit("0"))
dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: string]

scala> for( a <- 1 to 5) {
     |   dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === lit(a.toString), lit("r" + a.toString)).otherwise($"text"))
     | }

scala> dfCaseWhen.queryExecution.analyzed
res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14]
+- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12]
   +- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10]
      +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8]
         +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6]
            +- Filter NOT (text#3 = 0)
               +- Project [value#1 AS text#3]
                  +- LocalRelation [value#1]

scala> dfCaseWhen.queryExecution.optimizedPlan
res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) 
THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE 
value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 
ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 
END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) 
THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE 
value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 
ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 
END END END END = 5) THEN r5 ELSE CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN 
(value#1 = 1) 

[jira] [Created] (SPARK-24983) Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver

2018-07-31 Thread David Vogelbacher (JIRA)
David Vogelbacher created SPARK-24983:
-

 Summary: Collapsing multiple project statements with dependent 
When-Otherwise statements on the same column can OOM the driver
 Key: SPARK-24983
 URL: https://issues.apache.org/jira/browse/SPARK-24983
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.3.1
Reporter: David Vogelbacher


Hi,
I noticed that writing a spark job that includes many sequential when-otherwise 
statements on the same column can easily OOM the driver while generating the 
optimized plan because the project node will grow exponentially in size.

Example:
{noformat}
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = Seq("a", "b", "c", "1").toDF("text")
df: org.apache.spark.sql.DataFrame = [text: string]

scala> var dfCaseWhen = df.filter($"text" =!= lit("0"))
dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: string]

scala> for( a <- 1 to 5) {
     |   dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === lit(a.toString), lit("r" + a.toString)).otherwise($"text"))
     | }

scala> dfCaseWhen.queryExecution.analyzed
res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14]
+- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12]
   +- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10]
      +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8]
         +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6]
            +- Filter NOT (text#3 = 0)
               +- Project [value#1 AS text#3]
                  +- LocalRelation [value#1]

scala> dfCaseWhen.queryExecution.optimizedPlan
res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) 
THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE 
value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 
ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 
END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) 
THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE 
value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 
ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 
END END END END = 5) THEN r5 ELSE CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN 
(value#1 = 1) THEN r1 ELSE va...
{noformat}

As one can see, the optimized plan grows exponentially in the number of 
{{when-otherwise}} statements here.

I can see that this comes from the {{CollapseProject}} optimizer rule.
Maybe we should put a limit on the resulting size of the project node after 
collapsing and only collapse if we stay under the limit.
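
For illustration, a rough sketch of the suggested guard (hypothetical helper, not the actual 
CollapseProject rule; the size metric and threshold are assumptions):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.NamedExpression
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

// Hypothetical: only collapse two adjacent Projects when the substituted
// project list stays under a size budget; otherwise keep the plan as-is
// (here `upper.child` is assumed to be `lower`).
def collapseIfSmallEnough(upper: Project, lower: Project,
                          substituted: Seq[NamedExpression],
                          maxNodes: Int = 10000): LogicalPlan = {
  val nodeCount = substituted.map(e => e.collect { case n => n }.size).sum
  if (nodeCount <= maxNodes) Project(substituted, lower.child) else upper
}
{code}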






[jira] [Assigned] (SPARK-24982) UDAF resolution should not throw java.lang.AssertionError

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24982:


Assignee: Reynold Xin  (was: Apache Spark)

> UDAF resolution should not throw java.lang.AssertionError
> -
>
> Key: SPARK-24982
> URL: https://issues.apache.org/jira/browse/SPARK-24982
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> See udaf.sql.out:
>  
> {code:java}
> -- !query 3
>  SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1
>  -- !query 3 schema
>  struct<>
>  -- !query 3 output
>  java.lang.AssertionError
>  assertion failed: Incorrect number of children1{code}
>  
> We should never throw AssertionError unless there is a bug in the system 
> itself.
>  
>  






[jira] [Commented] (SPARK-24982) UDAF resolution should not throw java.lang.AssertionError

2018-07-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564364#comment-16564364
 ] 

Apache Spark commented on SPARK-24982:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21938

> UDAF resolution should not throw java.lang.AssertionError
> -
>
> Key: SPARK-24982
> URL: https://issues.apache.org/jira/browse/SPARK-24982
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> See udaf.sql.out:
>  
> {code:java}
> -- !query 3
>  SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1
>  -- !query 3 schema
>  struct<>
>  -- !query 3 output
>  java.lang.AssertionError
>  assertion failed: Incorrect number of children1{code}
>  
> We should never throw AssertionError unless there is a bug in the system 
> itself.
>  
>  






[jira] [Assigned] (SPARK-24982) UDAF resolution should not throw java.lang.AssertionError

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24982:


Assignee: Apache Spark  (was: Reynold Xin)

> UDAF resolution should not throw java.lang.AssertionError
> -
>
> Key: SPARK-24982
> URL: https://issues.apache.org/jira/browse/SPARK-24982
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Major
>
> See udaf.sql.out:
>  
> {code:java}
> -- !query 3
>  SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1
>  -- !query 3 schema
>  struct<>
>  -- !query 3 output
>  java.lang.AssertionError
>  assertion failed: Incorrect number of children1{code}
>  
> We should never throw AssertionError unless there is a bug in the system 
> itself.
>  
>  






[jira] [Assigned] (SPARK-24982) UDAF resolution should not throw java.lang.AssertionError

2018-07-31 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-24982:
---

Assignee: Reynold Xin

> UDAF resolution should not throw java.lang.AssertionError
> -
>
> Key: SPARK-24982
> URL: https://issues.apache.org/jira/browse/SPARK-24982
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> See udaf.sql.out:
>  
> {code:java}
> -- !query 3
>  SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1
>  -- !query 3 schema
>  struct<>
>  -- !query 3 output
>  java.lang.AssertionError
>  assertion failed: Incorrect number of children1{code}
>  
> We should never throw AssertionError unless there is a bug in the system 
> itself.
>  
>  






[jira] [Comment Edited] (SPARK-24882) data source v2 API improvement

2018-07-31 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564362#comment-16564362
 ] 

Ryan Blue edited comment on SPARK-24882 at 7/31/18 9:02 PM:


{quote}the problem is then we need to make `CatalogSupport` a must-have for 
data sources instead of an optional plugin
{quote}
Data sources are read and write implementations. Catalog support should be a 
layer above read/write implementation that is used to provide CTAS and other 
table-level support.

If you're interested in the anonymous table use case from the email discussion, 
I posted a suggestion there to add an {{anonymousTable}} function to 
{{DataSourceV2}}. That allows a source instantiated directly through v1-style 
reflection to provide a {{Table}} based on an options map. Then that table 
would implement {{ReadSupport}} and {{WriteSupport}} as I've suggested in this 
thread. That would preserve the ability to instantiate a source directly and 
use it, and would center around a {{Table}} that implements the read and write 
traits.

An alternative to the {{anonymousTable}} method is what I did in the WIP pull 
request for CTAS. In that PR, I created two ways to work with {{DataSourceV2}}: 
through the existing {{DataSourceV2Relation}} and through a new 
{{TableV2Relation}}. The first is for {{DataSourceV2}} instances that implement 
the read and write traits, while the latter is for {{Table}} objects that 
implement them. Either way works, though it would be cleaner to just use 
{{Table}}.

 

Thanks for the builder update! Immutability is the most important part, but I'd 
still prefer a builder interface with default methods instead of the mix-in 
traits.


was (Author: rdblue):
{quote}the problem is then we need to make `CatalogSupport` a must-have for 
data sources instead of an optional plugin
{quote}
Data sources are read and write implementations. Catalog support should be a 
layer above read/write implementation that is used to provide CTAS and other 
table-level support. If you're interested in the anonymous table use case from 
the email discussion, I posted a suggestion there to add an {{anonymousTable}} 
function to {{DataSourceV2}}. That allows a source instantiated directly 
through v1-style reflection to provide a {{Table}} based on an options map. 
Then that table would implement {{ReadSupport}} and {{WriteSupport}} as I've 
suggested in this thread. That would preserve the ability to instantiate a 
source directly and use it, and would center around a {{Table}} that implements 
the read and write traits.

An alternative to the {{anonymousTable}} method is what I did in the WIP pull 
request for CTAS. In that PR, I created two ways to work with {{DataSourceV2}}: 
through the existing {{DataSourceV2Relation}} and through a new 
{{TableV2Relation}}. The first is for {{DataSourceV2}} instances that implement 
the read and write traits, while the latter is for {{Table}} objects that 
implement them. Either way works, though it would be cleaner to just use 
{{Table}}.

 

Thanks for the builder update! Immutability is the most important part, but I'd 
still prefer a builder interface with default methods instead of the mix-in 
traits.

> data source v2 API improvement
> --
>
> Key: SPARK-24882
> URL: https://issues.apache.org/jira/browse/SPARK-24882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> Data source V2 has been out for a while; see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source V2 API, isolate the stateful part of the API, and think of better 
> names for some interfaces. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing






[jira] [Commented] (SPARK-24882) data source v2 API improvement

2018-07-31 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564362#comment-16564362
 ] 

Ryan Blue commented on SPARK-24882:
---

{quote}the problem is then we need to make `CatalogSupport` a must-have for 
data sources instead of an optional plugin
{quote}
Data sources are read and write implementations. Catalog support should be a 
layer above read/write implementation that is used to provide CTAS and other 
table-level support. If you're interested in the anonymous table use case from 
the email discussion, I posted a suggestion there to add an {{anonymousTable}} 
function to {{DataSourceV2}}. That allows a source instantiated directly 
through v1-style reflection to provide a {{Table}} based on an options map. 
Then that table would implement {{ReadSupport}} and {{WriteSupport}} as I've 
suggested in this thread. That would preserve the ability to instantiate a 
source directly and use it, and would center around a {{Table}} that implements 
the read and write traits.

An alternative to the {{anonymousTable}} method is what I did in the WIP pull 
request for CTAS. In that PR, I created two ways to work with {{DataSourceV2}}: 
through the existing {{DataSourceV2Relation}} and through a new 
{{TableV2Relation}}. The first is for {{DataSourceV2}} instances that implement 
the read and write traits, while the latter is for {{Table}} objects that 
implement them. Either way works, though it would be cleaner to just use 
{{Table}}.
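
A rough sketch of the shape this comment describes (trait names follow the comment; this is 
not the final or actual API):

{code:scala}
// Sketch only, following the naming in the comment above.
trait ReadSupport
trait WriteSupport

// A Table is the unit that carries the read/write capability.
trait Table extends ReadSupport with WriteSupport

// A source instantiated reflectively (v1 style) can still hand back a Table
// built from its options, per the anonymousTable suggestion.
trait DataSourceV2 {
  def anonymousTable(options: Map[String, String]): Table
}
{code}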

 

Thanks for the builder update! Immutability is the most important part, but I'd 
still prefer a builder interface with default methods instead of the mix-in 
traits.

> data source v2 API improvement
> --
>
> Key: SPARK-24882
> URL: https://issues.apache.org/jira/browse/SPARK-24882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> Data source V2 has been out for a while; see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source V2 API, isolate the stateful part of the API, and think of better 
> names for some interfaces. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing






[jira] [Assigned] (SPARK-24973) Add numIter to Python ClusteringSummary

2018-07-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-24973:
-

Assignee: Huaxin Gao

> Add numIter to Python ClusteringSummary 
> 
>
> Key: SPARK-24973
> URL: https://issues.apache.org/jira/browse/SPARK-24973
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> -SPARK-23528- added numIter to ClusteringSummary. Will add numIter to Python 
> version of ClusteringSummary.






[jira] [Updated] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0

2018-07-31 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18057:
-
Summary: Update structured streaming kafka from 0.10.0.1 to 2.0.0  (was: 
Update structured streaming kafka from 0.10.0.1 to 1.1.0)

> Update structured streaming kafka from 0.10.0.1 to 2.0.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
> Fix For: 2.4.0
>
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html






[jira] [Resolved] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0

2018-07-31 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-18057.
--
   Resolution: Fixed
 Assignee: Ted Yu
Fix Version/s: 2.4.0

> Update structured streaming kafka from 0.10.0.1 to 2.0.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Assignee: Ted Yu
>Priority: Major
> Fix For: 2.4.0
>
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23914) High-order function: array_union(x, y) → array

2018-07-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564227#comment-16564227
 ] 

Apache Spark commented on SPARK-23914:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21937

> High-order function: array_union(x, y) → array
> --
>
> Key: SPARK-23914
> URL: https://issues.apache.org/jira/browse/SPARK-23914
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array of the elements in the union of x and y, without duplicates.
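
For reference, a small sketch of how these semantics surface in PySpark, 
assuming pyspark.sql.functions.array_union ships with the same 2.4 release this 
sub-task targets:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_union

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([1, 2, 2, 3], [2, 4])], ["x", "y"])

# Union of the two arrays, with duplicates removed.
df.select(array_union("x", "y").alias("u")).show(truncate=False)
# expected value of u: [1, 2, 3, 4]
{code}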



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24982) UDAF resolution should not throw java.lang.AssertionError

2018-07-31 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-24982:

Description: 
See udaf.sql.out:

 
{code:java}
-- !query 3
SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1
-- !query 3 schema
struct<>
-- !query 3 output
java.lang.AssertionError
assertion failed: Incorrect number of children1{code}
 

We should never throw AssertionError unless there is a bug in the system itself.

 

 

  was:
See udaf.sql.out:

 

-- !query 3
SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1
-- !query 3 schema
struct<>
-- !query 3 output
java.lang.AssertionError
assertion failed: Incorrect number of children1

 

We should never throw AssertionError unless there is a bug in the system itself.

 

 


> UDAF resolution should not throw java.lang.AssertionError
> -
>
> Key: SPARK-24982
> URL: https://issues.apache.org/jira/browse/SPARK-24982
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Reynold Xin
>Priority: Major
>
> See udaf.sql.out:
>  
> {code:java}
> -- !query 3
> SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1
> -- !query 3 schema
> struct<>
> -- !query 3 output
> java.lang.AssertionError
> assertion failed: Incorrect number of children1{code}
>  
> We should never throw AssertionError unless there is a bug in the system 
> itself.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24982) UDAF resolution should not throw java.lang.AssertionError

2018-07-31 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-24982:
---

 Summary: UDAF resolution should not throw java.lang.AssertionError
 Key: SPARK-24982
 URL: https://issues.apache.org/jira/browse/SPARK-24982
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1, 2.3.0
Reporter: Reynold Xin


See udaf.sql.out:

 

-- !query 3
SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1
-- !query 3 schema
struct<>
-- !query 3 output
java.lang.AssertionError
assertion failed: Incorrect number of children1

 

We should never throw AssertionError unless there is a bug in the system itself.
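
For context, default.myDoubleAvg is the one-argument UDAF exercised by udaf.sql 
in Spark's SQL tests; the expectation is that an arity mismatch is rejected 
during analysis, the way built-in functions already are. A small PySpark sketch 
of that expected behaviour, using a built-in aggregate purely for illustration:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()

try:
    # A built-in aggregate invoked with the wrong number of arguments is
    # rejected with an AnalysisException; UDAF resolution should fail the
    # same way rather than raising an internal AssertionError.
    spark.sql("SELECT approx_count_distinct() FROM range(10)").collect()
except AnalysisException as err:
    print("rejected at analysis time:", err)
{code}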

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24981) ShutdownHook timeout causes job to fail when succeeded when SparkContext stop() not called by user program

2018-07-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564175#comment-16564175
 ] 

Apache Spark commented on SPARK-24981:
--

User 'hthuynh2' has created a pull request for this issue:
https://github.com/apache/spark/pull/21936

> ShutdownHook timeout causes job to fail when succeeded when SparkContext 
> stop() not called by user program
> --
>
> Key: SPARK-24981
> URL: https://issues.apache.org/jira/browse/SPARK-24981
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Hieu Tri Huynh
>Priority: Minor
>
> When the user does not stop the SparkContext at the end of their program, 
> ShutdownHookManager will stop it. However, each shutdown hook is only given 
> 10s to run; it is interrupted and cancelled after that. If stopping the 
> SparkContext takes longer than 10s, an InterruptedException is thrown and the 
> job is marked as failed even though it had already succeeded. An example of 
> this is shown below.
> I think there are a few ways to fix this; below are the two I have now:
> 1. After the user program finishes, check whether it stopped the SparkContext. 
> If it didn't, stop it before finishing the userThread. That way 
> SparkContext.stop() can take as much time as it needs.
> 2. Catch the InterruptedException thrown by ShutdownHookManager while we are 
> stopping the SparkContext and ignore whatever hasn't been stopped inside it. 
> Since we are shutting down, I think it is okay to ignore those things.
>  
> {code:java}
> 18/07/31 17:11:49 ERROR Utils: Uncaught exception in thread pool-4-thread-1
> java.lang.InterruptedException
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Thread.join(Thread.java:1249)
>   at java.lang.Thread.join(Thread.java:1323)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:136)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1922)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1921)
>   at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 18/07/31 17:11:49 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, 
> java.util.concurrent.TimeoutException
> java.util.concurrent.TimeoutException
>   at 

[jira] [Assigned] (SPARK-24981) ShutdownHook timeout causes job to fail when succeeded when SparkContext stop() not called by user program

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24981:


Assignee: Apache Spark

> ShutdownHook timeout causes job to fail when succeeded when SparkContext 
> stop() not called by user program
> --
>
> Key: SPARK-24981
> URL: https://issues.apache.org/jira/browse/SPARK-24981
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Hieu Tri Huynh
>Assignee: Apache Spark
>Priority: Minor
>
> When the user does not stop the SparkContext at the end of their program, 
> ShutdownHookManager will stop it. However, each shutdown hook is only given 
> 10s to run; it is interrupted and cancelled after that. If stopping the 
> SparkContext takes longer than 10s, an InterruptedException is thrown and the 
> job is marked as failed even though it had already succeeded. An example of 
> this is shown below.
> I think there are a few ways to fix this; below are the two I have now:
> 1. After the user program finishes, check whether it stopped the SparkContext. 
> If it didn't, stop it before finishing the userThread. That way 
> SparkContext.stop() can take as much time as it needs.
> 2. Catch the InterruptedException thrown by ShutdownHookManager while we are 
> stopping the SparkContext and ignore whatever hasn't been stopped inside it. 
> Since we are shutting down, I think it is okay to ignore those things.
>  
> {code:java}
> 18/07/31 17:11:49 ERROR Utils: Uncaught exception in thread pool-4-thread-1
> java.lang.InterruptedException
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Thread.join(Thread.java:1249)
>   at java.lang.Thread.join(Thread.java:1323)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:136)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1922)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1921)
>   at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 18/07/31 17:11:49 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, 
> java.util.concurrent.TimeoutException
> java.util.concurrent.TimeoutException
>   at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>   at 
> 

[jira] [Assigned] (SPARK-24981) ShutdownHook timeout causes job to fail when succeeded when SparkContext stop() not called by user program

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24981:


Assignee: (was: Apache Spark)

> ShutdownHook timeout causes job to fail when succeeded when SparkContext 
> stop() not called by user program
> --
>
> Key: SPARK-24981
> URL: https://issues.apache.org/jira/browse/SPARK-24981
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Hieu Tri Huynh
>Priority: Minor
>
> When the user does not stop the SparkContext at the end of their program, 
> ShutdownHookManager will stop it. However, each shutdown hook is only given 
> 10s to run; it is interrupted and cancelled after that. If stopping the 
> SparkContext takes longer than 10s, an InterruptedException is thrown and the 
> job is marked as failed even though it had already succeeded. An example of 
> this is shown below.
> I think there are a few ways to fix this; below are the two I have now:
> 1. After the user program finishes, check whether it stopped the SparkContext. 
> If it didn't, stop it before finishing the userThread. That way 
> SparkContext.stop() can take as much time as it needs.
> 2. Catch the InterruptedException thrown by ShutdownHookManager while we are 
> stopping the SparkContext and ignore whatever hasn't been stopped inside it. 
> Since we are shutting down, I think it is okay to ignore those things.
>  
> {code:java}
> 18/07/31 17:11:49 ERROR Utils: Uncaught exception in thread pool-4-thread-1
> java.lang.InterruptedException
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Thread.join(Thread.java:1249)
>   at java.lang.Thread.join(Thread.java:1323)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:136)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1922)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1921)
>   at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 18/07/31 17:11:49 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, 
> java.util.concurrent.TimeoutException
> java.util.concurrent.TimeoutException
>   at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>   at 
> 

[jira] [Commented] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12

2018-07-31 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564172#comment-16564172
 ] 

Stavros Kontopoulos commented on SPARK-14643:
-

Cool thanks!

> Remove overloaded methods which become ambiguous in Scala 2.12
> --
>
> Key: SPARK-14643
> URL: https://issues.apache.org/jira/browse/SPARK-14643
> Project: Spark
>  Issue Type: Task
>  Components: Build, Project Infra
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>
> Spark 1.x's Dataset API runs into subtle source incompatibility problems for 
> Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a 
> nutshell, the current API has overloaded methods whose signatures are 
> ambiguous when resolving calls that use the Java 8 lambda syntax (only if 
> Spark is built against Scala 2.12).
> This issue is somewhat subtle, so there's a full writeup at 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing
>  which describes the exact circumstances under which the current APIs are 
> problematic. The writeup also proposes a solution which involves the removal 
> of certain overloads only in Scala 2.12 builds of Spark and the introduction 
> of implicit conversions for retaining source compatibility.
> We don't need to implement any of these changes until we add Scala 2.12 
> support since the changes must only be applied when building against Scala 
> 2.12 and will be done via traits + shims which are mixed in via 
> per-Scala-version source directories (like how we handle the 
> Scala-version-specific parts of the REPL). For now, this JIRA acts as a 
> placeholder so that the parent JIRA reflects the complete set of tasks which 
> need to be finished for 2.12 support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12

2018-07-31 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564148#comment-16564148
 ] 

Stavros Kontopoulos edited comment on SPARK-14643 at 7/31/18 6:38 PM:
--

[~srowen] Right now there is no such change in the implementation. As noted in 
the description of the PR,

we are not going to fix this for now, and as [~lrytz] mentions in the doc:

"The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would 
continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume 
this would happen in a major release that is also allowed to break binary 
compatibility, so then the conflicting overload could be removed."

What is implemented is what is described in the action plan in the document. I 
though we reached an agreement there (as I see from the comments).

If we were going to implement this, there are a couple of options mentioned in the 
doc:
 * A change to the Scala compiler, which we can do in 2.12.7 ([prototype 
here|https://github.com/scala/scala/compare/2.12.x...lrytz:synthetic?expand=1]).
 The Scala 2.11 compiler does not need to change.
 * A Scala compiler plugin could do the same for existing compiler releases.
 * A post-processing step in the build could modify the class files using ASM.

The last option does not depend on the Scala compiler, but I don't know the implications.

 

 


was (Author: skonto):
[~srowen] Right now there is no such change in the implementation. As noted in 
the description of the PR,

we are not going to fix this for now, and as [~lrytz] mentions in the doc:

"The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would 
continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume 
this would happen in a major release that is also allowed to break binary 
compatibility, so then the conflicting overload could be removed."

What is implemented is what is described in the action plan in the document.

 

> Remove overloaded methods which become ambiguous in Scala 2.12
> --
>
> Key: SPARK-14643
> URL: https://issues.apache.org/jira/browse/SPARK-14643
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>
> Spark 1.x's Dataset API runs into subtle source incompatibility problems for 
> Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a 
> nutshell, the current API has overloaded methods whose signatures are 
> ambiguous when resolving calls that use the Java 8 lambda syntax (only if 
> Spark is built against Scala 2.12).
> This issue is somewhat subtle, so there's a full writeup at 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing
>  which describes the exact circumstances under which the current APIs are 
> problematic. The writeup also proposes a solution which involves the removal 
> of certain overloads only in Scala 2.12 builds of Spark and the introduction 
> of implicit conversions for retaining source compatibility.
> We don't need to implement any of these changes until we add Scala 2.12 
> support since the changes must only be applied when building against Scala 
> 2.12 and will be done via traits + shims which are mixed in via 
> per-Scala-version source directories (like how we handle the 
> Scala-version-specific parts of the REPL). For now, this JIRA acts as a 
> placeholder so that the parent JIRA reflects the complete set of tasks which 
> need to be finished for 2.12 support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12

2018-07-31 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564148#comment-16564148
 ] 

Stavros Kontopoulos edited comment on SPARK-14643 at 7/31/18 6:38 PM:
--

[~srowen] Right now there is no such change in the implementation. As noted in 
the description of the PR,

we are not going to fix this for now, and as [~lrytz] mentions in the doc:

"The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would 
continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume 
this would happen in a major release that is also allowed to break binary 
compatibility, so then the conflicting overload could be removed."

What is implemented is what is described in the action plan in the document. I 
thought we reached an agreement there (as I see from the comments).

If we were going to implement this, there are a couple of options mentioned in the 
doc:
 * A change to the Scala compiler, which we can do in 2.12.7 ([prototype 
here|https://github.com/scala/scala/compare/2.12.x...lrytz:synthetic?expand=1]).
 The Scala 2.11 compiler does not need to change.
 * A Scala compiler plugin could do the same for existing compiler releases.
 * A post-processing step in the build could modify the class files using ASM.

The last option does not depend on the Scala compiler, but I don't know the implications.

 

 


was (Author: skonto):
[~srowen] Right now there is no such change in the implementation. As noted in 
the description of the PR,

we are not going to fix this for now, and as [~lrytz] mentions in the doc:

"The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would 
continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume 
this would happen in a major release that is also allowed to break binary 
compatibility, so then the conflicting overload could be removed."

What is implemented is what is described in the action plan in the document. I 
though we reached an agreement there (as I see from the comments).

If we were going to implement this there a couple of options mentioned in the 
doc:
 * A change to the Scala compiler, which we can do in 2.12.7 ([prototype 
here|https://github.com/scala/scala/compare/2.12.x...lrytz:synthetic?expand=1]).
 The Scala 2.11 compiler does not need to change.
 * A Scala compiler plugin could do the same for existing compiler releases
 * A post-processing step in the build could modify the class files using ASM

The latter does not depend on the scala compiler but dont know the implications.

 

 

> Remove overloaded methods which become ambiguous in Scala 2.12
> --
>
> Key: SPARK-14643
> URL: https://issues.apache.org/jira/browse/SPARK-14643
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>
> Spark 1.x's Dataset API runs into subtle source incompatibility problems for 
> Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a 
> nutshell, the current API has overloaded methods whose signatures are 
> ambiguous when resolving calls that use the Java 8 lambda syntax (only if 
> Spark is built against Scala 2.12).
> This issue is somewhat subtle, so there's a full writeup at 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing
>  which describes the exact circumstances under which the current APIs are 
> problematic. The writeup also proposes a solution which involves the removal 
> of certain overloads only in Scala 2.12 builds of Spark and the introduction 
> of implicit conversions for retaining source compatibility.
> We don't need to implement any of these changes until we add Scala 2.12 
> support since the changes must only be applied when building against Scala 
> 2.12 and will be done via traits + shims which are mixed in via 
> per-Scala-version source directories (like how we handle the 
> Scala-version-specific parts of the REPL). For now, this JIRA acts as a 
> placeholder so that the parent JIRA reflects the complete set of tasks which 
> need to be finished for 2.12 support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24609) PySpark/SparkR doc doesn't explain RandomForestClassifier.featureSubsetStrategy well

2018-07-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24609.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21788
[https://github.com/apache/spark/pull/21788]

> PySpark/SparkR doc doesn't explain 
> RandomForestClassifier.featureSubsetStrategy well
> 
>
> Key: SPARK-24609
> URL: https://issues.apache.org/jira/browse/SPARK-24609
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiangrui Meng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 2.4.0
>
>
> In Scala doc 
> ([https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.ml.classification.RandomForestClassifier)],
>  we have:
>  
> {quote}The number of features to consider for splits at each tree node. 
> Supported options:
>  * "auto": Choose automatically for task: If numTrees == 1, set to "all." If 
> numTrees > 1 (forest), set to "sqrt" for classification and to "onethird" for 
> regression.
>  * "all": use all features
>  * "onethird": use 1/3 of the features
>  * "sqrt": use sqrt(number of features)
>  * "log2": use log2(number of features)
>  * "n": when n is in the range (0, 1.0], use n * number of features. When n 
> is in the range (1, number of features), use n features. (default = "auto")
> These various settings are based on the following references:
>  * log2: tested in Breiman (2001)
>  * sqrt: recommended by Breiman manual for random forests
>  * The defaults of sqrt (classification) and onethird (regression) match the 
> R randomForest package.{quote}
>  
> The entire paragraph is missing in PySpark doc 
> ([https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.RandomForestClassifier.featureSubsetStrategy]).
>  And same issue for SparkR 
> (https://github.com/apache/spark/blob/master/R/pkg/R/mllib_tree.R#L365).
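
For reference, the Param itself is already exposed in PySpark; what is missing 
is the explanatory text quoted above. A short sketch of the usage the 
documentation should cover:

{code:python}
from pyspark.ml.classification import RandomForestClassifier

# featureSubsetStrategy accepts "auto", "all", "onethird", "sqrt", "log2",
# or a number given as a string (a fraction in (0, 1.0] or a feature count).
rf = RandomForestClassifier(numTrees=50, featureSubsetStrategy="sqrt")
print(rf.getFeatureSubsetStrategy())  # sqrt
{code}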



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24609) PySpark/SparkR doc doesn't explain RandomForestClassifier.featureSubsetStrategy well

2018-07-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-24609:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> PySpark/SparkR doc doesn't explain 
> RandomForestClassifier.featureSubsetStrategy well
> 
>
> Key: SPARK-24609
> URL: https://issues.apache.org/jira/browse/SPARK-24609
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiangrui Meng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.4.0
>
>
> In Scala doc 
> ([https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.ml.classification.RandomForestClassifier)],
>  we have:
>  
> {quote}The number of features to consider for splits at each tree node. 
> Supported options:
>  * "auto": Choose automatically for task: If numTrees == 1, set to "all." If 
> numTrees > 1 (forest), set to "sqrt" for classification and to "onethird" for 
> regression.
>  * "all": use all features
>  * "onethird": use 1/3 of the features
>  * "sqrt": use sqrt(number of features)
>  * "log2": use log2(number of features)
>  * "n": when n is in the range (0, 1.0], use n * number of features. When n 
> is in the range (1, number of features), use n features. (default = "auto")
> These various settings are based on the following references:
>  * log2: tested in Breiman (2001)
>  * sqrt: recommended by Breiman manual for random forests
>  * The defaults of sqrt (classification) and onethird (regression) match the 
> R randomForest package.{quote}
>  
> The entire paragraph is missing in PySpark doc 
> ([https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.RandomForestClassifier.featureSubsetStrategy]).
>  And same issue for SparkR 
> (https://github.com/apache/spark/blob/master/R/pkg/R/mllib_tree.R#L365).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24609) PySpark/SparkR doc doesn't explain RandomForestClassifier.featureSubsetStrategy well

2018-07-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-24609:
-

Assignee: zhengruifeng

> PySpark/SparkR doc doesn't explain 
> RandomForestClassifier.featureSubsetStrategy well
> 
>
> Key: SPARK-24609
> URL: https://issues.apache.org/jira/browse/SPARK-24609
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiangrui Meng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 2.4.0
>
>
> In Scala doc 
> ([https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.ml.classification.RandomForestClassifier)],
>  we have:
>  
> {quote}The number of features to consider for splits at each tree node. 
> Supported options:
>  * "auto": Choose automatically for task: If numTrees == 1, set to "all." If 
> numTrees > 1 (forest), set to "sqrt" for classification and to "onethird" for 
> regression.
>  * "all": use all features
>  * "onethird": use 1/3 of the features
>  * "sqrt": use sqrt(number of features)
>  * "log2": use log2(number of features)
>  * "n": when n is in the range (0, 1.0], use n * number of features. When n 
> is in the range (1, number of features), use n features. (default = "auto")
> These various settings are based on the following references:
>  * log2: tested in Breiman (2001)
>  * sqrt: recommended by Breiman manual for random forests
>  * The defaults of sqrt (classification) and onethird (regression) match the 
> R randomForest package.{quote}
>  
> The entire paragraph is missing in PySpark doc 
> ([https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.RandomForestClassifier.featureSubsetStrategy]).
>  And same issue for SparkR 
> (https://github.com/apache/spark/blob/master/R/pkg/R/mllib_tree.R#L365).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24980) add support for pandas/arrow etc for python2.7 and pypy builds

2018-07-31 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564154#comment-16564154
 ] 

shane knapp commented on SPARK-24980:
-

[~hyukjin.kwon] thought you might be interested in this!  ;)

> add support for pandas/arrow etc for python2.7 and pypy builds
> --
>
> Key: SPARK-24980
> URL: https://issues.apache.org/jira/browse/SPARK-24980
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> Since we have full support for Python 3.4 via Anaconda, it's time to create 
> similar environments for Python 2.7 and PyPy 2.5.1.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12

2018-07-31 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564148#comment-16564148
 ] 

Stavros Kontopoulos edited comment on SPARK-14643 at 7/31/18 6:34 PM:
--

[~srowen] Right now there is no such change in the implementation. As noted in 
the description of the PR,

we are not going to fix this for now, and as [~lrytz] mentions in the doc:

"The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would 
continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume 
this would happen in a major release that is also allowed to break binary 
compatibility, so then the conflicting overload could be removed."

What is implemented is what is described in the action plan in the document.

 


was (Author: skonto):
[~srowen] Right now there is no such change in the implementation. As noted in 
the description of the PR,

we are not going to fix this, and as [~lrytz] mentions in the doc:

"The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would 
continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume 
this would happen in a major release that is also allowed to break binary 
compatibility, so then the conflicting overload could be removed."

What is implemented is what is described in the action plan in the document.

 

> Remove overloaded methods which become ambiguous in Scala 2.12
> --
>
> Key: SPARK-14643
> URL: https://issues.apache.org/jira/browse/SPARK-14643
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>
> Spark 1.x's Dataset API runs into subtle source incompatibility problems for 
> Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a 
> nutshell, the current API has overloaded methods whose signatures are 
> ambiguous when resolving calls that use the Java 8 lambda syntax (only if 
> Spark is built against Scala 2.12).
> This issue is somewhat subtle, so there's a full writeup at 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing
>  which describes the exact circumstances under which the current APIs are 
> problematic. The writeup also proposes a solution which involves the removal 
> of certain overloads only in Scala 2.12 builds of Spark and the introduction 
> of implicit conversions for retaining source compatibility.
> We don't need to implement any of these changes until we add Scala 2.12 
> support since the changes must only be applied when building against Scala 
> 2.12 and will be done via traits + shims which are mixed in via 
> per-Scala-version source directories (like how we handle the 
> Scala-version-specific parts of the REPL). For now, this JIRA acts as a 
> placeholder so that the parent JIRA reflects the complete set of tasks which 
> need to be finished for 2.12 support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12

2018-07-31 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564148#comment-16564148
 ] 

Stavros Kontopoulos edited comment on SPARK-14643 at 7/31/18 6:34 PM:
--

[~srowen] Right now there is no such change in the implementation. As noted in 
the description of the PR,

we are not going to fix this, and as [~lrytz] mentions in the doc:

"The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would 
continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume 
this would happen in a major release that is also allowed to break binary 
compatibility, so then the conflicting overload could be removed."

What is implemented is what is described in the action plan in the document.

 


was (Author: skonto):
[~srowen] Right now there is no such change in the implementation. As noted in 
the description of the PR,

we are not going to fix this, as [~lrytz] said:

"The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would 
continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume 
this would happen in a major release that is also allowed to break binary 
compatibility, so then the conflicting overload could be removed."

 

> Remove overloaded methods which become ambiguous in Scala 2.12
> --
>
> Key: SPARK-14643
> URL: https://issues.apache.org/jira/browse/SPARK-14643
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>
> Spark 1.x's Dataset API runs into subtle source incompatibility problems for 
> Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a 
> nutshell, the current API has overloaded methods whose signatures are 
> ambiguous when resolving calls that use the Java 8 lambda syntax (only if 
> Spark is built against Scala 2.12).
> This issue is somewhat subtle, so there's a full writeup at 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing
>  which describes the exact circumstances under which the current APIs are 
> problematic. The writeup also proposes a solution which involves the removal 
> of certain overloads only in Scala 2.12 builds of Spark and the introduction 
> of implicit conversions for retaining source compatibility.
> We don't need to implement any of these changes until we add Scala 2.12 
> support since the changes must only be applied when building against Scala 
> 2.12 and will be done via traits + shims which are mixed in via 
> per-Scala-version source directories (like how we handle the 
> Scala-version-specific parts of the REPL). For now, this JIRA acts as a 
> placeholder so that the parent JIRA reflects the complete set of tasks which 
> need to be finished for 2.12 support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12

2018-07-31 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564148#comment-16564148
 ] 

Stavros Kontopoulos commented on SPARK-14643:
-

[~srowen] Right now there is no such change in the implementation. As noted in 
the description of the PR,

we are not going to fix this, as [~lrytz] said:

"The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would 
continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume 
this would happen in a major release that is also allowed to break binary 
compatibility, so then the conflicting overload could be removed."

 

> Remove overloaded methods which become ambiguous in Scala 2.12
> --
>
> Key: SPARK-14643
> URL: https://issues.apache.org/jira/browse/SPARK-14643
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>
> Spark 1.x's Dataset API runs into subtle source incompatibility problems for 
> Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a 
> nutshell, the current API has overloaded methods whose signatures are 
> ambiguous when resolving calls that use the Java 8 lambda syntax (only if 
> Spark is built against Scala 2.12).
> This issue is somewhat subtle, so there's a full writeup at 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing
>  which describes the exact circumstances under which the current APIs are 
> problematic. The writeup also proposes a solution which involves the removal 
> of certain overloads only in Scala 2.12 builds of Spark and the introduction 
> of implicit conversions for retaining source compatibility.
> We don't need to implement any of these changes until we add Scala 2.12 
> support since the changes must only be applied when building against Scala 
> 2.12 and will be done via traits + shims which are mixed in via 
> per-Scala-version source directories (like how we handle the 
> Scala-version-specific parts of the REPL). For now, this JIRA acts as a 
> placeholder so that the parent JIRA reflects the complete set of tasks which 
> need to be finished for 2.12 support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24773) support reading AVRO logical types - Timestamp with different precisions

2018-07-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564145#comment-16564145
 ] 

Apache Spark commented on SPARK-24773:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21935

> support reading AVRO logical types - Timestamp with different precisions
> 
>
> Key: SPARK-24773
> URL: https://issues.apache.org/jira/browse/SPARK-24773
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24773) support reading AVRO logical types - Timestamp with different precisions

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24773:


Assignee: (was: Apache Spark)

> support reading AVRO logical types - Timestamp with different precisions
> 
>
> Key: SPARK-24773
> URL: https://issues.apache.org/jira/browse/SPARK-24773
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24773) support reading AVRO logical types - Timestamp with different precisions

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24773:


Assignee: Apache Spark

> support reading AVRO logical types - Timestamp with different precisions
> 
>
> Key: SPARK-24773
> URL: https://issues.apache.org/jira/browse/SPARK-24773
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24973) Add numIter to Python ClusteringSummary

2018-07-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24973.
---
Resolution: Duplicate

I agree, though I think this should be thought of as one big issue.

> Add numIter to Python ClusteringSummary 
> 
>
> Key: SPARK-24973
> URL: https://issues.apache.org/jira/browse/SPARK-24973
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> -SPARK-23528- added numIter to ClusteringSummary. This ticket will add 
> numIter to the Python version of ClusteringSummary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24981) ShutdownHook timeout causes job to fail when succeeded when SparkContext stop() not called by user program

2018-07-31 Thread Hieu Tri Huynh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hieu Tri Huynh updated SPARK-24981:
---
Priority: Minor  (was: Major)

> ShutdownHook timeout causes job to fail when succeeded when SparkContext 
> stop() not called by user program
> --
>
> Key: SPARK-24981
> URL: https://issues.apache.org/jira/browse/SPARK-24981
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Hieu Tri Huynh
>Priority: Minor
>
> When the user does not stop the SparkContext at the end of their program, 
> ShutdownHookManager will stop it. However, each shutdown hook is only given 
> 10s to run; it is interrupted and cancelled after that. If stopping the 
> SparkContext takes longer than 10s, an InterruptedException is thrown and the 
> job is marked as failed even though it had already succeeded. An example of 
> this is shown below.
> I think there are a few ways to fix this; below are the two I have now:
> 1. After the user program finishes, check whether it stopped the SparkContext. 
> If it didn't, stop it before finishing the userThread. That way 
> SparkContext.stop() can take as much time as it needs.
> 2. Catch the InterruptedException thrown by ShutdownHookManager while we are 
> stopping the SparkContext and ignore whatever hasn't been stopped inside it. 
> Since we are shutting down, I think it is okay to ignore those things.
>  
> {code:java}
> 18/07/31 17:11:49 ERROR Utils: Uncaught exception in thread pool-4-thread-1
> java.lang.InterruptedException
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Thread.join(Thread.java:1249)
>   at java.lang.Thread.join(Thread.java:1323)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:136)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1922)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1921)
>   at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 18/07/31 17:11:49 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, 
> java.util.concurrent.TimeoutException
> java.util.concurrent.TimeoutException
>   at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>   at 
> 

[jira] [Updated] (SPARK-24981) ShutdownHook timeout causes job to fail when succeeded when SparkContext stop() not called by user program

2018-07-31 Thread Hieu Tri Huynh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hieu Tri Huynh updated SPARK-24981:
---
Summary: ShutdownHook timeout causes job to fail when succeeded when 
SparkContext stop() not called by user program  (was: ShutdownHook timeout 
causes job to fail when succeeded when SparkContext stop() not called)

> ShutdownHook timeout causes job to fail when succeeded when SparkContext 
> stop() not called by user program
> --
>
> Key: SPARK-24981
> URL: https://issues.apache.org/jira/browse/SPARK-24981
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Hieu Tri Huynh
>Priority: Major
>
> When the user does not stop the SparkContext at the end of their program, 
> ShutdownHookManager will stop it. However, each shutdown hook is only given 
> 10s to run; it is interrupted and cancelled after that. If stopping the 
> SparkContext takes longer than 10s, an InterruptedException is thrown and the 
> job is marked as failed even though it had already succeeded. An example of 
> this is shown below.
> I think there are a few ways to fix this; below are the two I have now:
> 1. After the user program finishes, check whether it stopped the SparkContext. 
> If it didn't, stop it before finishing the userThread. That way 
> SparkContext.stop() can take as much time as it needs.
> 2. Catch the InterruptedException thrown by ShutdownHookManager while we are 
> stopping the SparkContext and ignore whatever hasn't been stopped inside it. 
> Since we are shutting down, I think it is okay to ignore those things.
>  
> {code:java}
> 18/07/31 17:11:49 ERROR Utils: Uncaught exception in thread pool-4-thread-1
> java.lang.InterruptedException
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Thread.join(Thread.java:1249)
>   at java.lang.Thread.join(Thread.java:1323)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:136)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1922)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1921)
>   at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 18/07/31 17:11:49 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, 
> java.util.concurrent.TimeoutException
> java.util.concurrent.TimeoutException
>   at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
> {code}

[jira] [Created] (SPARK-24981) ShutdownHook timeout causes job to fail when succeeded when SparkContext stop() not called

2018-07-31 Thread Hieu Tri Huynh (JIRA)
Hieu Tri Huynh created SPARK-24981:
--

 Summary: ShutdownHook timeout causes job to fail when succeeded 
when SparkContext stop() not called
 Key: SPARK-24981
 URL: https://issues.apache.org/jira/browse/SPARK-24981
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Hieu Tri Huynh


When the user does not stop the SparkContext at the end of their program, 
ShutdownHookManager will stop it. However, each shutdown hook is only given 10s 
to run; it will be interrupted and cancelled after that. If stopping the 
SparkContext takes longer than 10s, an InterruptedException will be thrown and 
the job will fail even though it had already succeeded. An example of this is 
shown below.

I think there are a few ways to fix this; below are the two that I have now: 

1. After the user program has finished, we can check whether it stopped the 
SparkContext. If it did not, we can stop the SparkContext before finishing the 
userThread. By doing so, SparkContext.stop() can take as much time as it needs.

2. We can catch the InterruptedException thrown by ShutdownHookManager while we 
are stopping the SparkContext and ignore whatever we have not stopped inside 
the SparkContext yet. Since we are shutting down, I think it is okay to ignore 
those things.

 
{code:java}
18/07/31 17:11:49 ERROR Utils: Uncaught exception in thread pool-4-thread-1
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1249)
at java.lang.Thread.join(Thread.java:1323)
at 
org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:136)
at 
org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
at 
org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at 
org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219)
at 
org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1922)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1921)
at 
org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573)
at 
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
18/07/31 17:11:49 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, 
java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask.get(FutureTask.java:205)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
{code}
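A minimal sketch of the first option above, assuming a job written against the 
public SparkSession API (the object name and workload are made up for 
illustration): stopping the SparkContext explicitly on the user thread keeps 
the stop outside the 10s shutdown-hook budget.

{code:java}
// Hedged sketch of workaround 1: stop the SparkContext explicitly so the
// shutdown hook never has to. Job name and workload are invented.
import org.apache.spark.sql.SparkSession

object ExampleJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("example-job").getOrCreate()
    try {
      spark.range(1000000L).count() // the actual work of the job
    } finally {
      // Runs on the user thread, so it is not bounded by the 10s hook timeout.
      spark.stop()
    }
  }
}
{code}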



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24287) Spark -packages option should support classifier, no-transitive, and custom conf

2018-07-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-24287:
--
Issue Type: Improvement  (was: Bug)

This covers a lot of the same ground as 
https://issues.apache.org/jira/browse/SPARK-20075

> Spark -packages option should support classifier, no-transitive, and custom 
> conf
> 
>
> Key: SPARK-24287
> URL: https://issues.apache.org/jira/browse/SPARK-24287
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> We should extend Spark's -packages option to support:
>  # Turning off transitive dependency resolution for a given artifact (like spark-avro)
>  # Resolving a given artifact with a classifier (like avro-mapred-1.7.4-h2.jar)
>  # Resolving a given artifact with a custom ivy conf
>  # Excluding particular transitive dependencies of a given artifact. We 
> currently only have a top-level exclusion rule that applies to all artifacts.
> We have tested this patch internally, and it greatly increases the flexibility 
> when a user uses the -packages option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24951) Table valued functions should throw AnalysisException instead of IllegalArgumentException

2018-07-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564075#comment-16564075
 ] 

Apache Spark commented on SPARK-24951:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21934

> Table valued functions should throw AnalysisException instead of 
> IllegalArgumentException
> -
>
> Key: SPARK-24951
> URL: https://issues.apache.org/jira/browse/SPARK-24951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> When arguments don't match, TVFs currently throw IllegalArgumentException, 
> inconsistent with other functions, which throw AnalysisException.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24951) Table valued functions should throw AnalysisException instead of IllegalArgumentException

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24951:


Assignee: Reynold Xin  (was: Apache Spark)

> Table valued functions should throw AnalysisException instead of 
> IllegalArgumentException
> -
>
> Key: SPARK-24951
> URL: https://issues.apache.org/jira/browse/SPARK-24951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> When arguments don't match, TVFs currently throw IllegalArgumentException, 
> inconsistent with other functions, which throw AnalysisException.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24951) Table valued functions should throw AnalysisException instead of IllegalArgumentException

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24951:


Assignee: Apache Spark  (was: Reynold Xin)

> Table valued functions should throw AnalysisException instead of 
> IllegalArgumentException
> -
>
> Key: SPARK-24951
> URL: https://issues.apache.org/jira/browse/SPARK-24951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Major
>
> When arguments don't match, TVFs currently throw IllegalArgumentException, 
> inconsistent with other functions, which throw AnalysisException.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24971) remove SupportsDeprecatedScanRow

2018-07-31 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24971:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-22386

> remove SupportsDeprecatedScanRow
> 
>
> Key: SPARK-24971
> URL: https://issues.apache.org/jira/browse/SPARK-24971
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-31 Thread andrzej.stankev...@gmail.com (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrzej.stankev...@gmail.com updated SPARK-24974:
-
Description: 
SharedInMemoryCache holds the FileStatus of every file, no matter whether you 
specify partition columns or not. This causes long load times for queries that 
use only a couple of partitions, because Spark loads the file paths from all 
partitions.

I partitioned the files by *report_date* and *type*, and I have a directory 
structure like 
{code:java}
/custom_path/report_date=2018-07-24/type=A/file_1.parquet
{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count
{code}
 

In my query I need to load only the files of type *A*, which is just a couple 
of files. But Spark loads all 19K files from all partitions into 
SharedInMemoryCache, which takes about 60 seconds, and only after that prunes 
the unused partitions. 

 

This could be related to [https://jira.apache.org/jira/browse/SPARK-17994] 

  was:
SharedInMemoryCache holds the FileStatus of every file, no matter whether you 
specify partition columns or not. This causes long load times for queries that 
use only a couple of partitions, because Spark loads the file paths from all 
partitions.

I partitioned the files by *report_date* and *type*, and I have a directory 
structure like 
{code:java}
/custom_path/report_date=2018-07-24/type=A/file_1.parquet
{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count
{code}
 

In my query I need to load only the files of type *A*, which is just a couple 
of files. But Spark loads all 19K files from all partitions into 
SharedInMemoryCache, which takes about 60 seconds, and only after that prunes 
the unused partitions. 


> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: andrzej.stankev...@gmail.com
>Priority: Major
>
> SharedInMemoryCache holds the FileStatus of every file, no matter whether you 
> specify partition columns or not. This causes long load times for queries 
> that use only a couple of partitions, because Spark loads the file paths from 
> all partitions.
> I partitioned the files by *report_date* and *type*, and I have a directory 
> structure like 
> {code:java}
> /custom_path/report_date=2018-07-24/type=A/file_1.parquet
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> In my query I need to load only the files of type *A*, which is just a couple 
> of files. But Spark loads all 19K files from all partitions into 
> SharedInMemoryCache, which takes about 60 seconds, and only after that prunes 
> the unused partitions. 
>  
> This could be related to [https://jira.apache.org/jira/browse/SPARK-17994] 
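A hedged workaround sketch, not a fix for the listing itself (paths reuse the 
example above; whether this fits depends on the job): pointing the reader 
directly at the needed partition directory avoids listing the other *type* 
partitions, and the basePath option keeps the partition columns in the schema.

{code:java}
// Assumed workaround: list only the needed partition directory.
// basePath keeps report_date/type available as columns even though a leaf dir is read.
val count = spark.read
  .option("basePath", "/custom_path")
  .parquet("/custom_path/report_date=2018-07-24/type=A")
  .count()
{code}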



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24980) add support for pandas/arrow etc for python2.7 and pypy builds

2018-07-31 Thread shane knapp (JIRA)
shane knapp created SPARK-24980:
---

 Summary: add support for pandas/arrow etc for python2.7 and pypy 
builds
 Key: SPARK-24980
 URL: https://issues.apache.org/jira/browse/SPARK-24980
 Project: Spark
  Issue Type: Improvement
  Components: Build, PySpark
Affects Versions: 2.3.1
Reporter: shane knapp
Assignee: shane knapp


Since we have full support for Python 3.4 via Anaconda, it's time to create 
similar environments for Python 2.7 and PyPy 2.5.1.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12

2018-07-31 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563998#comment-16563998
 ] 

Sean Owen commented on SPARK-14643:
---

[~skonto] is the conclusion that a change involving the ACC_SYNTHETIC flag will 
resolve this issue? It seems OK to me to implement, if it works. That would be 
the last remaining item open for 2.12 support.

> Remove overloaded methods which become ambiguous in Scala 2.12
> --
>
> Key: SPARK-14643
> URL: https://issues.apache.org/jira/browse/SPARK-14643
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>
> Spark 1.x's Dataset API runs into subtle source incompatibility problems for 
> Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a 
> nutshell, the current API has overloaded methods whose signatures are 
> ambiguous when resolving calls that use the Java 8 lambda syntax (only if 
> Spark is build against Scala 2.12).
> This issue is somewhat subtle, so there's a full writeup at 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing
>  which describes the exact circumstances under which the current APIs are 
> problematic. The writeup also proposes a solution which involves the removal 
> of certain overloads only in Scala 2.12 builds of Spark and the introduction 
> of implicit conversions for retaining source compatibility.
> We don't need to implement any of these changes until we add Scala 2.12 
> support since the changes must only be applied when building against Scala 
> 2.12 and will be done via traits + shims which are mixed in via 
> per-Scala-version source directories (like how we handle the 
> Scala-version-specific parts of the REPL). For now, this JIRA acts as a 
> placeholder so that the parent JIRA reflects the complete set of tasks which 
> need to be finished for 2.12 support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner

2018-07-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14540:
--
Labels: release-notes  (was: )

> Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
> 
>
> Key: SPARK-14540
> URL: https://issues.apache.org/jira/browse/SPARK-14540
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>Priority: Major
>  Labels: release-notes
>
> Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running 
> ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures:
> {code}
> [info] - toplevel return statements in closures are identified at cleaning 
> time *** FAILED *** (32 milliseconds)
> [info]   Expected exception 
> org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no 
> exception was thrown. (ClosureCleanerSuite.scala:57)
> {code}
> and
> {code}
> [info] - user provided closures are actually cleaned *** FAILED *** (56 
> milliseconds)
> [info]   Expected ReturnStatementInClosureException, but got 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not 
> serializable: java.io.NotSerializableException: java.lang.Object
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class 
> org.apache.spark.util.TestUserClosuresActuallyCleaned$, 
> functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I,
>  implementation=invokeStatic 
> org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I,
>  instantiatedMethodType=(I)I, numCaptured=1])
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, 
> functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  
> instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  numCaptured=1])
> [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: 
> "f", type: "interface scala.Function3")
> [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", 
> MapPartitionsRDD[2] at apply at Transformer.scala:22)
> [info]- field (class "scala.Tuple2", name: "_1", type: "class 
> java.lang.Object")
> [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at 
> apply at 
> Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)).
> [info]   This means the closure provided by user is not actually cleaned. 
> (ClosureCleanerSuite.scala:78)
> {code}
> We'll need to figure out a closure cleaning strategy which works for 2.12 
> lambdas.
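For context, a tiny sketch (class and method names invented) of the kind of 
closure the cleaner has to handle; under 2.12 the lambda compiles to a 
SerializedLambda whose captured arguments can drag in a non-serializable 
enclosing instance, much like the failure above.

{code:java}
import org.apache.spark.rdd.RDD

// Not Serializable: using `factor` inside the lambda captures `this`,
// which is exactly the kind of reference ClosureCleaner has to deal with.
class Scaler(factor: Int) {
  def scale(rdd: RDD[Int]): RDD[Int] = rdd.map(_ * factor)
}
{code}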



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24917) Sending a partition over netty results in 2x memory usage

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24917:


Assignee: (was: Apache Spark)

> Sending a partition over netty results in 2x memory usage
> -
>
> Key: SPARK-24917
> URL: https://issues.apache.org/jira/browse/SPARK-24917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: Vincent
>Priority: Major
>
> Hello,
> while investigating some OOM errors in Spark 2.2 [(here's my call 
> stack)|https://image.ibb.co/hHa2R8/sparkOOM.png], I found the following 
> behavior, which I think is odd:
>  * a request happens to send a partition over the network
>  * this partition is 1.9 GB and is persisted in memory
>  * this partition is apparently stored in a ByteBufferBlockData that is made 
> of a ChunkedByteBuffer, which is a list of (lots of) 4 MB ByteBuffers
>  * the call to toNetty() is supposed to only wrap all the arrays and not 
> allocate any memory
>  * yet the call stack shows that Netty is allocating memory and trying to 
> consolidate all the chunks into one big 1.9 GB array
>  * this means that at this point the memory footprint is 2x the size of the 
> actual partition (which is huge when the partition is 1.9 GB)
> Is this transient allocation expected?
> After digging, it turns out that the actual copy is due to [this 
> method|https://github.com/netty/netty/blob/4.0/buffer/src/main/java/io/netty/buffer/Unpooled.java#L260]
>  in Netty. If my initial buffer is made of more than DEFAULT_MAX_COMPONENTS 
> (16) components, it will trigger a re-allocation of the whole buffer. This 
> Netty issue was fixed in this recent change: 
> [https://github.com/netty/netty/commit/9b95b8ee628983e3e4434da93fffb893edff4aa2]
>  
> As a result, is it possible to benefit from this change somehow in Spark 2.2 
> and above? I don't know how the Netty dependencies are handled for Spark.
>  
> NB: it seems this ticket: [https://jira.apache.org/jira/browse/SPARK-24307] 
> somewhat changed the approach for Spark 2.4 by bypassing the Netty buffer 
> altogether. However, as written in the ticket, that approach *still* needs 
> the *entire* block serialized in memory, so it would be a downgrade compared 
> to fixing the Netty issue when your buffer is < 2 GB.
>  
> Thanks!
>  
>  
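To make the consolidation concrete, a minimal standalone sketch (chunk count 
and sizes are assumptions, and the comments describe the pre-fix Netty 4.0 
behaviour referenced above):

{code:java}
import java.nio.ByteBuffer
import io.netty.buffer.Unpooled

// 32 chunks of 4 MB, mimicking a ChunkedByteBuffer (sizes are made up).
val chunks: Array[ByteBuffer] = Array.fill(32)(ByteBuffer.allocate(4 * 1024 * 1024))

// With the default maxNumComponents (16), 32 components trigger consolidation, i.e. a copy.
val consolidated = Unpooled.wrappedBuffer(chunks: _*)

// Passing a large enough maxNumComponents keeps the chunks merely wrapped.
val wrapped = Unpooled.wrappedBuffer(chunks.length, chunks: _*)
{code}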



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24917) Sending a partition over netty results in 2x memory usage

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24917:


Assignee: Apache Spark

> Sending a partition over netty results in 2x memory usage
> -
>
> Key: SPARK-24917
> URL: https://issues.apache.org/jira/browse/SPARK-24917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: Vincent
>Assignee: Apache Spark
>Priority: Major
>
> Hello,
> while investigating some OOM errors in Spark 2.2 [(here's my call 
> stack)|https://image.ibb.co/hHa2R8/sparkOOM.png], I found the following 
> behavior, which I think is odd:
>  * a request happens to send a partition over the network
>  * this partition is 1.9 GB and is persisted in memory
>  * this partition is apparently stored in a ByteBufferBlockData that is made 
> of a ChunkedByteBuffer, which is a list of (lots of) 4 MB ByteBuffers
>  * the call to toNetty() is supposed to only wrap all the arrays and not 
> allocate any memory
>  * yet the call stack shows that Netty is allocating memory and trying to 
> consolidate all the chunks into one big 1.9 GB array
>  * this means that at this point the memory footprint is 2x the size of the 
> actual partition (which is huge when the partition is 1.9 GB)
> Is this transient allocation expected?
> After digging, it turns out that the actual copy is due to [this 
> method|https://github.com/netty/netty/blob/4.0/buffer/src/main/java/io/netty/buffer/Unpooled.java#L260]
>  in Netty. If my initial buffer is made of more than DEFAULT_MAX_COMPONENTS 
> (16) components, it will trigger a re-allocation of the whole buffer. This 
> Netty issue was fixed in this recent change: 
> [https://github.com/netty/netty/commit/9b95b8ee628983e3e4434da93fffb893edff4aa2]
>  
> As a result, is it possible to benefit from this change somehow in Spark 2.2 
> and above? I don't know how the Netty dependencies are handled for Spark.
>  
> NB: it seems this ticket: [https://jira.apache.org/jira/browse/SPARK-24307] 
> somewhat changed the approach for Spark 2.4 by bypassing the Netty buffer 
> altogether. However, as written in the ticket, that approach *still* needs 
> the *entire* block serialized in memory, so it would be a downgrade compared 
> to fixing the Netty issue when your buffer is < 2 GB.
>  
> Thanks!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24917) Sending a partition over netty results in 2x memory usage

2018-07-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563918#comment-16563918
 ] 

Apache Spark commented on SPARK-24917:
--

User 'vincent-grosbois' has created a pull request for this issue:
https://github.com/apache/spark/pull/21933

> Sending a partition over netty results in 2x memory usage
> -
>
> Key: SPARK-24917
> URL: https://issues.apache.org/jira/browse/SPARK-24917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: Vincent
>Priority: Major
>
> Hello,
> while investigating some OOM errors in Spark 2.2 [(here's my call 
> stack)|https://image.ibb.co/hHa2R8/sparkOOM.png], I found the following 
> behavior, which I think is odd:
>  * a request happens to send a partition over the network
>  * this partition is 1.9 GB and is persisted in memory
>  * this partition is apparently stored in a ByteBufferBlockData that is made 
> of a ChunkedByteBuffer, which is a list of (lots of) 4 MB ByteBuffers
>  * the call to toNetty() is supposed to only wrap all the arrays and not 
> allocate any memory
>  * yet the call stack shows that Netty is allocating memory and trying to 
> consolidate all the chunks into one big 1.9 GB array
>  * this means that at this point the memory footprint is 2x the size of the 
> actual partition (which is huge when the partition is 1.9 GB)
> Is this transient allocation expected?
> After digging, it turns out that the actual copy is due to [this 
> method|https://github.com/netty/netty/blob/4.0/buffer/src/main/java/io/netty/buffer/Unpooled.java#L260]
>  in Netty. If my initial buffer is made of more than DEFAULT_MAX_COMPONENTS 
> (16) components, it will trigger a re-allocation of the whole buffer. This 
> Netty issue was fixed in this recent change: 
> [https://github.com/netty/netty/commit/9b95b8ee628983e3e4434da93fffb893edff4aa2]
>  
> As a result, is it possible to benefit from this change somehow in Spark 2.2 
> and above? I don't know how the Netty dependencies are handled for Spark.
>  
> NB: it seems this ticket: [https://jira.apache.org/jira/browse/SPARK-24307] 
> somewhat changed the approach for Spark 2.4 by bypassing the Netty buffer 
> altogether. However, as written in the ticket, that approach *still* needs 
> the *entire* block serialized in memory, so it would be a downgrade compared 
> to fixing the Netty issue when your buffer is < 2 GB.
>  
> Thanks!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24979) add AnalysisHelper#resolveOperatorsUp

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24979:


Assignee: Apache Spark  (was: Wenchen Fan)

> add AnalysisHelper#resolveOperatorsUp
> -
>
> Key: SPARK-24979
> URL: https://issues.apache.org/jira/browse/SPARK-24979
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24979) add AnalysisHelper#resolveOperatorsUp

2018-07-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563896#comment-16563896
 ] 

Apache Spark commented on SPARK-24979:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21932

> add AnalysisHelper#resolveOperatorsUp
> -
>
> Key: SPARK-24979
> URL: https://issues.apache.org/jira/browse/SPARK-24979
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24979) add AnalysisHelper#resolveOperatorsUp

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24979:


Assignee: Wenchen Fan  (was: Apache Spark)

> add AnalysisHelper#resolveOperatorsUp
> -
>
> Key: SPARK-24979
> URL: https://issues.apache.org/jira/browse/SPARK-24979
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24979) add AnalysisHelper#resolveOperatorsUp

2018-07-31 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24979:

Summary: add AnalysisHelper#resolveOperatorsUp  (was: add 
resolveOperatorsUp)

> add AnalysisHelper#resolveOperatorsUp
> -
>
> Key: SPARK-24979
> URL: https://issues.apache.org/jira/browse/SPARK-24979
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24979) add resolveOperatorsUp

2018-07-31 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-24979:
---

 Summary: add resolveOperatorsUp
 Key: SPARK-24979
 URL: https://issues.apache.org/jira/browse/SPARK-24979
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24536) Query with nonsensical LIMIT hits AssertionError

2018-07-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24536.
-
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.2

> Query with nonsensical LIMIT hits AssertionError
> 
>
> Key: SPARK-24536
> URL: https://issues.apache.org/jira/browse/SPARK-24536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alexander Behm
>Priority: Trivial
>  Labels: beginner, spree
> Fix For: 2.3.2, 2.4.0
>
>
> SELECT COUNT(1) FROM t LIMIT CAST(NULL AS INT)
> fails in the QueryPlanner with:
> {code}
> java.lang.AssertionError: assertion failed: No plan for GlobalLimit null
> {code}
> I think this issue should be caught earlier during semantic analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-31 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563730#comment-16563730
 ] 

Saisai Shao edited comment on SPARK-24615 at 7/31/18 2:16 PM:
--

Hi [~tgraves], I think eval() might unnecessarily break a lineage that could 
execute in one stage. For example, data transforming -> training -> 
transforming could possibly run in one stage; using eval will break it into 
several stages, and I'm not sure that is a good choice. Also, if we use eval to 
break the lineage, how do we store the intermediate data: like shuffle, or in 
memory / on disk?

Yes, how to break the boundaries is hard for the user to know, but currently I 
cannot figure out a good solution unless we use eval() to explicitly separate 
them. To resolve conflicts, failing might be one choice. In the SQL or DF area, 
I don't think we have to expose such low-level RDD APIs to the user; maybe some 
hints should be enough (though I haven't thought about it).

Currently in my design, withResources only applies to the stage in which the 
RDD will be executed; the following stages will still be ordinary stages 
without additional resources.


was (Author: jerryshao):
Hi [~tgraves], I think eval() might unnecessarily break a lineage that could 
execute in one stage. For example, data transforming -> training -> 
transforming could possibly run in one stage; using eval will break it into 
several stages, and I'm not sure that is a good choice. Also, if we use eval to 
break the lineage, how do we store the intermediate data: like shuffle, or in 
memory / on disk?

Yes, how to break the boundaries is hard for the user to know, but currently I 
cannot figure out a good solution unless we use eval() to explicitly separate 
them. To resolve conflicts, failing might be one choice.

Currently in my design, withResources only applies to the stage in which the 
RDD will be executed; the following stages will still be ordinary stages 
without additional resources.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when 
> scheduling tasks that require accelerators.
>  # There are usually more CPU cores than accelerators on one node, so using 
> CPU cores to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we always assume that a CPU is available in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-31 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563730#comment-16563730
 ] 

Saisai Shao commented on SPARK-24615:
-

Hi [~tgraves], I think eval() might unnecessarily break a lineage that could 
execute in one stage. For example, data transforming -> training -> 
transforming could possibly run in one stage; using eval will break it into 
several stages, and I'm not sure that is a good choice. Also, if we use eval to 
break the lineage, how do we store the intermediate data: like shuffle, or in 
memory / on disk?

Yes, how to break the boundaries is hard for the user to know, but currently I 
cannot figure out a good solution unless we use eval() to explicitly separate 
them. To resolve conflicts, failing might be one choice.

Currently in my design, withResources only applies to the stage in which the 
RDD will be executed; the following stages will still be ordinary stages 
without additional resources.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when 
> scheduling tasks that require accelerators.
>  # There are usually more CPU cores than accelerators on one node, so using 
> CPU cores to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we always assume that a CPU is available in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-31 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563711#comment-16563711
 ] 

Thomas Graves commented on SPARK-24615:
---

So I guess my question is whether this is the right approach at all. Should we 
make it more obvious to the user where the boundaries end by adding something 
like eval()? I guess they could do what they do for caching today, which is to 
force an action like count() to force the evaluation.

If we are making the user add their own action to evaluate, I would pick 
failing if there are conflicting resources. That way it's completely obvious to 
the user that they might not be getting what they want. The hard part then is 
if the user doesn't know how to resolve that, because the way Spark decides to 
pick its stages might not be obvious to them. Especially on the SQL side, but 
perhaps we aren't going to support that right now, or only support it via 
hints?

It's still not clear to me from the above: are we saying withResources applies 
to just the immediately next stage, or does it stick until they change it 
again?

 

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when 
> scheduling tasks that require accelerators.
>  # There are usually more CPU cores than accelerators on one node, so using 
> CPU cores to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we always assume that a CPU is available in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2018-07-31 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563689#comment-16563689
 ] 

Thomas Graves commented on SPARK-24579:
---

Going from the "Spark feeds data into DL/AI frameworks for training" section, 
where we have an example using train() with TensorFlow:

def train(batches):

# create a TensorFlow Dataset from Arrow batches (see next section)

I assume the section that matches this would be the "Spark supports Tensor data 
types and TensorFlow computation graphs" section?

It would be nice to finish that example to be more specific about how that 
would work. Do you know how often the TensorFlow API has changed or broken, or 
whether there are different versions with different APIs? It definitely seems 
safer to have this as some sort of plugin, but I haven't looked at the 
specifics.

Do we have examples of how this would work with frameworks other than 
TensorFlow? Sorry if I missed it, but I haven't seen whether we plan on 
supporting specific ones other than TensorFlow.

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> 
>
> Key: SPARK-24579
> URL: https://issues.apache.org/jira/browse/SPARK-24579
> Project: Spark
>  Issue Type: Epic
>  Components: ML, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen
> Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange 
> between Apache Spark and DL%2FAI Frameworks .pdf
>
>
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark 
> as a unified
> analytics engine and the rise of AI frameworks like TensorFlow and Apache 
> MXNet (incubating).
> Both big data and AI are indispensable components to drive business 
> innovation and there have
> been multiple attempts from both communities to bring them together.
> We saw efforts from AI community to implement data solutions for AI 
> frameworks like tf.data and tf.Transform. However, with 50+ data sources and 
> built-in SQL, DataFrames, and Streaming features, Spark remains the community 
> choice for big data. This is why we saw many efforts to integrate DL/AI 
> frameworks with Spark to leverage its power, for example, TFRecords data 
> source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project 
> Hydrogen, this SPIP takes a different angle at Spark + AI unification.
> None of the integrations are possible without exchanging data between Spark 
> and external DL/AI frameworks. And the performance matters. However, there 
> doesn’t exist a standard way to exchange data and hence implementation and 
> performance optimization fall into pieces. For example, TensorFlowOnSpark 
> uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and 
> save data and pass the RDD records to TensorFlow in Python. And TensorFrames 
> converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s 
> Java API. How can we reduce the complexity?
> The proposal here is to standardize the data exchange interface (or format) 
> between Spark and DL/AI frameworks and optimize data conversion from/to this 
> interface.  So DL/AI frameworks can leverage Spark to load data virtually 
> from anywhere without spending extra effort building complex data solutions, 
> like reading features from a production data warehouse or streaming model 
> inference. Spark users can use DL/AI frameworks without learning specific 
> data APIs implemented there. And developers from both sides can work on 
> performance optimizations independently given the interface itself doesn’t 
> introduce big overhead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24816) SQL interface support repartitionByRange

2018-07-31 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-24816.
-
Resolution: Won't Fix

{{Order by}} is implemented by {{rangepartitioning}}.

> SQL interface support repartitionByRange
> 
>
> Key: SPARK-24816
> URL: https://issues.apache.org/jira/browse/SPARK-24816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: DISTRIBUTE_BY_SORT_BY.png, 
> RANGE_DISTRIBUTE_BY_SORT_BY.png
>
>
> The SQL interface should support {{repartitionByRange}} to improve data 
> pushdown. I have tested this feature with a big table (data size: 1.1 T, row 
> count: 282,001,954,428).
> The test sql is:
> {code:sql}
> select * from table where id=401564838907
> {code}
> The test result:
> |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
> MB-seconds|
> |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
> |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
> |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
> |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
> |RANGE PARTITION BY |38.5 GB|75355144|45 min|13 s|14525275297|
> |RANGE PARTITION BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|
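For comparison, a hedged sketch of what the existing Dataset API already offers 
(paths are invented; the column comes from the test query): the RANGE PARTITION 
BY + SORT BY layout in the table above corresponds roughly to the following 
before writing the table.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

spark.read.parquet("/path/to/source_table")          // assumed input path
  .repartitionByRange($"id")                         // range-partition by the filtered column
  .sortWithinPartitions($"id")                       // sort inside each output partition
  .write.parquet("/path/to/range_partitioned_table") // assumed output path
{code}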



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24978) Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation.

2018-07-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563487#comment-16563487
 ] 

Apache Spark commented on SPARK-24978:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21931

> Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity 
> of fast aggregation.
> -
>
> Key: SPARK-24978
> URL: https://issues.apache.org/jira/browse/SPARK-24978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: caoxuewen
>Priority: Major
>
> This PR adds a configuration parameter to configure the capacity of the fast 
> hash aggregation. 
> Performance comparison:
>     /*
>     Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
>     Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
>     Aggregate w multiple keys:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
>     -------------------------------------------------------------------------------------
>     fasthash = default                 5612 / 5882          3.7          267.6        1.0X
>     fasthash = config                  3586 / 3595          5.8          171.0        1.6X
>     */



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24978) Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation.

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24978:


Assignee: (was: Apache Spark)

> Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity 
> of fast aggregation.
> -
>
> Key: SPARK-24978
> URL: https://issues.apache.org/jira/browse/SPARK-24978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: caoxuewen
>Priority: Major
>
> This PR adds a configuration parameter to configure the capacity of the fast 
> hash aggregation. 
> Performance comparison:
>     /*
>     Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
>     Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
>     Aggregate w multiple keys:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
>     -------------------------------------------------------------------------------------
>     fasthash = default                 5612 / 5882          3.7          267.6        1.0X
>     fasthash = config                  3586 / 3595          5.8          171.0        1.6X
>     */



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24978) Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation.

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24978:


Assignee: Apache Spark

> Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity 
> of fast aggregation.
> -
>
> Key: SPARK-24978
> URL: https://issues.apache.org/jira/browse/SPARK-24978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: caoxuewen
>Assignee: Apache Spark
>Priority: Major
>
> This PR adds a configuration parameter to configure the capacity of the fast 
> hash aggregation. 
> Performance comparison:
>     /*
>     Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
>     Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
>     Aggregate w multiple keys:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
>     -------------------------------------------------------------------------------------
>     fasthash = default                 5612 / 5882          3.7          267.6        1.0X
>     fasthash = config                  3586 / 3595          5.8          171.0        1.6X
>     */



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24978) Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation.

2018-07-31 Thread caoxuewen (JIRA)
caoxuewen created SPARK-24978:
-

 Summary: Add spark.sql.fast.hash.aggregate.row.max.capacity to 
configure the capacity of fast aggregation.
 Key: SPARK-24978
 URL: https://issues.apache.org/jira/browse/SPARK-24978
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 3.0.0
Reporter: caoxuewen


This PR adds a configuration parameter to configure the capacity of the fast 
hash aggregation. 
Performance comparison:
    /*
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
    Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
    Aggregate w multiple keys:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
    -------------------------------------------------------------------------------------
    fasthash = default                 5612 / 5882          3.7          267.6        1.0X
    fasthash = config                  3586 / 3595          5.8          171.0        1.6X
    */
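If the parameter lands, setting it would presumably look like any other SQL 
conf; a hedged sketch, with the key taken from the issue title and the value 
picked arbitrarily:

{code:java}
// Assumed usage of the proposed knob, given an active SparkSession `spark`.
spark.conf.set("spark.sql.fast.hash.aggregate.row.max.capacity", (1 << 20).toString)
{code}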



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24968) Configurable Chunksize in ChunkedByteBufferOutputStream

2018-07-31 Thread Vincent (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent resolved SPARK-24968.
-
Resolution: Fixed

> Configurable Chunksize in ChunkedByteBufferOutputStream
> ---
>
> Key: SPARK-24968
> URL: https://issues.apache.org/jira/browse/SPARK-24968
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.0, 2.4.0
>Reporter: Vincent
>Priority: Minor
>
> Hello,
> it seems that when creating a _ChunkedByteBufferOutputStream_, the chunk size 
> is always configured to be 4MB. I suggest we make it configurable via Spark 
> conf. This would make it possible to solve issues like SPARK-24917 (by 
> increasing the chunk size in the conf).
> What do you think?
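
A minimal, illustrative sketch (not Spark's implementation) of why the chunk size 
matters for a ChunkedByteBufferOutputStream-style buffer: for the same payload, a 
smaller chunk size simply produces more chunks.

{code:python}
class ChunkedBuffer:
    """Toy stand-in for a chunked output buffer with a configurable chunk size."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = [bytearray()]

    def write(self, data):
        # Append byte by byte, opening a new chunk whenever the current one is full.
        for b in data:
            if len(self.chunks[-1]) >= self.chunk_size:
                self.chunks.append(bytearray())
            self.chunks[-1].append(b)


buf = ChunkedBuffer(chunk_size=4)   # 4 bytes standing in for the 4MB default
buf.write(b"0123456789")
print(len(buf.chunks))              # 3 chunks for a 10-byte payload
{code}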



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24968) Configurable Chunksize in ChunkedByteBufferOutputStream

2018-07-31 Thread Vincent (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563467#comment-16563467
 ] 

Vincent commented on SPARK-24968:
-

Indeed, they are closely related.

I'll close this ticket.

> Configurable Chunksize in ChunkedByteBufferOutputStream
> ---
>
> Key: SPARK-24968
> URL: https://issues.apache.org/jira/browse/SPARK-24968
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.0, 2.4.0
>Reporter: Vincent
>Priority: Minor
>
> Hello,
> it seems that when creating a _ChunkedByteBufferOutputStream_, the chunk size 
> is always configured to be 4MB. I suggest we make it configurable via Spark 
> conf. This would make it possible to solve issues like SPARK-24917 (by 
> increasing the chunk size in the conf).
> What do you think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24946) PySpark - Allow np.Arrays and pd.Series in df.approxQuantile

2018-07-31 Thread Paul Westenthanner (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563455#comment-16563455
 ] 

Paul Westenthanner commented on SPARK-24946:


Sure, go ahead :) 

> PySpark - Allow np.Arrays and pd.Series in df.approxQuantile
> 
>
> Key: SPARK-24946
> URL: https://issues.apache.org/jira/browse/SPARK-24946
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Paul Westenthanner
>Priority: Minor
>  Labels: DataFrame, beginner, pyspark
>
> As a Python user, it would be convenient to pass a numpy array or pandas 
> Series to {{approxQuantile}}(_col_, _probabilities_, _relativeError_) as the 
> probabilities parameter. 
>  
> Especially for creating cumulative plots (say in 1% steps) it is handy to use 
> `approxQuantile(col, np.arange(0, 1.0, 0.01), relativeError)`.
>  
>  
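
A minimal sketch of the current workaround versus the requested usage; the DataFrame 
and column names below are illustrative assumptions. In recent PySpark versions, 
converting the array to a plain Python list is typically enough, since numpy scalars 
subclass Python's float.

{code:python}
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000).selectExpr("CAST(id AS DOUBLE) AS value")

# Today the probabilities argument must be a list or tuple, so the numpy array
# has to be converted first:
quantiles = df.approxQuantile("value", list(np.arange(0.0, 1.0, 0.01)), 0.01)
print(quantiles[:5])

# The request is to accept the array (or a pandas Series) directly, e.g.:
# df.approxQuantile("value", np.arange(0.0, 1.0, 0.01), 0.01)
{code}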



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner

2018-07-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563434#comment-16563434
 ] 

Apache Spark commented on SPARK-14540:
--

User 'skonto' has created a pull request for this issue:
https://github.com/apache/spark/pull/21930

> Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
> 
>
> Key: SPARK-14540
> URL: https://issues.apache.org/jira/browse/SPARK-14540
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>Priority: Major
>
> Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running 
> ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures:
> {code}
> [info] - toplevel return statements in closures are identified at cleaning 
> time *** FAILED *** (32 milliseconds)
> [info]   Expected exception 
> org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no 
> exception was thrown. (ClosureCleanerSuite.scala:57)
> {code}
> and
> {code}
> [info] - user provided closures are actually cleaned *** FAILED *** (56 
> milliseconds)
> [info]   Expected ReturnStatementInClosureException, but got 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not 
> serializable: java.io.NotSerializableException: java.lang.Object
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class 
> org.apache.spark.util.TestUserClosuresActuallyCleaned$, 
> functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I,
>  implementation=invokeStatic 
> org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I,
>  instantiatedMethodType=(I)I, numCaptured=1])
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, 
> functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  
> instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  numCaptured=1])
> [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: 
> "f", type: "interface scala.Function3")
> [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", 
> MapPartitionsRDD[2] at apply at Transformer.scala:22)
> [info]- field (class "scala.Tuple2", name: "_1", type: "class 
> java.lang.Object")
> [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at 
> apply at 
> Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)).
> [info]   This means the closure provided by user is not actually cleaned. 
> (ClosureCleanerSuite.scala:78)
> {code}
> We'll need to figure out a closure cleaning strategy which works for 2.12 
> lambdas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24977) input_file_name() result can't save and use for partitionBy()

2018-07-31 Thread Srinivasarao Padala (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563424#comment-16563424
 ] 

Srinivasarao Padala commented on SPARK-24977:
-

I am able to see the file names with df.show(), but after saving to a file the 
column comes back empty, and it cannot be used for partitionBy() while saving. 
I get the following error:

java.lang.AssertionError: assertion failed: Empty partition column value in 
'filename='
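
A minimal PySpark sketch of the reported scenario; the paths, input format, and 
SparkSession setup are illustrative assumptions, not taken from the report.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

df = (spark.read.option("header", "true").csv("/tmp/input")
      .withColumn("filename", input_file_name()))

# The file names are visible at this point.
df.select("filename").show(truncate=False)

# The report is that the same column comes back empty when it is used as a
# partition column on write, which triggers the assertion quoted above.
df.write.partitionBy("filename").mode("overwrite").parquet("/tmp/output")
{code}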

> input_file_name() result can't save and use for partitionBy()
> -
>
> Key: SPARK-24977
> URL: https://issues.apache.org/jira/browse/SPARK-24977
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.3.1
>Reporter: Srinivasarao Padala
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24977) input_file_name() result can't save and use for paritionBy()

2018-07-31 Thread Srinivasarao Padala (JIRA)
Srinivasarao Padala created SPARK-24977:
---

 Summary: input_file_name() result can't save and use for 
paritionBy()
 Key: SPARK-24977
 URL: https://issues.apache.org/jira/browse/SPARK-24977
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core, SQL
Affects Versions: 2.3.1
Reporter: Srinivasarao Padala






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24977) input_file_name() result can't save and use for partitionBy()

2018-07-31 Thread Srinivasarao Padala (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinivasarao Padala updated SPARK-24977:

Summary: input_file_name() result can't save and use for partitionBy()  
(was: input_file_name() result can't save and use for paritionBy())

> input_file_name() result can't save and use for partitionBy()
> -
>
> Key: SPARK-24977
> URL: https://issues.apache.org/jira/browse/SPARK-24977
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.3.1
>Reporter: Srinivasarao Padala
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24968) Configurable Chunksize in ChunkedByteBufferOutputStream

2018-07-31 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563274#comment-16563274
 ] 

Hyukjin Kwon commented on SPARK-24968:
--

So is this roughly a duplicate of SPARK-24917? I suggest leaving this resolved 
if either issue is a subset of the other.

> Configurable Chunksize in ChunkedByteBufferOutputStream
> ---
>
> Key: SPARK-24968
> URL: https://issues.apache.org/jira/browse/SPARK-24968
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.0, 2.4.0
>Reporter: Vincent
>Priority: Minor
>
> Hello,
> it seems that when creating a _ChunkedByteBufferOutputStream_, the chunk size 
> is always configured to be 4MB. I suggest we make it configurable via Spark 
> conf. This would make it possible to solve issues like SPARK-24917 (by 
> increasing the chunk size in the conf).
> What do you think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24949) pyspark.sql.Column breaks the iterable contract

2018-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24949.
--
Resolution: Won't Fix

Let me leave this resolved. It seems difficult to fix, and I wonder if it's 
worth it. Please reopen and go ahead if a PR is ready; I'll leave this 
resolved until then.

> pyspark.sql.Column breaks the iterable contract
> ---
>
> Key: SPARK-24949
> URL: https://issues.apache.org/jira/browse/SPARK-24949
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Daniel Shields
>Priority: Minor
>
> pyspark.sql.Column implements __iter__ just to raise a TypeError:
> {code:java}
> def __iter__(self):
> raise TypeError("Column is not iterable")
> {code}
> This makes column look iterable even when it isn't:
> {code:java}
> isinstance(mycolumn, collections.Iterable) # Evaluates to True{code}
> This function should be removed from Column completely so it behaves like 
> every other non-iterable class.
> For further motivation of why this should be fixed, consider the below 
> example, which currently requires listing Column explicitly:
> {code:java}
> def listlike(value):
> # Column unfortunately implements __iter__ just to raise a TypeError.
> # This breaks the iterable contract and should be fixed in Spark proper.
> return isinstance(value, collections.Iterable) and not isinstance(value, 
> (str, bytes, pyspark.sql.Column))
> {code}
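
A short snippet reproducing the mismatch described above (collections.abc is used 
here for Python 3 compatibility; the column name is illustrative):

{code:python}
import collections.abc

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
col = F.col("x")

# Looks iterable because Column defines __iter__ ...
print(isinstance(col, collections.abc.Iterable))   # True

# ... but actually iterating raises immediately.
try:
    iter(col)
except TypeError as e:
    print(e)                                        # Column is not iterable
{code}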



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24970) Spark Kinesis streaming application fails to recover from streaming checkpoint due to ProvisionedThroughputExceededException

2018-07-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563267#comment-16563267
 ] 

Apache Spark commented on SPARK-24970:
--

User 'brucezhao11' has created a pull request for this issue:
https://github.com/apache/spark/pull/21929

> Spark Kinesis streaming application fails to recover from streaming 
> checkpoint due to ProvisionedThroughputExceededException
> 
>
> Key: SPARK-24970
> URL: https://issues.apache.org/jira/browse/SPARK-24970
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: bruce_zhao
>Priority: Major
>  Labels: kinesis
>
>  
> We're using Spark Streaming to consume Kinesis data, and found that it reads 
> more data from Kinesis and can easily hit a 
> ProvisionedThroughputExceededException *when it recovers from a streaming 
> checkpoint*. 
>  
> Normally, it's a WARN in spark log. But when we have multiple streaming 
> applications (i.e., 5 applications) to consume the same Kinesis stream, the 
> situation becomes serious. *The application will fail to recover due to the 
> following exception in driver.*  And one application failure will also affect 
> the other running applications. 
>   
> {panel:title=Exception}
> org.apache.spark.SparkException: Job aborted due to stage failure: 
> {color:#ff}*Task 5 in stage 7.0 failed 4 times, most recent 
> failure*:{color} Lost task 5.3 in stage 7.0 (TID 128, 
> ip-172-31-14-36.ap-northeast-1.compute.internal, executor 1): 
> org.apache.spark.SparkException: Gave up after 3 retries while getting 
> records using shard iterator, last exception: at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288)
>  at scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282)
>  at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecordsAndNextKinesisIterator(KinesisBackedBlockRDD.scala:223)
>  at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:207)
>  at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162)
>  at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133)
>  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954) at 
> org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
> Caused by: 
> *{color:#ff}com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException:
>  Rate exceeded for shard shardId- in stream rellfsstream-an under 
> account 1{color}.* (Service: AmazonKinesis; Status Code: 400; 
> Error Code: ProvisionedThroughputExceededException; Request ID: 
> d3520677-060e-14c4-8014-2886b6b75f03) at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1257)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1029)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647)
>  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511) at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2219)
>  at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2195)
>  at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1004)
>  at 
> 

[jira] [Assigned] (SPARK-24970) Spark Kinesis streaming application fails to recover from streaming checkpoint due to ProvisionedThroughputExceededException

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24970:


Assignee: (was: Apache Spark)

> Spark Kinesis streaming application fails to recover from streaming 
> checkpoint due to ProvisionedThroughputExceededException
> 
>
> Key: SPARK-24970
> URL: https://issues.apache.org/jira/browse/SPARK-24970
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: bruce_zhao
>Priority: Major
>  Labels: kinesis
>
>  
> We're using Spark Streaming to consume Kinesis data, and found that it reads 
> more data from Kinesis and can easily hit a 
> ProvisionedThroughputExceededException *when it recovers from a streaming 
> checkpoint*. 
>  
> Normally, it's a WARN in spark log. But when we have multiple streaming 
> applications (i.e., 5 applications) to consume the same Kinesis stream, the 
> situation becomes serious. *The application will fail to recover due to the 
> following exception in driver.*  And one application failure will also affect 
> the other running applications. 
>   
> {panel:title=Exception}
> org.apache.spark.SparkException: Job aborted due to stage failure: 
> {color:#ff}*Task 5 in stage 7.0 failed 4 times, most recent 
> failure*:{color} Lost task 5.3 in stage 7.0 (TID 128, 
> ip-172-31-14-36.ap-northeast-1.compute.internal, executor 1): 
> org.apache.spark.SparkException: Gave up after 3 retries while getting 
> records using shard iterator, last exception: at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288)
>  at scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282)
>  at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecordsAndNextKinesisIterator(KinesisBackedBlockRDD.scala:223)
>  at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:207)
>  at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162)
>  at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133)
>  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954) at 
> org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
> Caused by: 
> *{color:#ff}com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException:
>  Rate exceeded for shard shardId- in stream rellfsstream-an under 
> account 1{color}.* (Service: AmazonKinesis; Status Code: 400; 
> Error Code: ProvisionedThroughputExceededException; Request ID: 
> d3520677-060e-14c4-8014-2886b6b75f03) at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1257)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1029)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647)
>  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511) at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2219)
>  at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2195)
>  at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1004)
>  at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:980)
>  at 
> 

[jira] [Assigned] (SPARK-24970) Spark Kinesis streaming application fails to recover from streaming checkpoint due to ProvisionedThroughputExceededException

2018-07-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24970:


Assignee: Apache Spark

> Spark Kinesis streaming application fails to recover from streaming 
> checkpoint due to ProvisionedThroughputExceededException
> 
>
> Key: SPARK-24970
> URL: https://issues.apache.org/jira/browse/SPARK-24970
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: bruce_zhao
>Assignee: Apache Spark
>Priority: Major
>  Labels: kinesis
>
>  
> We're using Spark Streaming to consume Kinesis data, and found that it reads 
> more data from Kinesis and can easily hit a 
> ProvisionedThroughputExceededException *when it recovers from a streaming 
> checkpoint*. 
>  
> Normally, it's a WARN in spark log. But when we have multiple streaming 
> applications (i.e., 5 applications) to consume the same Kinesis stream, the 
> situation becomes serious. *The application will fail to recover due to the 
> following exception in driver.*  And one application failure will also affect 
> the other running applications. 
>   
> {panel:title=Exception}
> org.apache.spark.SparkException: Job aborted due to stage failure: 
> {color:#ff}*Task 5 in stage 7.0 failed 4 times, most recent 
> failure*:{color} Lost task 5.3 in stage 7.0 (TID 128, 
> ip-172-31-14-36.ap-northeast-1.compute.internal, executor 1): 
> org.apache.spark.SparkException: Gave up after 3 retries while getting 
> records using shard iterator, last exception: at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288)
>  at scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282)
>  at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecordsAndNextKinesisIterator(KinesisBackedBlockRDD.scala:223)
>  at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:207)
>  at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162)
>  at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133)
>  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954) at 
> org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
> Caused by: 
> *{color:#ff}com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException:
>  Rate exceeded for shard shardId- in stream rellfsstream-an under 
> account 1{color}.* (Service: AmazonKinesis; Status Code: 400; 
> Error Code: ProvisionedThroughputExceededException; Request ID: 
> d3520677-060e-14c4-8014-2886b6b75f03) at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1257)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1029)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647)
>  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511) at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2219)
>  at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2195)
>  at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1004)
>  at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:980)
>  

[jira] [Resolved] (SPARK-24975) Spark history server REST API /api/v1/version returns error 404

2018-07-31 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24975.
-
Resolution: Duplicate

> Spark history server REST API /api/v1/version returns error 404
> ---
>
> Key: SPARK-24975
> URL: https://issues.apache.org/jira/browse/SPARK-24975
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: shanyu zhao
>Priority: Major
>
> Spark history server REST API provides /api/v1/version, according to doc:
> [https://spark.apache.org/docs/latest/monitoring.html]
> However, for Spark 2.3, we see:
> {code:java}
> curl http://localhost:18080/api/v1/version
> 
> Error 404 Not Found
> 
> HTTP ERROR 404
> Problem accessing /api/v1/version. Reason:
>     Not Found
> 
> Powered by Jetty:// 9.3.z-SNAPSHOT (http://eclipse.org/jetty)
> 
> {code}
> On a Spark 2.2 cluster, we see:
> {code:java}
> curl http://localhost:18080/api/v1/version
> {
> "spark" : "2.2.0"
> }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24975) Spark history server REST API /api/v1/version returns error 404

2018-07-31 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563259#comment-16563259
 ] 

Marco Gaido commented on SPARK-24975:
-

This seems to be a duplicate of SPARK-24188. However, 2.3.1 is listed as 
affected here, which should not be the case according to SPARK-24188. Could 
you please check whether 2.3.1 is actually affected and, if not, close this 
as a duplicate? Thanks.

> Spark history server REST API /api/v1/version returns error 404
> ---
>
> Key: SPARK-24975
> URL: https://issues.apache.org/jira/browse/SPARK-24975
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: shanyu zhao
>Priority: Major
>
> Spark history server REST API provides /api/v1/version, according to doc:
> [https://spark.apache.org/docs/latest/monitoring.html]
> However, for Spark 2.3, we see:
> {code:java}
> curl http://localhost:18080/api/v1/version
> 
> Error 404 Not Found
> 
> HTTP ERROR 404
> Problem accessing /api/v1/version. Reason:
>     Not Found
> 
> Powered by Jetty:// 9.3.z-SNAPSHOT (http://eclipse.org/jetty)
> 
> {code}
> On a Spark 2.2 cluster, we see:
> {code:java}
> curl http://localhost:18080/api/v1/version
> {
> "spark" : "2.2.0"
> }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24720) kafka transaction creates Non-consecutive Offsets (due to transaction offset) making streaming fail when failOnDataLoss=true

2018-07-31 Thread Quentin Ambard (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563256#comment-16563256
 ] 

Quentin Ambard commented on SPARK-24720:


Maybe we could improve that. Since we already have to iterate over the records 
to count the number of entries for the inputInfoTracker, we could do a single 
iteration that collects both the count and the last offset, as sketched below.
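
A minimal sketch of that single-pass idea (illustrative only, not Spark's actual 
code): one iteration yields both the record count needed for the input info 
tracker and the last offset seen.

{code:python}
def count_and_last_offset(records):
    """records: an iterable of objects exposing an `offset` attribute."""
    count = 0
    last_offset = None
    for record in records:
        count += 1
        last_offset = record.offset
    return count, last_offset
{code}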

> kafka transaction creates Non-consecutive Offsets (due to transaction offset) 
> making streaming fail when failOnDataLoss=true
> 
>
> Key: SPARK-24720
> URL: https://issues.apache.org/jira/browse/SPARK-24720
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Quentin Ambard
>Priority: Major
>
> When Kafka transactions are used, sending 1 message to Kafka results in 1 
> offset for the data + 1 offset to mark the transaction.
> When the Kafka connector for Spark streaming reads a topic with non-consecutive 
> offsets, it fails. SPARK-17147 fixed this issue for compacted topics.
>  However, SPARK-17147 doesn't fix this issue for Kafka transactions: if 1 
> message + 1 transaction commit are in a partition, Spark will try to read 
> offsets [0, 2). Offset 0 (containing the message) will be read, but offset 1 
> won't return a value and buffer.hasNext() will be false even after a poll, 
> since no data is present for offset 1 (it's the transaction commit).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24972) PivotFirst could not handle pivot columns of complex types

2018-07-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24972.
-
   Resolution: Fixed
 Assignee: Maryann Xue
Fix Version/s: 2.4.0

> PivotFirst could not handle pivot columns of complex types
> --
>
> Key: SPARK-24972
> URL: https://issues.apache.org/jira/browse/SPARK-24972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Minor
> Fix For: 2.4.0
>
>
> {{PivotFirst}} did not handle complex types for pivot columns properly. As a 
> result, the pivot column could not be matched with any pivot value, and the 
> query always returned an empty result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org