[jira] [Commented] (SPARK-24444) Improve pandas_udf GROUPED_MAP docs to explain column assignment

2018-05-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497587#comment-16497587
 ] 

Apache Spark commented on SPARK-24444:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/21478

> Improve pandas_udf GROUPED_MAP docs to explain column assignment
> 
>
> Key: SPARK-24444
> URL: https://issues.apache.org/jira/browse/SPARK-24444
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> There have been several bugs regarding this and a clean solution still 
> changes some behavior.  Until this can be resolved, improve docs to explain 
> that columns are assigned by position.
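
For reference, a minimal PySpark 2.3 sketch of the behavior the doc change describes (it assumes a running SparkSession named spark): with a GROUPED_MAP pandas_udf, the columns of the returned pandas.DataFrame are matched to the declared schema by position, not by name.

{code:python}
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # The first returned column is taken as "id" and the second as "v",
    # regardless of the pandas column names.
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
{code}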






[jira] [Resolved] (SPARK-23920) High-order function: array_remove(x, element) → array

2018-05-31 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23920.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21069
https://github.com/apache/spark/pull/21069

> High-order function: array_remove(x, element) → array
> -
>
> Key: SPARK-23920
> URL: https://issues.apache.org/jira/browse/SPARK-23920
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Remove all elements that equal element from array x.
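
For reference, the Presto-style semantics described above, shown through the SQL function once it is available (assumes a Spark build with this patch and a SparkSession named spark):

{code:python}
# array_remove drops every element equal to the second argument.
spark.sql("SELECT array_remove(array(1, 2, 3, 2, 2), 2) AS result").show()
# +------+
# |result|
# +------+
# |[1, 3]|
# +------+
{code}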






[jira] [Assigned] (SPARK-23920) High-order function: array_remove(x, element) → array

2018-05-31 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23920:
-

Assignee: Huaxin Gao

> High-order function: array_remove(x, element) → array
> -
>
> Key: SPARK-23920
> URL: https://issues.apache.org/jira/browse/SPARK-23920
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Huaxin Gao
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Remove all elements that equal element from array x.






[jira] [Commented] (SPARK-24448) File not found on the address SparkFiles.get returns on standalone cluster

2018-05-31 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497548#comment-16497548
 ] 

Saisai Shao commented on SPARK-24448:
-

Does it only happen in standalone cluster mode? Have you tried client mode?

> File not found on the address SparkFiles.get returns on standalone cluster
> --
>
> Key: SPARK-24448
> URL: https://issues.apache.org/jira/browse/SPARK-24448
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Pritpal Singh
>Priority: Major
>
> I want to upload a file on all worker nodes in a standalone cluster and 
> retrieve the location of the file. Here is my code:
>  
> val tempKeyStoreLoc = System.getProperty("java.io.tmpdir") + "/keystore.jks"
> val file = new File(tempKeyStoreLoc)
> sparkContext.addFile(file.getAbsolutePath)
> val keyLoc = SparkFiles.get("keystore.jks")
>  
> SparkFiles.get returns a random location where keystore.jks does not exist. I 
> submit the job in cluster mode. In fact the location SparkFiles.get returns does 
> not exist on any of the worker nodes (including the driver node). 
> I observed that Spark does load keystore.jks files on worker nodes at 
> /work///keystore.jks. The partition_id 
> changes from one worker node to another.
> My requirement is to upload a file on all nodes of a cluster and retrieve its 
> location. I'm expecting the location to be common across all worker nodes.
>  
>  






[jira] [Resolved] (SPARK-24326) Add local:// scheme support for the app jar in mesos cluster mode

2018-05-31 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-24326.
--
   Resolution: Fixed
 Assignee: Stavros Kontopoulos
Fix Version/s: 2.4.0

>  Add local:// scheme support for the app jar in mesos cluster mode
> --
>
> Key: SPARK-24326
> URL: https://issues.apache.org/jira/browse/SPARK-24326
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 2.4.0
>
>
> It is often useful to reference an application jar within the image used to 
> deploy a Spark job on mesos in cluster mode. This is not possible right now 
> because the mesos dispatcher will try to resolve the local://... uri on the 
> host (via the fetcher) and not in the container. Target is to have a scheme 
> like local:/// being resolved in the container's fs.






[jira] [Resolved] (SPARK-24444) Improve pandas_udf GROUPED_MAP docs to explain column assignment

2018-05-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24444.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21471
[https://github.com/apache/spark/pull/21471]

> Improve pandas_udf GROUPED_MAP docs to explain column assignment
> 
>
> Key: SPARK-24444
> URL: https://issues.apache.org/jira/browse/SPARK-24444
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> There have been several bugs regarding this and a clean solution still 
> changes some behavior.  Until this can be resolved, improve docs to explain 
> that columns are assigned by position.






[jira] [Created] (SPARK-24448) File not found on the address SparkFiles.get returns on standalone cluster

2018-05-31 Thread Pritpal Singh (JIRA)
Pritpal Singh created SPARK-24448:
-

 Summary: File not found on the address SparkFiles.get returns on 
standalone cluster
 Key: SPARK-24448
 URL: https://issues.apache.org/jira/browse/SPARK-24448
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Pritpal Singh


I want to upload a file on all worker nodes in a standalone cluster and 
retrieve the location of the file. Here is my code:

 

val tempKeyStoreLoc = System.getProperty("java.io.tmpdir") + "/keystore.jks"
val file = new File(tempKeyStoreLoc)
sparkContext.addFile(file.getAbsolutePath)
val keyLoc = SparkFiles.get("keystore.jks")

 

SparkFiles.get returns a random location where keystore.jks does not exist. I 
submit the job in cluster mode. In fact the location SparkFiles.get returns does 
not exist on any of the worker nodes (including the driver node). 

I observed that Spark does load keystore.jks files on worker nodes at 
/work///keystore.jks. The partition_id 
changes from one worker node to another.

My requirement is to upload a file on all nodes of a cluster and retrieve its 
location. I'm expecting the location to be common across all worker nodes.
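
For comparison, the documented pattern is that SparkContext.addFile ships the file to every node and SparkFiles.get resolves the node-local path on whichever node calls it, so the lookup is usually done inside a task on the executor rather than captured once on the driver. A minimal PySpark sketch (assuming a running SparkContext named sc and the same keystore file):

{code:python}
from pyspark import SparkFiles

sc.addFile("/tmp/keystore.jks")  # ship the file to the driver and every executor

def read_keystore(_):
    # Resolve the node-local copy on whichever executor runs this partition.
    local_path = SparkFiles.get("keystore.jks")
    with open(local_path, "rb") as f:
        yield (local_path, len(f.read()))

print(sc.parallelize(range(2), 2).mapPartitions(read_keystore).collect())
{code}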

 

 






[jira] [Created] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context

2018-05-31 Thread Perry Chu (JIRA)
Perry Chu created SPARK-24447:
-

 Summary: Pyspark RowMatrix.columnSimilarities() loses spark context
 Key: SPARK-24447
 URL: https://issues.apache.org/jira/browse/SPARK-24447
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 2.3.0
Reporter: Perry Chu


The RDD behind the CoordinateMatrix returned by RowMatrix.columnSimilarities() 
appears to be losing track of the spark context. 

I'm pretty new to spark - not sure if the problem is on the python side or the 
scala side - would appreciate someone more experienced taking a look.

This snippet should reproduce the error:
{code:java}
from pyspark.mllib.linalg.distributed import RowMatrix

rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
matrix = RowMatrix(rows)
sims = matrix.columnSimilarities()

## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
print(sims.numRows(),sims.numCols())

## This throws an error (stack trace below)
print(sims.entries.first())

## Later I tried this
print(rows.context) #
print(sims.entries.context) #, 
then throws an error{code}
Error stack trace
{code:java}
---
AttributeError Traceback (most recent call last)
 in ()
> 1 sims.entries.first()

/usr/lib/spark/python/pyspark/rdd.py in first(self)
1374 ValueError: RDD is empty
1375 """
-> 1376 rs = self.take(1)
1377 if rs:
1378 return rs[0]

/usr/lib/spark/python/pyspark/rdd.py in take(self, num)
1356
1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1358 res = self.context.runJob(self, takeUpToNumLeft, p)
1359
1360 items += res

/usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, 
partitions, allowLocal)
999 # SparkContext#runJob.
1000 mappedRDD = rdd.mapPartitions(partitionFunc)
-> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
partitions)
1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
1003

AttributeError: 'NoneType' object has no attribute 'sc'
{code}
PySpark columnSimilarities documentation

http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities






[jira] [Assigned] (SPARK-24330) Refactor ExecuteWriteTask in FileFormatWriter with DataWriter(V2)

2018-05-31 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24330:
---

Assignee: Gengliang Wang

> Refactor ExecuteWriteTask in FileFormatWriter with DataWriter(V2)
> -
>
> Key: SPARK-24330
> URL: https://issues.apache.org/jira/browse/SPARK-24330
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Refactor ExecuteWriteTask in FileFormatWriter to reduce common logic and 
> improve readability.
> After the change, callers only need to call {{commit()}} or {{abort}} at the 
> end of task.
> Also there is less code in {{SingleDirectoryWriteTask}} and 
> {{DynamicPartitionWriteTask}}.
> Definitions of related classes are moved to a new file, and 
> {{ExecuteWriteTask}} is renamed to {{FileFormatDataWriter}}.






[jira] [Resolved] (SPARK-24330) Refactor ExecuteWriteTask in FileFormatWriter with DataWriter(V2)

2018-05-31 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24330.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21381
[https://github.com/apache/spark/pull/21381]

> Refactor ExecuteWriteTask in FileFormatWriter with DataWriter(V2)
> -
>
> Key: SPARK-24330
> URL: https://issues.apache.org/jira/browse/SPARK-24330
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Refactor ExecuteWriteTask in FileFormatWriter to reduce common logic and 
> improve readability.
> After the change, callers only need to call {{commit()}} or {{abort}} at the 
> end of task.
> Also there is less code in {{SingleDirectoryWriteTask}} and 
> {{DynamicPartitionWriteTask}}.
> Definitions of related classes are moved to a new file, and 
> {{ExecuteWriteTask}} is renamed to {{FileFormatDataWriter}}.






[jira] [Commented] (SPARK-23754) StopIterator exception in Python UDF results in partial result

2018-05-31 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497461#comment-16497461
 ] 

Hyukjin Kwon commented on SPARK-23754:
--

For clarification, this is _fixed_. We are just trying to clean up a bit which 
shouldn't be a blocker.

> StopIterator exception in Python UDF results in partial result
> --
>
> Key: SPARK-23754
> URL: https://issues.apache.org/jira/browse/SPARK-23754
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Assignee: Emilio Dorigatti
>Priority: Blocker
> Fix For: 2.3.1, 2.4.0
>
>
> Reproduce:
> {code:java}
> df = spark.range(0, 1000)
> from pyspark.sql.functions import udf
> def foo(x):
> raise StopIteration()
> df.withColumn('v', udf(foo)).show()
> # Results
> # +---+---+
> # | id|  v|
> # +---+---+
> # +---+---+{code}
> I think the task should fail in this case






[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different threads

2018-05-31 Thread fengchaoge (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497456#comment-16497456
 ] 

fengchaoge commented on SPARK-21918:


Hu Liu   gone?

> HiveClient shouldn't share Hive object between different threads
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>Priority: Major
>
> I'm testing the Spark Thrift Server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should only share the Hive object within a single 
> thread so that the metastore client in Hive can be associated with the right 
> user.
> We can pass the Hive object of the parent thread to the child thread when 
> running the SQL to fix it.
> I already have an initial patch for review and I'm glad to work on it if 
> anyone could assign it to me.






[jira] [Updated] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()

2018-05-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24442:
-
Fix Version/s: (was: 2.4.0)

> Add configuration parameter to adjust the number of records and the characters 
> per row before truncation when a user runs .show()
> ---
>
> Key: SPARK-24442
> URL: https://issues.apache.org/jira/browse/SPARK-24442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Andrew K Long
>Priority: Minor
> Attachments: spark-adjustable-display-size.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently the number of characters displayed when a user runs the .show() 
> function on a data frame is hard coded. The current default is too small when 
> used with wider console widths.  This fix will add two parameters.
>  
> parameter: "spark.show.default.number.of.rows" default: "20"
> parameter: "spark.show.default.truncate.characters.per.column" default: "20"
>  
> This change will be backwards compatible and will not break any existing 
> functionality nor change the default display characteristics.
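
For reference, the proposed configuration would only supply defaults for knobs that already exist per call on DataFrame.show; a minimal PySpark sketch (assumes an existing DataFrame named df):

{code:python}
# Show up to 50 rows and allow 100 characters per column before truncating.
df.show(n=50, truncate=100)

# truncate=False disables truncation entirely.
df.show(n=50, truncate=False)
{code}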






[jira] [Updated] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()

2018-05-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24442:
-
Component/s: (was: Input/Output)
 SQL

> Add configuration parameter to adjust the number of records and the characters 
> per row before truncation when a user runs .show()
> ---
>
> Key: SPARK-24442
> URL: https://issues.apache.org/jira/browse/SPARK-24442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Andrew K Long
>Priority: Minor
> Attachments: spark-adjustable-display-size.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently the number of characters displayed when a user runs the .show() 
> function on a data frame is hard coded. The current default is too small when 
> used with wider console widths.  This fix will add two parameters.
>  
> parameter: "spark.show.default.number.of.rows" default: "20"
> parameter: "spark.show.default.truncate.characters.per.column" default: "20"
>  
> This change will be backwards compatible and will not break any existing 
> functionality nor change the default display characteristics.






[jira] [Commented] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()

2018-05-31 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497458#comment-16497458
 ] 

Hyukjin Kwon commented on SPARK-24442:
--

Please avoid setting Fix Version which is usually set when it's actually fixed.

> Add configuration parameter to adjust the number of records and the characters 
> per row before truncation when a user runs .show()
> ---
>
> Key: SPARK-24442
> URL: https://issues.apache.org/jira/browse/SPARK-24442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Andrew K Long
>Priority: Minor
> Attachments: spark-adjustable-display-size.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently the number of characters displayed when a user runs the .show() 
> function on a data frame is hard coded. The current default is too small when 
> used with wider console widths.  This fix will add two parameters.
>  
> parameter: "spark.show.default.number.of.rows" default: "20"
> parameter: "spark.show.default.truncate.characters.per.column" default: "20"
>  
> This change will be backwards compatible and will not break any existing 
> functionality nor change the default display characteristics.






[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2018-05-31 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497448#comment-16497448
 ] 

Hyukjin Kwon commented on SPARK-21187:
--

(y)

> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
>  * -*Date*-
>  * -*Timestamp*-
>  * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map
>  * -*Decimal*-
>  * *Binary* - in pyspark
> Some things to do before closing this out:
>  * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
>  * -Need to add some user docs-
>  * -Make sure Python tests are thorough-
>  * Check into complex type support mentioned in comments by [~leif], should 
> we support multi-indexing?






[jira] [Updated] (SPARK-24444) Improve pandas_udf GROUPED_MAP docs to explain column assignment

2018-05-31 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-24444:
-
Target Version/s: 2.3.1, 2.4.0  (was: 2.3.1)

> Improve pandas_udf GROUPED_MAP docs to explain column assignment
> 
>
> Key: SPARK-24444
> URL: https://issues.apache.org/jira/browse/SPARK-24444
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> There have been several bugs regarding this and a clean solution still 
> changes some behavior.  Until this can be resolved, improve docs to explain 
> that columns are assigned by position.






[jira] [Commented] (SPARK-24396) Add Structured Streaming ForeachWriter for python

2018-05-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497283#comment-16497283
 ] 

Apache Spark commented on SPARK-24396:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/21477

> Add Structured Streaming ForeachWriter for python
> -
>
> Key: SPARK-24396
> URL: https://issues.apache.org/jira/browse/SPARK-24396
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Users should be able to write ForeachWriter code in Python; that is, they 
> should be able to use the partition id and the version/batchId/epochId to 
> conditionally process rows.
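
At the time of this thread the Python API is still under review in the linked pull request, so the following is only a sketch of the shape described above (an object whose hooks receive the partition id and the epoch/batch id); the class name and the stream_df variable are illustrative, not the final API.

{code:python}
class RowPrinter(object):
    def open(self, partition_id, epoch_id):
        # Decide per partition/epoch whether this partition's rows get processed.
        return partition_id % 2 == 0

    def process(self, row):
        print(row)

    def close(self, error):
        pass

query = stream_df.writeStream.foreach(RowPrinter()).start()
{code}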






[jira] [Commented] (SPARK-24446) Library path with special characters breaks Spark on YARN

2018-05-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497275#comment-16497275
 ] 

Apache Spark commented on SPARK-24446:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21476

> Library path with special characters breaks Spark on YARN
> -
>
> Key: SPARK-24446
> URL: https://issues.apache.org/jira/browse/SPARK-24446
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> When YARN runs the application's main command, it does it like this:
> {code}
> bash -c "<command>"
> {code}
> The way Spark injects the library path into that command makes it look like 
> this:
> {code}
> bash -c "LD_LIBRARY_PATH="/foo:/bar:/baz:$LD_LIBRARY_PATH"  command>"
> {code}
> So that works kinda out of luck, because the concatenation of the strings 
> creates a proper final command... except if you have something like a space 
> or an ampersand in the library path, in which case all containers will fail 
> with a cryptic message like the following:
> {noformat}
> WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Executor for 
> container container_1475411358336_0010_01_02 exited because of a YARN 
> event (e.g., pre-emption) and not because of an error in the running job.
> {noformat}
> And no useful log output.






[jira] [Assigned] (SPARK-24446) Library path with special characters breaks Spark on YARN

2018-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24446:


Assignee: (was: Apache Spark)

> Library path with special characters breaks Spark on YARN
> -
>
> Key: SPARK-24446
> URL: https://issues.apache.org/jira/browse/SPARK-24446
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> When YARN runs the application's main command, it does it like this:
> {code}
> bash -c "<command>"
> {code}
> The way Spark injects the library path into that command makes it look like 
> this:
> {code}
> bash -c "LD_LIBRARY_PATH="/foo:/bar:/baz:$LD_LIBRARY_PATH"  command>"
> {code}
> So that works kinda out of luck, because the concatenation of the strings 
> creates a proper final command... except if you have something like a space 
> or an ampersand in the library path, in which case all containers will fail 
> with a cryptic message like the following:
> {noformat}
> WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Executor for 
> container container_1475411358336_0010_01_02 exited because of a YARN 
> event (e.g., pre-emption) and not because of an error in the running job.
> {noformat}
> And no useful log output.






[jira] [Assigned] (SPARK-24446) Library path with special characters breaks Spark on YARN

2018-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24446:


Assignee: Apache Spark

> Library path with special characters breaks Spark on YARN
> -
>
> Key: SPARK-24446
> URL: https://issues.apache.org/jira/browse/SPARK-24446
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> When YARN runs the application's main command, it does it like this:
> {code}
> bash -c "<command>"
> {code}
> The way Spark injects the library path into that command makes it look like 
> this:
> {code}
> bash -c "LD_LIBRARY_PATH="/foo:/bar:/baz:$LD_LIBRARY_PATH"  command>"
> {code}
> So that works kinda out of luck, because the concatenation of the strings 
> creates a proper final command... except if you have something like a space 
> or an ampersand in the library path, in which case all containers will fail 
> with a cryptic message like the following:
> {noformat}
> WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Executor for 
> container container_1475411358336_0010_01_02 exited because of a YARN 
> event (e.g., pre-emption) and not because of an error in the running job.
> {noformat}
> And no useful log output.






[jira] [Assigned] (SPARK-24416) Update configuration definition for spark.blacklist.killBlacklistedExecutors

2018-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24416:


Assignee: (was: Apache Spark)

> Update configuration definition for spark.blacklist.killBlacklistedExecutors
> 
>
> Key: SPARK-24416
> URL: https://issues.apache.org/jira/browse/SPARK-24416
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sanket Reddy
>Priority: Minor
>
> spark.blacklist.killBlacklistedExecutors is defined as 
> (Experimental) If set to "true", allow Spark to automatically kill, and 
> attempt to re-create, executors when they are blacklisted. Note that, when an 
> entire node is added to the blacklist, all of the executors on that node will 
> be killed.
> I presume the killing of blacklisted executors only happens after the stage 
> completes successfully and all tasks have completed or on fetch failures 
> (updateBlacklistForFetchFailure/updateBlacklistForSuccessfulTaskSet). It is 
> confusing because the definition states that the executor will be re-created 
> as soon as it is blacklisted. This is not the case: while the stage is in 
> progress and an executor is blacklisted, Spark will not attempt any cleanup 
> until the stage finishes.
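
For context, the property under discussion is an ordinary Spark conf set alongside the other blacklisting options; a minimal sketch with illustrative values:

{code:python}
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.blacklist.enabled", "true")
        # The property whose documented behavior this issue asks to clarify:
        .set("spark.blacklist.killBlacklistedExecutors", "true"))
{code}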






[jira] [Assigned] (SPARK-24416) Update configuration definition for spark.blacklist.killBlacklistedExecutors

2018-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24416:


Assignee: Apache Spark

> Update configuration definition for spark.blacklist.killBlacklistedExecutors
> 
>
> Key: SPARK-24416
> URL: https://issues.apache.org/jira/browse/SPARK-24416
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sanket Reddy
>Assignee: Apache Spark
>Priority: Minor
>
> spark.blacklist.killBlacklistedExecutors is defined as 
> (Experimental) If set to "true", allow Spark to automatically kill, and 
> attempt to re-create, executors when they are blacklisted. Note that, when an 
> entire node is added to the blacklist, all of the executors on that node will 
> be killed.
> I presume the killing of blacklisted executors only happens after the stage 
> completes successfully and all tasks have completed or on fetch failures 
> (updateBlacklistForFetchFailure/updateBlacklistForSuccessfulTaskSet). It is 
> confusing because the definition states that the executor will be re-created 
> as soon as it is blacklisted. This is not the case: while the stage is in 
> progress and an executor is blacklisted, Spark will not attempt any cleanup 
> until the stage finishes.






[jira] [Commented] (SPARK-24416) Update configuration definition for spark.blacklist.killBlacklistedExecutors

2018-05-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497259#comment-16497259
 ] 

Apache Spark commented on SPARK-24416:
--

User 'redsanket' has created a pull request for this issue:
https://github.com/apache/spark/pull/21475

> Update configuration definition for spark.blacklist.killBlacklistedExecutors
> 
>
> Key: SPARK-24416
> URL: https://issues.apache.org/jira/browse/SPARK-24416
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sanket Reddy
>Priority: Minor
>
> spark.blacklist.killBlacklistedExecutors is defined as 
> (Experimental) If set to "true", allow Spark to automatically kill, and 
> attempt to re-create, executors when they are blacklisted. Note that, when an 
> entire node is added to the blacklist, all of the executors on that node will 
> be killed.
> I presume the killing of blacklisted executors only happens after the stage 
> completes successfully and all tasks have completed or on fetch failures 
> (updateBlacklistForFetchFailure/updateBlacklistForSuccessfulTaskSet). It is 
> confusing because the definition states that the executor will be re-created 
> as soon as it is blacklisted. This is not the case: while the stage is in 
> progress and an executor is blacklisted, Spark will not attempt any cleanup 
> until the stage finishes.






[jira] [Commented] (SPARK-21063) Spark returns an empty result from a remote Hadoop cluster

2018-05-31 Thread Federico Lasa (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497250#comment-16497250
 ] 

Federico Lasa commented on SPARK-21063:
---

Affected as well on 2.0.0 (HDP 2.5.3)

> Spark returns an empty result from a remote Hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Peter Bykov
>Priority: Major
>
> Spark returns an empty result when querying a remote Hadoop cluster.
> All firewall settings have been removed.
> Querying via JDBC works properly using the hive-jdbc driver, version 1.1.1.
> Code snippet is:
> {code:java}
> val spark = SparkSession.builder
> .appName("RemoteSparkTest")
> .master("local")
> .getOrCreate()
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>  
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations like: 
> {code:java}
> df.select("*").show()
> {code}
> returns empty result too.






[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2018-05-31 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497244#comment-16497244
 ] 

Bryan Cutler commented on SPARK-21187:
--

Hi [~teddy.choi], MapType still needs some work to be done in Arrow before we 
can add the Spark implementation. If you are able to help out on that front, 
that would be great!

> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
>  * -*Date*-
>  * -*Timestamp*-
>  * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map
>  * -*Decimal*-
>  * *Binary* - in pyspark
> Some things to do before closing this out:
>  * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
>  * -Need to add some user docs-
>  * -Make sure Python tests are thorough-
>  * Check into complex type support mentioned in comments by [~leif], should 
> we support multi-indexing?






[jira] [Assigned] (SPARK-24297) Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB

2018-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24297:


Assignee: (was: Apache Spark)

> Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB
> ---
>
> Key: SPARK-24297
> URL: https://issues.apache.org/jira/browse/SPARK-24297
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Shuffle, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
>
> Any network request which does not use stream-to-disk that is sending over 
> 2GB is doomed to fail, so we might as well at least set the default value of 
> spark.maxRemoteBlockSizeFetchToMem to something < 2GB.
> It probably makes sense to set it to something even lower still, but that 
> might require more careful testing; this is a totally safe first step.
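
For context, the property is a size-typed Spark conf that users can already lower by hand; a minimal sketch (the 200m value is illustrative, not the default the PR proposes):

{code:python}
from pyspark import SparkConf

# Stream remote blocks larger than roughly 200 MB to disk instead of buffering them in memory.
conf = SparkConf().set("spark.maxRemoteBlockSizeFetchToMem", "200m")
{code}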






[jira] [Assigned] (SPARK-24297) Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB

2018-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24297:


Assignee: Apache Spark

> Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB
> ---
>
> Key: SPARK-24297
> URL: https://issues.apache.org/jira/browse/SPARK-24297
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Shuffle, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Major
>
> Any network request which does not use stream-to-disk that is sending over 
> 2GB is doomed to fail, so we might as well at least set the default value of 
> spark.maxRemoteBlockSizeFetchToMem to something < 2GB.
> It probably makes sense to set it to something even lower still, but that 
> might require more careful testing; this is a totally safe first step.






[jira] [Commented] (SPARK-24297) Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB

2018-05-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497219#comment-16497219
 ] 

Apache Spark commented on SPARK-24297:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/21474

> Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB
> ---
>
> Key: SPARK-24297
> URL: https://issues.apache.org/jira/browse/SPARK-24297
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Shuffle, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
>
> Any network request which does not use stream-to-disk that is sending over 
> 2GB is doomed to fail, so we might as well at least set the default value of 
> spark.maxRemoteBlockSizeFetchToMem to something < 2GB.
> It probably makes sense to set it to something even lower still, but that 
> might require more careful testing; this is a totally safe first step.






[jira] [Resolved] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-31 Thread Anirudh Ramanathan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan resolved SPARK-24232.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21317
[https://github.com/apache/spark/pull/21317]

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secrets without leaking them in 
> the code, and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates, etc. while 
> talking to other services (jdbc passwords, storage keys, etc.).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code it's always referred to as the env variable [key].
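
The value half of the property is cut off above; the intent is to map an environment variable name to a Kubernetes secret name and key. The value syntax in the sketch below is an assumption to be checked against the linked pull request:

{code:python}
from pyspark import SparkConf

# Assumed value format "secret-name:key"; DB_PASSWORD is the env var exposed to the driver.
conf = SparkConf().set(
    "spark.kubernetes.driver.secretKeyRef.DB_PASSWORD",
    "spark-secret:db-password",
)
{code}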






[jira] [Assigned] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-31 Thread Anirudh Ramanathan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan reassigned SPARK-24232:
--

Assignee: Stavros Kontopoulos

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Assignee: Stavros Kontopoulos
>Priority: Major
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secrets without leaking them in 
> the code, and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates, etc. while 
> talking to other services (jdbc passwords, storage keys, etc.).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code it's always referred to as the env variable [key].






[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-31 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497203#comment-16497203
 ] 

Joseph K. Bradley commented on SPARK-24359:
---

Clarification question: [~falaki] did you mean to say that the CRAN package 
SparkML will be updated for every *minor* release (2.3, 2.4, etc.)?  (I assume 
you did not mean every major release (3.0, 4.0, etc.) since those only happen 
every 2 years or so.)

I'd recommend we follow the same pattern as for the SparkR package: Updates to 
SparkML and SparkR will require official Spark releases, limiting us to 
patching SparkML only when there is a new Spark patch release (2.3.1, 2.3.2, 
etc.).  I feel like that's a lesser evil than the only other option I know of: 
splitting off SparkR and/or SparkML into completely separate projects under 
different Apache or non-Apache oversight.  What do you think?

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will chose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- 

[jira] [Updated] (SPARK-24446) Library path with special characters breaks Spark on YARN

2018-05-31 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-24446:
---
Description: 
When YARN runs the application's main command, it does it like this:

{code}
bash -c "<command>"
{code}

The way Spark injects the library path into that command makes it look like 
this:

{code}
bash -c "LD_LIBRARY_PATH="/foo:/bar:/baz:$LD_LIBRARY_PATH" "
{code}

So that works kinda out of luck, because the concatenation of the strings 
creates a proper final command... except if you have something like a space or 
an ampersand in the library path, in which case all containers will fail with a 
cryptic message like the following:

{noformat}
WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Executor for container 
container_1475411358336_0010_01_02 exited because of a YARN event (e.g., 
pre-emption) and not because of an error in the running job.
{noformat}

And no useful log output.


  was:
When YARN runs the application's main command, it does it like this:

{code}
bash -c "<command>"
{code}

The way Spark injects the library path into that command makes it look like 
this:

{code}
bash -c "LD_LIBRARY_PATH="/foo:/bar:/baz:$LD_LIBRARY_PATH" "
{code}

So that works kinda out of luck, because the concatenation of the strings 
creates a proper final command... except if you have something like a space of 
an ampersand in the library path, in which case all containers will fail with a 
cryptic message like the following:

{noformat}
WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Executor for container 
container_1475411358336_0010_01_02 exited because of a YARN event (e.g., 
pre-emption) and not because of an error in the running job.
{noformat}

And no useful log output.



> Library path with special characters breaks Spark on YARN
> -
>
> Key: SPARK-24446
> URL: https://issues.apache.org/jira/browse/SPARK-24446
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> When YARN runs the application's main command, it does it like this:
> {code}
> bash -c "<command>"
> {code}
> The way Spark injects the library path into that command makes it look like 
> this:
> {code}
> bash -c "LD_LIBRARY_PATH="/foo:/bar:/baz:$LD_LIBRARY_PATH"  command>"
> {code}
> So that works kinda out of luck, because the concatenation of the strings 
> creates a proper final command... except if you have something like a space 
> or an ampersand in the library path, in which case all containers will fail 
> with a cryptic message like the following:
> {noformat}
> WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Executor for 
> container container_1475411358336_0010_01_02 exited because of a YARN 
> event (e.g., pre-emption) and not because of an error in the running job.
> {noformat}
> And no useful log output.






[jira] [Created] (SPARK-24446) Library path with special characters breaks Spark on YARN

2018-05-31 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-24446:
--

 Summary: Library path with special characters breaks Spark on YARN
 Key: SPARK-24446
 URL: https://issues.apache.org/jira/browse/SPARK-24446
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin


When YARN runs the application's main command, it does it like this:

{code}
bash -c "<command>"
{code}

The way Spark injects the library path into that command makes it look like 
this:

{code}
bash -c "LD_LIBRARY_PATH="/foo:/bar:/baz:$LD_LIBRARY_PATH" "
{code}

So that works kinda out of luck, because the concatenation of the strings 
creates a proper final command... except if you have something like a space of 
an ampersand in the library path, in which case all containers will fail with a 
cryptic message like the following:

{noformat}
WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Executor for container 
container_1475411358336_0010_01_02 exited because of a YARN event (e.g., 
pre-emption) and not because of an error in the running job.
{noformat}

And no useful log output.







[jira] [Commented] (SPARK-21896) Stack Overflow when window function nested inside aggregate function

2018-05-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497160#comment-16497160
 ] 

Apache Spark commented on SPARK-21896:
--

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/21473

> Stack Overflow when window function nested inside aggregate function
> 
>
> Key: SPARK-21896
> URL: https://issues.apache.org/jira/browse/SPARK-21896
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Luyao Yang
>Priority: Minor
>
> A minimal example: with the following simple test data
> {noformat}
> >>> df = spark.createDataFrame([(1, 2), (1, 3), (2, 4)], ['a', 'b'])
> >>> df.show()
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> |  1|  3|
> |  2|  4|
> +---+---+
> {noformat}
> This works: 
> {noformat}
> >>> w = Window().orderBy('b')
> >>> result = (df.select(F.rank().over(w).alias('rk'))
> ....groupby()
> ....agg(F.max('rk'))
> ...  )
> >>> result.show()
> +---+
> |max(rk)|
> +---+
> |  3|
> +---+
> {noformat}
> But this equivalent gives an error. Note that the error is thrown right when 
> the operation is defined, not when an action is called later:
> {noformat}
> >>> result = (df.groupby()
> ....agg(F.max(F.rank().over(w)))
> ...  )
> Traceback (most recent call last):
>   File 
> "/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", 
> line 2885, in run_code
> exec(code_obj, self.user_global_ns, self.user_ns)
>   File "", line 2, in 
> .agg(F.max(F.rank().over(w)))
>   File "/usr/lib/spark/python/pyspark/sql/group.py", line 91, in agg
> _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
> line 1133, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
> 319, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o10789.agg.
> : java.lang.StackOverflowError
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:55)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:400)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:381)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:277)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$71.apply(Analyzer.scala:1688)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$71.apply(Analyzer.scala:1724)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1687)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$26.applyOrElse(Analyzer.scala:1825)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$26.applyOrElse(Analyzer.scala:1800)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at 
> 

[jira] [Commented] (SPARK-24442) Add configuration parameter to adjust the numbers of records and the charters per row before truncation when a user runs.show()

2018-05-31 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497087#comment-16497087
 ] 

Reynold Xin commented on SPARK-24442:
-

Actually a pretty good idea. I've often wished there were a way to change it.

 

I'd name it `spark.sql.show.defaultNumRows` though.

 

> Add configuration parameter to adjust the numbers of records and the charters 
> per row before truncation when a user runs.show()
> ---
>
> Key: SPARK-24442
> URL: https://issues.apache.org/jira/browse/SPARK-24442
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Andrew K Long
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: spark-adjustable-display-size.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently the number of characters displayed when a user runs the .show() 
> function on a data frame is hard coded. The current default is too small when 
> used with wider console widths.  This fix will add two parameters.
>  
> parameter: "spark.show.default.number.of.rows" default: "20"
> parameter: "spark.show.default.truncate.characters.per.column" default: "20"
>  
> This change will be backwards compatible and will not break any existing 
> functionality nor change the default display characteristics.
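For context, a minimal PySpark sketch of today's behavior (assuming an existing 
SparkSession named spark); the only way to widen the output right now is to pass the 
arguments explicitly on every call:

{code:python}
df = spark.range(0, 100).selectExpr("id", "repeat('x', 40) AS wide_col")

df.show()                    # hard-coded defaults: 20 rows, values cut at 20 chars
df.show(30, truncate=False)  # today's workaround: pass the arguments every time

# The proposal above is to let configuration supply these defaults instead
# (the exact parameter names are still under discussion).
{code}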



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24445) Schema in json format for from_json in SQL

2018-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24445:


Assignee: Apache Spark

> Schema in json format for from_json in SQL
> --
>
> Key: SPARK-24445
> URL: https://issues.apache.org/jira/browse/SPARK-24445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> In Spark 2.3, schema for the from_json function can be specified in JSON 
> format in Scala and Python but not in SQL. In SQL it is impossible to specify a map 
> type, for example, because the SQL DDL parser can handle only struct types. We need to 
> support schemas in JSON format, as has already been implemented 
> [there|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3225-L3229]:
> {code:scala}
> val dataType = try {
>   DataType.fromJson(schema)
> } catch {
>   case NonFatal(_) => StructType.fromDDL(schema)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24445) Schema in json format for from_json in SQL

2018-05-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497084#comment-16497084
 ] 

Apache Spark commented on SPARK-24445:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21472

> Schema in json format for from_json in SQL
> --
>
> Key: SPARK-24445
> URL: https://issues.apache.org/jira/browse/SPARK-24445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> In Spark 2.3, schema for the from_json function can be specified in JSON 
> format in Scala and Python but not in SQL. In SQL it is impossible to specify a map 
> type, for example, because the SQL DDL parser can handle only struct types. We need to 
> support schemas in JSON format, as has already been implemented 
> [there|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3225-L3229]:
> {code:scala}
> val dataType = try {
>   DataType.fromJson(schema)
> } catch {
>   case NonFatal(_) => StructType.fromDDL(schema)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24445) Schema in json format for from_json in SQL

2018-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24445:


Assignee: (was: Apache Spark)

> Schema in json format for from_json in SQL
> --
>
> Key: SPARK-24445
> URL: https://issues.apache.org/jira/browse/SPARK-24445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> In Spark 2.3, schema for the from_json function can be specified in JSON 
> format in Scala and Python but not in SQL. In SQL it is impossible to specify a map 
> type, for example, because the SQL DDL parser can handle only struct types. We need to 
> support schemas in JSON format, as has already been implemented 
> [there|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3225-L3229]:
> {code:scala}
> val dataType = try {
>   DataType.fromJson(schema)
> } catch {
>   case NonFatal(_) => StructType.fromDDL(schema)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24445) Schema in json format for from_json in SQL

2018-05-31 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497075#comment-16497075
 ] 

Maxim Gekk commented on SPARK-24445:


I am working on the ticket at the moment.

> Schema in json format for from_json in SQL
> --
>
> Key: SPARK-24445
> URL: https://issues.apache.org/jira/browse/SPARK-24445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> In Spark 2.3, schema for the from_json function can be specified in JSON 
> format in Scala and Python but not in SQL. In SQL it is impossible to specify a map 
> type, for example, because the SQL DDL parser can handle only struct types. We need to 
> support schemas in JSON format, as has already been implemented 
> [there|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3225-L3229]:
> {code:scala}
> val dataType = try {
>   DataType.fromJson(schema)
> } catch {
>   case NonFatal(_) => StructType.fromDDL(schema)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24445) Schema in json format for from_json in SQL

2018-05-31 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24445:
--

 Summary: Schema in json format for from_json in SQL
 Key: SPARK-24445
 URL: https://issues.apache.org/jira/browse/SPARK-24445
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


In Spark 2.3, schema for the from_json function can be specified in JSON format 
in Scala and Python but not in SQL. In SQL it is impossible to specify a map type, for 
example, because the SQL DDL parser can handle only struct types. We need to support 
schemas in JSON format, as has already been implemented 
[there|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3225-L3229]:
{code:scala}
val dataType = try {
  DataType.fromJson(schema)
} catch {
  case NonFatal(_) => StructType.fromDDL(schema)
}
{code}
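For illustration, a small PySpark sketch (assuming an existing SparkSession named spark) 
of the gap this proposal closes: the DataFrame API already accepts a MapType schema, and 
every DataType has a JSON form, which is what SQL's from_json would additionally accept:

{code:python}
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import MapType, StringType, IntegerType

df = spark.createDataFrame([('{"a": 1, "b": 2}',)], ["js"])

# The Scala/Python APIs accept a MapType schema directly.
df.select(from_json(col("js"), MapType(StringType(), IntegerType()))).show()

# Every DataType has a JSON representation; this is the string that SQL's
# from_json would accept once JSON-format schemas are supported, since the
# DDL parser only understands struct types.
print(MapType(StringType(), IntegerType()).json())
{code}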




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-05-31 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497070#comment-16497070
 ] 

Ted Yu commented on SPARK-18057:


I tend to agree with Cody.

Just wondering if other people would accept the kafka-0-10-sql module referencing 
the Kafka 2.0.0 release.

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24444) Improve pandas_udf GROUPED_MAP docs to explain column assignment

2018-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2:


Assignee: Apache Spark  (was: Bryan Cutler)

> Improve pandas_udf GROUPED_MAP docs to explain column assignment
> 
>
> Key: SPARK-2
> URL: https://issues.apache.org/jira/browse/SPARK-2
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Major
>
> There have been several bugs regarding this and a clean solution still 
> changes some behavior.  Until this can be resolved, improve docs to explain 
> that columns are assigned by position.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24444) Improve pandas_udf GROUPED_MAP docs to explain column assignment

2018-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2:


Assignee: Bryan Cutler  (was: Apache Spark)

> Improve pandas_udf GROUPED_MAP docs to explain column assignment
> 
>
> Key: SPARK-2
> URL: https://issues.apache.org/jira/browse/SPARK-2
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> There have been several bugs regarding this and a clean solution still 
> changes some behavior.  Until this can be resolved, improve docs to explain 
> that columns are assigned by position.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24444) Improve pandas_udf GROUPED_MAP docs to explain column assignment

2018-05-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497032#comment-16497032
 ] 

Apache Spark commented on SPARK-2:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/21471

> Improve pandas_udf GROUPED_MAP docs to explain column assignment
> 
>
> Key: SPARK-2
> URL: https://issues.apache.org/jira/browse/SPARK-2
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> There have been several bugs regarding this and a clean solution still 
> changes some behavior.  Until this can be resolved, improve docs to explain 
> that columns are assigned by position.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-31 Thread Anirudh Ramanathan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497014#comment-16497014
 ] 

Anirudh Ramanathan commented on SPARK-24434:


Open to suggestions on what could be intuitive in this particular case. Perhaps 
there's also precedent for multi-line low-level configuration in other parts of 
Spark. 

cc/ [~felixcheung] 

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-31 Thread Anirudh Ramanathan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497009#comment-16497009
 ] 

Anirudh Ramanathan edited comment on SPARK-24434 at 5/31/18 6:54 PM:
-

I was basing my suggestion of JSON on allowing specifying JSON strings inline 
as configuration, but I guess one could also specify a YAML file with the 
template and have spark configuration point to that file. [~skonto], you make a 
good point, it is another configuration mechanism that people may have to 
learn. This decision should be based more on UX and consistency with what Spark 
users expect in general. [~eje], to your point, I think we could support both 
if needed, but it might be prudent to find the one that's more intuitive to 
users and implement that first.

Sidenote: There's also [jsonpath|https://kubernetes.io/docs/reference/kubectl/] 
that kubectl supports but that could be overkill here.


was (Author: foxish):
Good point. I was basing my suggestion of JSON on allowing specifying JSON 
strings inline as configuration, but I guess one could also specify a YAML file 
with the template and have spark configuration point to that file. [~skonto], 
you make a good point, it is another configuration mechanism that people may 
have to learn. This decision should be based more on UX and consistency with 
what Spark users expect in general. [~eje], to your point, I think we could 
support both if needed, but it might be prudent to find the one that's more 
intuitive to users and implement that first.

Sidenote: There's also [jsonpath|https://kubernetes.io/docs/reference/kubectl/] 
that kubectl supports but that could be overkill here.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-31 Thread Anirudh Ramanathan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497009#comment-16497009
 ] 

Anirudh Ramanathan commented on SPARK-24434:


Good point. I was basing my suggestion of JSON on allowing specifying JSON 
strings inline as configuration, but I guess one could also specify a YAML file 
with the template and have spark configuration point to that file. [~skonto], 
you make a good point, it is another configuration mechanism that people may 
have to learn. This decision should be based more on UX and consistency with 
what Spark users expect in general. [~eje], to your point, I think we could 
support both if needed, but it might be prudent to find the one that's more 
intuitive to users and implement that first.

Sidenote: There's also [jsonpath|https://kubernetes.io/docs/reference/kubectl/] 
that kubectl supports but that could be overkill here.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-31 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497000#comment-16497000
 ] 

Stavros Kontopoulos edited comment on SPARK-24434 at 5/31/18 6:49 PM:
--

[~foxish] Will JSON be exposed to the user? IMHO, if that is the case it seems a 
bit weird; people are used to YAML with k8s and Java properties with Spark. Is 
there an option to keep it simpler?


was (Author: skonto):
[~foxish] Will JSON be exposed to the user? IMHO, if that is the case it seems a 
bit weird; people are used to YAML with k8s and Java properties with Spark. 

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24444) Improve pandas_udf GROUPED_MAP docs to explain column assignment

2018-05-31 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-2:


 Summary: Improve pandas_udf GROUPED_MAP docs to explain column 
assignment
 Key: SPARK-2
 URL: https://issues.apache.org/jira/browse/SPARK-2
 Project: Spark
  Issue Type: Documentation
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler
Assignee: Bryan Cutler


There have been several bugs regarding this and a clean solution still changes 
some behavior.  Until this can be resolved, improve docs to explain that 
columns are assigned by position.
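A short PySpark sketch of the behavior the docs should call out (assuming an existing 
SparkSession named spark; the UDF and column names are made up): the returned 
pandas.DataFrame is matched to the declared schema by position, not by column name:

{code:python}
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame([(1, 1.0, 10.0), (1, 2.0, 20.0)], ["id", "a", "b"])

@pandas_udf("a double, b double", PandasUDFType.GROUPED_MAP)
def swap(pdf):
    # Columns are assigned to the declared schema by position, not by name,
    # so returning them as (b, a) silently puts the b values under the a header.
    return pdf[["b", "a"]]

df.groupby("id").apply(swap).show()
{code}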



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-31 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497000#comment-16497000
 ] 

Stavros Kontopoulos commented on SPARK-24434:
-

[~foxish] Will JSON be exposed to the user? IMHO, if that is the case it seems a 
bit weird; people are used to YAML with k8s and Java properties with Spark. 

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-31 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496991#comment-16496991
 ] 

Erik Erlandson commented on SPARK-24434:


[~foxish] is there a technical (or ux) argument for json, versus yaml (or 
allowing both)?

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23900) format_number udf should take user specifed format as argument

2018-05-31 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23900.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21010
[https://github.com/apache/spark/pull/21010]

> format_number udf should take user specifed format as argument
> --
>
> Key: SPARK-23900
> URL: https://issues.apache.org/jira/browse/SPARK-23900
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> https://issues.apache.org/jira/browse/HIVE-5370
> {noformat}
> Currently, format_number udf formats the number to #,###,###.##, but it 
> should also take a user specified format as optional input.
> {noformat}
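For reference, a minimal PySpark sketch (assuming an existing SparkSession named spark; 
the format pattern below is only illustrative of the Hive-style patterns this change 
enables):

{code:python}
from pyspark.sql.functions import format_number

df = spark.createDataFrame([(12332.123456,)], ["x"])

# Existing behavior: the second argument is the number of decimal places and
# the output always uses the #,###,###.## grouping, e.g. 12,332.123.
df.select(format_number("x", 3)).show()

# With this change (fix version 2.4.0), SQL's format_number can also take a
# user-specified format string as the second argument, as in HIVE-5370:
spark.sql("SELECT format_number(12332.123456, '##.###')").show()
{code}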



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18981) The last job hung when speculation is on

2018-05-31 Thread John Zhuge (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496990#comment-16496990
 ] 

John Zhuge commented on SPARK-18981:


Fixed by SPARK-11334.

> The last job hung when speculation is on
> 
>
> Key: SPARK-18981
> URL: https://issues.apache.org/jira/browse/SPARK-18981
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: spark2.0.2
> hadoop2.5.0
>Reporter: roncenzhao
>Priority: Critical
> Attachments: Test.scala, job_hang.png, run_scala.sh
>
>
> CONF:
> spark.speculation   true
> spark.dynamicAllocation.minExecutors0
> spark.executor.cores   2
> When I run the following app, the bug will trigger.
> ```
> sc.runJob(job1)
> sleep(100s)
> sc.runJob(job2) // the job2 will hang and never be scheduled
> ```
> The triggering condition is described as follows:
> condition1: During the sleep time, the executors will be released and the 
> number of executors will drop to zero a few seconds later. The #numExecutorsTarget in 
> 'ExecutorAllocationManager' will be 0.
> condition2: In 'ExecutorAllocationListener.onTaskEnd()', the numRunningTasks 
> will be negative while job1's tasks are finishing. 
> condition3: Job2 has only one task.
> result:
> In the method 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', 
> we will calculate #maxNeeded in 'maxNumExecutorsNeeded()'. Obviously, 
> #numRunningOrPendingTasks will be negative and the #maxNeeded will be 0 or 
> negative. So the 'ExecutorAllocationManager' will not request containers from 
> YARN, and the app will hang.
> In the attachments, submitting the app by 'run_scala.sh' will lead to the 
> 'hang' problem as the 'job_hang.png' shows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23900) format_number udf should take user specifed format as argument

2018-05-31 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23900:
-

Assignee: Yuming Wang

> format_number udf should take user specifed format as argument
> --
>
> Key: SPARK-23900
> URL: https://issues.apache.org/jira/browse/SPARK-23900
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-5370
> {noformat}
> Currently, format_number udf formats the number to #,###,###.##, but it 
> should also take a user specified format as optional input.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24397) Add TaskContext.getLocalProperties in Python

2018-05-31 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-24397.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21437
[https://github.com/apache/spark/pull/21437]

> Add TaskContext.getLocalProperties in Python
> 
>
> Key: SPARK-24397
> URL: https://issues.apache.org/jira/browse/SPARK-24397
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-31 Thread Yinan Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496967#comment-16496967
 ] 

Yinan Li commented on SPARK-24434:
--

[~foxish] that sounds like the approach to go. 

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-31 Thread Anirudh Ramanathan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496962#comment-16496962
 ] 

Anirudh Ramanathan commented on SPARK-24434:


The way several custom APIs have done this before is having a PodTemplate field 
that uses the Kubernetes API to provide a rich type-safe interface to add 
arbitrary modifications to pods. It's typically easier to do that with golang 
structs, but we should investigate whether, from openapi, there's a way for the Java 
client to expose the same. Given that we will want it to map back to 
stringified configuration, supporting JSON strings seems like a good choice 
there. 
 
So, the flow I see is JSON strings converted into valid (type-checked) and 
supported PodTemplate specifications that are eventually added to driver and 
executor pods.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24443) comparison should accept structurally-equal types

2018-05-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496956#comment-16496956
 ] 

Apache Spark commented on SPARK-24443:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21470

> comparison should accept structurally-equal types
> -
>
> Key: SPARK-24443
> URL: https://issues.apache.org/jira/browse/SPARK-24443
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24443) comparison should accept structurally-equal types

2018-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24443:


Assignee: Apache Spark  (was: Wenchen Fan)

> comparison should accept structurally-equal types
> -
>
> Key: SPARK-24443
> URL: https://issues.apache.org/jira/browse/SPARK-24443
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24443) comparison should accept structurally-equal types

2018-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24443:


Assignee: Wenchen Fan  (was: Apache Spark)

> comparison should accept structurally-equal types
> -
>
> Key: SPARK-24443
> URL: https://issues.apache.org/jira/browse/SPARK-24443
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-05-31 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23874:
-
Description: 
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported as input to pyarrow ARROW-2141
 * Java has a common interface for reset to clean up complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645

 

 

  was:
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support
 * Bug fix related to array serialization 
[ARROW-1973|https://issues.apache.org/jira/browse/ARROW-1973]
 * Python2 str will be made into an Arrow string instead of bytes 
[ARROW-2101|https://issues.apache.org/jira/browse/ARROW-2101]
 * Python bytearrays are supported as input to pyarrow 
[ARROW-2141|https://issues.apache.org/jira/browse/ARROW-2141]
 * Java has a common interface for reset to clean up complex vectors in Spark 
ArrowWriter [ARROW-1962|https://issues.apache.org/jira/browse/ARROW-1962]
 * Cleanup pyarrow type equality checks 
[ARROW-2423|https://issues.apache.org/jira/browse/ARROW-2423]

 

 


> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has a common interface for reset to clean up complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer

2018-05-31 Thread Misha Dmitriev (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Misha Dmitriev updated SPARK-24356:
---
Attachment: dup-file-strings-details.png

> Duplicate strings in File.path managed by FileSegmentManagedBuffer
> --
>
> Key: SPARK-24356
> URL: https://issues.apache.org/jira/browse/SPARK-24356
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Misha Dmitriev
>Priority: Major
> Attachments: SPARK-24356.01.patch, dup-file-strings-details.png
>
>
> I recently analyzed a heap dump of Yarn Node Manager that was suffering from 
> high GC pressure due to high object churn. Analysis was done with the jxray 
> tool ([www.jxray.com)|http://www.jxray.com)/] that checks a heap dump for a 
> number of well-known memory issues. One problem that it found in this dump is 
> 19.5% of memory wasted due to duplicate strings. Of these duplicates, more 
> than a half come from {{FileInputStream.path}} and {{File.path}}. All the 
> {{FileInputStream}} objects that JXRay shows are garbage - looks like they 
> are used for a very short period and then discarded (I guess there is a 
> separate question of whether that's a good pattern). But {{File}} instances 
> are traceable to 
> {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. Here 
> is the full reference chain:
>  
> {code:java}
> ↖java.io.File.path
> ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
> ↖{j.u.ArrayList}
> ↖j.u.ArrayList$Itr.this$0
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
> {code}
>  
> Values of these {{File.path}}'s and {{FileInputStream.path}}'s look very 
> similar, so I think {{FileInputStream}}s are generated by the 
> {{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely 
> come from 
> [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
>  
> To avoid duplicate strings in {{File.path}}'s in this case, it is suggested 
> that in the above code we create a File with a complete, normalized pathname, 
> that has been already interned. This will prevent the code inside 
> {{java.io.File}} from modifying this string, and thus it will use the 
> interned copy, and will pass it to FileInputStream. Essentially the current 
> line
> {code:java}
> return new File(new File(localDir, String.format("%02x", subDirId)), 
> filename);{code}
> should be replaced with something like
> {code:java}
> String pathname = localDir + File.separator + String.format(...) + 
> File.separator + filename;
> pathname = fileSystem.normalize(pathname).intern();
> return new File(pathname);{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24443) comparison should accept structurally-equal types

2018-05-31 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-24443:
---

 Summary: comparison should accept structurally-equal types
 Key: SPARK-24443
 URL: https://issues.apache.org/jira/browse/SPARK-24443
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-05-31 Thread Ismael Juma (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496924#comment-16496924
 ] 

Ismael Juma commented on SPARK-18057:
-

Apache Kafka 2.0.0 will include KIP-266 and KAFKA-4879 has also been fixed. It 
would be great for Spark to transition to clients 2.0.0 once it's released. The 
code remains compatible, but Java 8 is now required.

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24437) Memory leak in UnsafeHashedRelation

2018-05-31 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496910#comment-16496910
 ] 

Marco Gaido commented on SPARK-24437:
-

Reproducing the issue is quite easy: you just need to run queries with 
broadcast joins, i.e. queries with a join where one of the tables involved is 
small. I am not sure which HashMap you are referring to, though, sorry.
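A minimal PySpark sketch of that reproduction (assuming an existing SparkSession named 
spark; the table sizes are arbitrary, the point is only that the small side is broadcast 
so an UnsafeHashedRelation is built for every query):

{code:python}
from pyspark.sql.functions import broadcast

big = spark.range(0, 1000000).withColumnRenamed("id", "k")
small = spark.range(0, 100).withColumnRenamed("id", "k")

# Each run builds an UnsafeHashedRelation for the broadcast side; repeating
# this in a loop on a long-running server is the pattern described above.
for _ in range(10):
    big.join(broadcast(small), "k").count()
{code}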

> Memory leak in UnsafeHashedRelation
> ---
>
> Key: SPARK-24437
> URL: https://issues.apache.org/jira/browse/SPARK-24437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: gagan taneja
>Priority: Critical
> Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png, Screen Shot 
> 2018-05-30 at 2.07.22 PM.png
>
>
> There seems to be a memory leak with 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation
> We have a long running instance of STS.
> With each query execution requiring Broadcast Join, UnsafeHashedRelation is 
> getting added for cleanup in ContextCleaner. This reference of 
> UnsafeHashedRelation is being held at some other Collection and not becoming 
> eligible for GC and because of this ContextCleaner is not able to clean it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24442) Add configuration parameter to adjust the numbers of records and the charters per row before truncation when a user runs.show()

2018-05-31 Thread Andrew K Long (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496868#comment-16496868
 ] 

Andrew K Long commented on SPARK-24442:
---

Hey Sean,

 

Thanks for commenting!

 

"There are already method arguments for truncation and max rows, so I don't 
know if it's worth the complexity to alter defaults with yet another config 
param."

 

While those parameters exist, there's no easy way to adjust the default 
width.  I have several tables that always have 22 characters of data and 22 
rows of data, so to actually get the data from the console I always have to add 
the parameters.

 

> df.show

vs

> df.show(30,false)

"The naming convention doesn't quite match other spark params too."

 

I'm totally open to a better naming convention; it didn't quite seem to fit 
with most of the other parameters.

 

"I wonder if there is any way to detect the terminal width with any 
reliability, even if not in all cases? like how commonly is COLUMNS set in a 
shell?"

 

I did a bit of research on this.  According to Stack Overflow there's no 
reliable cross-platform way of doing this.

([https://stackoverflow.com/questions/1286461/can-i-find-the-console-width-with-java?utm_medium=organic_source=google_rich_qa_campaign=google_rich_qa)]

 

There is a library, JLine2, that claims to be able to do this, but it would 
require adding a whole new dependency, which seems overkill when an optional 
parameter will do the job just fine.  

 

> Add configuration parameter to adjust the numbers of records and the charters 
> per row before truncation when a user runs.show()
> ---
>
> Key: SPARK-24442
> URL: https://issues.apache.org/jira/browse/SPARK-24442
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Andrew K Long
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: spark-adjustable-display-size.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently the number of characters displayed when a user runs the .show() 
> function on a data frame is hard coded. The current default is too small when 
> used with wider console widths.  This fix will add two parameters.
>  
> parameter: "spark.show.default.number.of.rows" default: "20"
> parameter: "spark.show.default.truncate.characters.per.column" default: "20"
>  
> This change will be backwards compatible and will not break any existing 
> functionality nor change the default display characteristics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24442) Add configuration parameter to adjust the numbers of records and the charters per row before truncation when a user runs.show()

2018-05-31 Thread Andrew K Long (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496868#comment-16496868
 ] 

Andrew K Long edited comment on SPARK-24442 at 5/31/18 5:09 PM:


Hey Sean,

 

Thanks for commenting!

 

"There are already method arguments for truncation and max rows, so I don't 
know if it's worth the complexity to alter defaults with yet another config 
param."

 

While those parameters exist, there's no easy way to adjust the default 
width.  I have several tables that always have 22 characters of data and 22 
rows of data, so to actually get the data from the console I always have to add 
the parameters.

 

> df.show

vs

> df.show(30,false)

"The naming convention doesn't quite match other spark params too."

 

I'm totally open to a better naming convention; it didn't quite seem to fit 
with most of the other parameters.

 

"I wonder if there is any way to detect the terminal width with any 
reliability, even if not in all cases? like how commonly is COLUMNS set in a 
shell?"

 

I did a bit of research on this.  According to Stack Overflow there's no 
reliable cross-platform way of doing this.

[https://stackoverflow.com/questions/1286461/can-i-find-the-console-width-with-java?utm_medium=organic_source=google_rich_qa_campaign=google_rich_qa|https://stackoverflow.com/questions/1286461/can-i-find-the-console-width-with-java?utm_medium=organic_source=google_rich_qa_campaign=google_rich_qa)]

 

There is a library, JLine2, that claims to be able to do this, but it would 
require adding a whole new dependency, which seems overkill when an optional 
parameter will do the job just fine.  

 


was (Author: andrewklong):
Hey Sean,

 

Thanks for commenting!

 

"There are already method arguments for truncation and max rows, so I don't 
know if it's worth the complexity to alter defaults with yet another config 
param."

 

While those parameters exist, there's no easy way to adjust the default 
width.  I have several tables that always have 22 characters of data and 22 
rows of data, so to actually get the data from the console I always have to add 
the parameters.

 

> df.show

vs

> df.show(30,false)

"The naming convention doesn't quite match other spark params too."

 

I'm totally open to a better naming convention; it didn't quite seem to fit 
with most of the other parameters.

 

"I wonder if there is any way to detect the terminal width with any 
reliability, even if not in all cases? like how commonly is COLUMNS set in a 
shell?"

 

I did a bit of research on this.  According to Stack Overflow there's no 
reliable cross-platform way of doing this.

([https://stackoverflow.com/questions/1286461/can-i-find-the-console-width-with-java?utm_medium=organic_source=google_rich_qa_campaign=google_rich_qa)]

 

There is a library, JLine2, that claims to be able to do this, but it would 
require adding a whole new dependency, which seems overkill when an optional 
parameter will do the job just fine.  

 

> Add configuration parameter to adjust the numbers of records and the charters 
> per row before truncation when a user runs.show()
> ---
>
> Key: SPARK-24442
> URL: https://issues.apache.org/jira/browse/SPARK-24442
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Andrew K Long
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: spark-adjustable-display-size.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently the number of characters displayed when a user runs the .show() 
> function on a data frame is hard coded. The current default is too small when 
> used with wider console widths.  This fix will add two parameters.
>  
> parameter: "spark.show.default.number.of.rows" default: "20"
> parameter: "spark.show.default.truncate.characters.per.column" default: "20"
>  
> This change will be backwards compatible and will not break any existing 
> functionality nor change the default display characteristics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24437) Memory leak in UnsafeHashedRelation

2018-05-31 Thread gagan taneja (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496866#comment-16496866
 ] 

gagan taneja commented on SPARK-24437:
--

No dynamic allocation. Also, this is an issue with the driver, where the broadcast is not 
garbage collected.

By any chance, do you know which other collection is holding a reference to this 
broadcast? Also, do you have a simple way of reproducing this issue?

> Memory leak in UnsafeHashedRelation
> ---
>
> Key: SPARK-24437
> URL: https://issues.apache.org/jira/browse/SPARK-24437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: gagan taneja
>Priority: Critical
> Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png, Screen Shot 
> 2018-05-30 at 2.07.22 PM.png
>
>
> There seems to be a memory leak with 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation
> We have a long running instance of STS.
> With each query execution requiring Broadcast Join, UnsafeHashedRelation is 
> getting added for cleanup in ContextCleaner. This reference of 
> UnsafeHashedRelation is being held at some other Collection and not becoming 
> eligible for GC and because of this ContextCleaner is not able to clean it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24381) Improve Unit Test Coverage of NOT IN subqueries

2018-05-31 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-24381:
---
Fix Version/s: (was: 2.3.1)
   2.4.0

> Improve Unit Test Coverage of NOT IN subqueries
> ---
>
> Key: SPARK-24381
> URL: https://issues.apache.org/jira/browse/SPARK-24381
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Miles Yucht
>Priority: Major
> Fix For: 2.4.0
>
>
> Today, the unit test coverage for NOT IN queries in SubquerySuite is somewhat 
> lacking. There are a couple test cases that exist, but it isn't necessarily 
> clear that those tests cover all of the subcomponents of null-aware anti 
> joins, i.e. where the subquery returns a null value, if specific columns of 
> either relation are null, etc. Also, it is somewhat difficult for a newcomer 
> to understand the intended behavior of a null-aware anti join without great 
> effort. We should make sure we have proper coverage as well as improve the 
> documentation of this particular subquery, especially with respect to null 
> behavior.
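For reference, the null behavior at the heart of this is easy to demonstrate. A minimal sketch, 
assuming two hypothetical single-column tables t1 and t2 (not from this ticket): once the NOT IN 
subquery produces a NULL, the predicate can never evaluate to TRUE, so the outer query returns no rows.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("notInNullDemo").getOrCreate()
import spark.implicits._

// t1(a) = {1, 2}; t2(b) = {1, NULL}
Seq[Option[Int]](Some(1), Some(2)).toDF("a").createOrReplaceTempView("t1")
Seq[Option[Int]](Some(1), None).toDF("b").createOrReplaceTempView("t2")

// Because t2 contains a NULL, "a NOT IN (SELECT b FROM t2)" is never TRUE:
// it is FALSE for a = 1 and NULL for a = 2, so no rows come back.
spark.sql("SELECT a FROM t1 WHERE a NOT IN (SELECT b FROM t2)").show()
{code}

Tests along these lines, plus the cases where columns of either relation are null, are the kind 
of coverage being asked for here.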



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24381) Improve Unit Test Coverage of NOT IN subqueries

2018-05-31 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-24381:
---
Fix Version/s: (was: 2.3.2)
   2.3.1

> Improve Unit Test Coverage of NOT IN subqueries
> ---
>
> Key: SPARK-24381
> URL: https://issues.apache.org/jira/browse/SPARK-24381
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Miles Yucht
>Priority: Major
> Fix For: 2.3.1
>
>
> Today, the unit test coverage for NOT IN queries in SubquerySuite is somewhat 
> lacking. There are a couple test cases that exist, but it isn't necessarily 
> clear that those tests cover all of the subcomponents of null-aware anti 
> joins, i.e. where the subquery returns a null value, if specific columns of 
> either relation are null, etc. Also, it is somewhat difficult for a newcomer 
> to understand the intended behavior of a null-aware anti join without great 
> effort. We should make sure we have proper coverage as well as improve the 
> documentation of this particular subquery, especially with respect to null 
> behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24414) Stages page doesn't show all task attempts when failures

2018-05-31 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24414:
--

Assignee: Marcelo Vanzin

> Stages page doesn't show all task attempts when failures
> 
>
> Key: SPARK-24414
> URL: https://issues.apache.org/jira/browse/SPARK-24414
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Marcelo Vanzin
>Priority: Critical
> Fix For: 2.3.1, 2.4.0
>
>
> If you have task failures, the StagePage doesn't render all the task attempts 
> properly. It seems to make the table the size of the total number of 
> successful tasks rather than including all the failed tasks.
> Even though the table size is smaller, if you sort by various columns you can 
> see all the tasks are actually there; it just seems the size of the table is 
> wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24414) Stages page doesn't show all task attempts when failures

2018-05-31 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24414.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 21457
[https://github.com/apache/spark/pull/21457]

> Stages page doesn't show all task attempts when failures
> 
>
> Key: SPARK-24414
> URL: https://issues.apache.org/jira/browse/SPARK-24414
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Marcelo Vanzin
>Priority: Critical
> Fix For: 2.4.0, 2.3.1
>
>
> If you have task failures, the StagePage doesn't render all the task attempts 
> properly. It seems to make the table the size of the total number of 
> successful tasks rather than including all the failed tasks.
> Even though the table size is smaller, if you sort by various columns you can 
> see all the tasks are actually there; it just seems the size of the table is 
> wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()

2018-05-31 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496850#comment-16496850
 ] 

Sean Owen commented on SPARK-24442:
---

There are already method arguments for truncation and max rows, so I don't know 
if it's worth the complexity to alter defaults with yet another config param. 
The naming convention doesn't quite match other Spark params either.

You're right about truncation. If anything it could be yet another method arg, 
but that too gets confusing. I wonder if there is any way to detect the 
terminal width with any reliability, even if not in all cases? Like, how 
commonly is COLUMNS set in a shell?

I suppose I could see the argument for a new parameter that lets frameworks 
actively set the actual width on behalf of the user, then.
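For context, a minimal sketch of the per-call arguments referred to above, assuming the 
Dataset.show overloads available in Spark 2.3 (the sample column is made up):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("showArgs").getOrCreate()
import spark.implicits._

val df = Seq(("x" * 60, 1), ("y" * 60, 2)).toDF("long_text", "id")

df.show()           // defaults: at most 20 rows, cells truncated to 20 characters
df.show(50, false)  // at most 50 rows, no truncation
df.show(50, 100)    // at most 50 rows, cells truncated to 100 characters
{code}

The proposal in this ticket is about changing the defaults behind those calls rather than 
adding new call sites.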

> Add configuration parameter to adjust the number of records and the characters 
> per row before truncation when a user runs .show()
> ---
>
> Key: SPARK-24442
> URL: https://issues.apache.org/jira/browse/SPARK-24442
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Andrew K Long
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: spark-adjustable-display-size.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently the number of characters displayed when a user runs the .show() 
> function on a data frame is hard coded. The current default is too small when 
> used with wider console widths.  This fix will add two parameters.
>  
> parameter: "spark.show.default.number.of.rows" default: "20"
> parameter: "spark.show.default.truncate.characters.per.column" default: "20"
>  
> This change will be backwards compatible and will not break any existing 
> functionality nor change the default display characteristics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()

2018-05-31 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496840#comment-16496840
 ] 

Sean Owen commented on SPARK-24442:
---

Have a look at [https://spark.apache.org/contributing.html] first – among other 
things, we use pull requests and not patches.

> Add configuration parameter to adjust the number of records and the characters 
> per row before truncation when a user runs .show()
> ---
>
> Key: SPARK-24442
> URL: https://issues.apache.org/jira/browse/SPARK-24442
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Andrew K Long
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: spark-adjustable-display-size.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently the number of characters displayed when a user runs the .show() 
> function on a data frame is hard coded. The current default is too small when 
> used with wider console widths.  This fix will add two parameters.
>  
> parameter: "spark.show.default.number.of.rows" default: "20"
> parameter: "spark.show.default.truncate.characters.per.column" default: "20"
>  
> This change will be backwards compatible and will not break any existing 
> functionality nor change the default display characteristics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()

2018-05-31 Thread Andrew K Long (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew K Long updated SPARK-24442:
--
Attachment: spark-adjustable-display-size.diff

> Add configuration parameter to adjust the number of records and the characters 
> per row before truncation when a user runs .show()
> ---
>
> Key: SPARK-24442
> URL: https://issues.apache.org/jira/browse/SPARK-24442
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Andrew K Long
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: spark-adjustable-display-size.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently the number of characters displayed when a user runs the .show() 
> function on a data frame is hard coded. The current default is too small when 
> used with wider console widths.  This fix will add two parameters.
>  
> parameter: "spark.show.default.number.of.rows" default: "20"
> parameter: "spark.show.default.truncate.characters.per.column" default: "20"
>  
> This change will be backwards compatible and will not break any existing 
> functionality nor change the default display characteristics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()

2018-05-31 Thread Andrew K Long (JIRA)
Andrew K Long created SPARK-24442:
-

 Summary: Add configuration parameter to adjust the number of 
records and the characters per row before truncation when a user runs .show()
 Key: SPARK-24442
 URL: https://issues.apache.org/jira/browse/SPARK-24442
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.3.0, 2.2.0
Reporter: Andrew K Long
 Fix For: 2.4.0


Currently the number of characters displayed when a user runs the .show() 
function on a data frame is hard coded. The current default is too small when 
used with wider console widths.  This fix will add two parameters.

 

parameter: "spark.show.default.number.of.rows" default: "20"

parameter: "spark.show.default.truncate.characters.per.column" default: "20"

 

This change will be backwards compatible and will not break any existing 
functionality nor change the default display characteristics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24441) Expose total size of states in HDFSBackedStateStoreProvider

2018-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24441:


Assignee: Apache Spark

> Expose total size of states in HDFSBackedStateStoreProvider
> ---
>
> Key: SPARK-24441
> URL: https://issues.apache.org/jira/browse/SPARK-24441
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> While Spark exposes state metrics for a single state, Spark still doesn't 
> expose the overall memory usage of state (loadedMaps) in 
> HDFSBackedStateStoreProvider. 
> Since HDFSBackedStateStoreProvider caches multiple versions of the entire state 
> in a hashmap, this can occupy much more memory than a single version of the 
> state. Based on the default value of minVersionsToRetain, the cache map can grow 
> to more than 100 times the size of a single state. It would be better to expose 
> this as well so that end users can determine the actual memory usage for state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24441) Expose total size of states in HDFSBackedStateStoreProvider

2018-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24441:


Assignee: (was: Apache Spark)

> Expose total size of states in HDFSBackedStateStoreProvider
> ---
>
> Key: SPARK-24441
> URL: https://issues.apache.org/jira/browse/SPARK-24441
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> While Spark exposes state metrics for a single state, Spark still doesn't 
> expose the overall memory usage of state (loadedMaps) in 
> HDFSBackedStateStoreProvider. 
> Since HDFSBackedStateStoreProvider caches multiple versions of the entire state 
> in a hashmap, this can occupy much more memory than a single version of the 
> state. Based on the default value of minVersionsToRetain, the cache map can grow 
> to more than 100 times the size of a single state. It would be better to expose 
> this as well so that end users can determine the actual memory usage for state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24441) Expose total size of states in HDFSBackedStateStoreProvider

2018-05-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496665#comment-16496665
 ] 

Apache Spark commented on SPARK-24441:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/21469

> Expose total size of states in HDFSBackedStateStoreProvider
> ---
>
> Key: SPARK-24441
> URL: https://issues.apache.org/jira/browse/SPARK-24441
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> While Spark exposes state metrics for a single state, Spark still doesn't 
> expose the overall memory usage of state (loadedMaps) in 
> HDFSBackedStateStoreProvider. 
> Since HDFSBackedStateStoreProvider caches multiple versions of the entire state 
> in a hashmap, this can occupy much more memory than a single version of the 
> state. Based on the default value of minVersionsToRetain, the cache map can grow 
> to more than 100 times the size of a single state. It would be better to expose 
> this as well so that end users can determine the actual memory usage for state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly

2018-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22151:


Assignee: Apache Spark

> PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
> --
>
> Key: SPARK-22151
> URL: https://issues.apache.org/jira/browse/SPARK-22151
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Major
>
> Running in YARN cluster mode and trying to set PYTHONPATH via 
> spark.yarn.appMasterEnv.PYTHONPATH doesn't work.
> The YARN Client code looks at the env variables:
> val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath)
> But when you set spark.yarn.appMasterEnv, it puts it into the local env. 
> So the Python path set in spark.yarn.appMasterEnv isn't properly set.
> You can work around it if you are running in cluster mode by setting it on the 
> client like:
> PYTHONPATH=./addon/python/ spark-submit
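A rough sketch of the mismatch being described, with simplified stand-in code rather than the 
actual YARN Client implementation (getAllWithPrefix is assumed to behave as in Spark 2.x):

{code:scala}
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()

// What the user sets ends up in SparkConf under the appMasterEnv prefix...
val appMasterEnv: Map[String, String] =
  sparkConf.getAllWithPrefix("spark.yarn.appMasterEnv.").toMap

// ...but the Client builds the Python path from its own process environment,
// so a PYTHONPATH configured only via appMasterEnv is never seen here.
val clientSidePythonPath: Option[String] = sys.env.get("PYTHONPATH")
{code}

Hence the workaround of exporting PYTHONPATH in the client shell before spark-submit.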



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly

2018-05-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496648#comment-16496648
 ] 

Apache Spark commented on SPARK-22151:
--

User 'pgandhi999' has created a pull request for this issue:
https://github.com/apache/spark/pull/21468

> PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
> --
>
> Key: SPARK-22151
> URL: https://issues.apache.org/jira/browse/SPARK-22151
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Priority: Major
>
> Running in YARN cluster mode and trying to set PYTHONPATH via 
> spark.yarn.appMasterEnv.PYTHONPATH doesn't work.
> The YARN Client code looks at the env variables:
> val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath)
> But when you set spark.yarn.appMasterEnv, it puts it into the local env. 
> So the Python path set in spark.yarn.appMasterEnv isn't properly set.
> You can work around it if you are running in cluster mode by setting it on the 
> client like:
> PYTHONPATH=./addon/python/ spark-submit



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly

2018-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22151:


Assignee: (was: Apache Spark)

> PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
> --
>
> Key: SPARK-22151
> URL: https://issues.apache.org/jira/browse/SPARK-22151
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Priority: Major
>
> Running in YARN cluster mode and trying to set PYTHONPATH via 
> spark.yarn.appMasterEnv.PYTHONPATH doesn't work.
> The YARN Client code looks at the env variables:
> val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath)
> But when you set spark.yarn.appMasterEnv, it puts it into the local env. 
> So the Python path set in spark.yarn.appMasterEnv isn't properly set.
> You can work around it if you are running in cluster mode by setting it on the 
> client like:
> PYTHONPATH=./addon/python/ spark-submit



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24441) Expose total size of states in HDFSBackedStateStoreProvider

2018-05-31 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-24441:


 Summary: Expose total size of states in 
HDFSBackedStateStoreProvider
 Key: SPARK-24441
 URL: https://issues.apache.org/jira/browse/SPARK-24441
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Jungtaek Lim


While Spark exposes state metrics for a single state, Spark still doesn't expose 
the overall memory usage of state (loadedMaps) in HDFSBackedStateStoreProvider. 

Since HDFSBackedStateStoreProvider caches multiple versions of the entire state in 
a hashmap, this can occupy much more memory than a single version of the state. 
Based on the default value of minVersionsToRetain, the cache map can grow to more 
than 100 times the size of a single state. It would be better to expose this as 
well so that end users can determine the actual memory usage for state.
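A rough sketch of the kind of measurement this asks for, using a hypothetical stand-in for the 
provider's loadedMaps structure (this is not the actual HDFSBackedStateStoreProvider code, and 
SizeEstimator only approximates):

{code:scala}
import scala.collection.mutable

import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.util.SizeEstimator

// Hypothetical stand-in: version -> (key -> value) map of state rows.
val loadedMaps = mutable.HashMap.empty[Long, mutable.HashMap[UnsafeRow, UnsafeRow]]

// Approximate memory held across all cached versions -- the overall number
// this ticket proposes exposing alongside the existing single-state metrics.
val totalStateBytes: Long = SizeEstimator.estimate(loadedMaps)
{code}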



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-05-31 Thread Cody Koeninger (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496633#comment-16496633
 ] 

Cody Koeninger commented on SPARK-18057:


I'd just modify KafkaTestUtils to match the way things were
reorganized in the Kafka project.



> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24437) Memory leak in UnsafeHashedRelation

2018-05-31 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496598#comment-16496598
 ] 

Marco Gaido commented on SPARK-24437:
-

Do you have dynamic allocation enabled?

> Memory leak in UnsafeHashedRelation
> ---
>
> Key: SPARK-24437
> URL: https://issues.apache.org/jira/browse/SPARK-24437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: gagan taneja
>Priority: Critical
> Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png, Screen Shot 
> 2018-05-30 at 2.07.22 PM.png
>
>
> There seems to be a memory leak with 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation.
> We have a long-running instance of STS (Spark Thrift Server).
> With each query execution requiring a Broadcast Join, UnsafeHashedRelation is 
> getting added for cleanup in ContextCleaner. This reference to 
> UnsafeHashedRelation is being held by some other Collection and does not become 
> eligible for GC, so ContextCleaner is not able to clean it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23936) High-order function: map_concat(map1, map2, ..., mapN) → map

2018-05-31 Thread Bruce Robbins (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496599#comment-16496599
 ] 

Bruce Robbins commented on SPARK-23936:
---

tl;dr version: Spark's Map type allows duplicates. However, this function's 
description requires eliminating any duplicates added when concatenating maps.

I give three proposals on how to deal with this discrepancy, one of which is 
simply allowing the additional duplicates that might be added when 
concatenating maps.
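For comparison, the "last map wins" rule in the Presto description quoted below matches how plain 
Scala Map concatenation already deduplicates keys; a minimal sketch with ordinary collections 
(not the proposed Spark function):

{code:scala}
val m1 = Map(1 -> "a", 2 -> "b")
val m2 = Map(2 -> "c", 3 -> "d")

// ++ keeps the right-hand value on key collisions, so the value for key 2
// comes from the last map and no duplicate keys survive.
val merged = m1 ++ m2   // Map(1 -> "a", 2 -> "c", 3 -> "d")
{code}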

> High-order function: map_concat(map1, map2, ..., mapN) → 
> map
> ---
>
> Key: SPARK-23936
> URL: https://issues.apache.org/jira/browse/SPARK-23936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref:  https://prestodb.io/docs/current/functions/map.html
> Returns the union of all the given maps. If a key is found in multiple given 
> maps, that key’s value in the resulting map comes from the last one of those 
> maps.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24437) Memory leak in UnsafeHashedRelation

2018-05-31 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496585#comment-16496585
 ] 

Marco Gaido commented on SPARK-24437:
-

I just remembered that I started working on this some time ago. Here you can 
find a WIP patch: 
https://github.com/mgaido91/spark/commit/ec78196c82f8c03d00bf90270eebb3ae1859c742.
 I don't remember whether it is working but still needs some refinement, or whether 
I was just looking for a better solution, but in general it should work. I'll get 
back to working on it as soon as I can.

> Memory leak in UnsafeHashedRelation
> ---
>
> Key: SPARK-24437
> URL: https://issues.apache.org/jira/browse/SPARK-24437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: gagan taneja
>Priority: Critical
> Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png, Screen Shot 
> 2018-05-30 at 2.07.22 PM.png
>
>
> There seems to be a memory leak with 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation.
> We have a long-running instance of STS (Spark Thrift Server).
> With each query execution requiring a Broadcast Join, UnsafeHashedRelation is 
> getting added for cleanup in ContextCleaner. This reference to 
> UnsafeHashedRelation is being held by some other Collection and does not become 
> eligible for GC, so ContextCleaner is not able to clean it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24146) spark.ml parity for sequential pattern mining - PrefixSpan: Python API

2018-05-31 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24146.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21265
[https://github.com/apache/spark/pull/21265]

> spark.ml parity for sequential pattern mining - PrefixSpan: Python API
> --
>
> Key: SPARK-24146
> URL: https://issues.apache.org/jira/browse/SPARK-24146
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> spark.ml parity for sequential pattern mining - PrefixSpan: Python API



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23754) StopIterator exception in Python UDF results in partial result

2018-05-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496470#comment-16496470
 ] 

Apache Spark commented on SPARK-23754:
--

User 'e-dorigatti' has created a pull request for this issue:
https://github.com/apache/spark/pull/21467

> StopIterator exception in Python UDF results in partial result
> --
>
> Key: SPARK-23754
> URL: https://issues.apache.org/jira/browse/SPARK-23754
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Assignee: Emilio Dorigatti
>Priority: Blocker
> Fix For: 2.3.1, 2.4.0
>
>
> Reproduce:
> {code:java}
> df = spark.range(0, 1000)
> from pyspark.sql.functions import udf
> def foo(x):
>     raise StopIteration()
> df.withColumn('v', udf(foo)('id')).show()
> # Results
> # +---+---+
> # | id|  v|
> # +---+---+
> # +---+---+{code}
> I think the task should fail in this case



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24427) Spark 2.2 - Exception occurred while saving table in spark. Multiple sources found for parquet

2018-05-31 Thread Ashok Rai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496367#comment-16496367
 ] 

Ashok Rai commented on SPARK-24427:
---

Ok. Please let me know the mailing list. I will send my error there.



>  Spark 2.2 - Exception occurred while saving table in spark. Multiple sources 
> found for parquet 
> 
>
> Key: SPARK-24427
> URL: https://issues.apache.org/jira/browse/SPARK-24427
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Ashok Rai
>Priority: Major
>
> We are getting the below error while loading into a Hive table. In our code, we use 
> "saveAsTable", which as per the documentation automatically chooses the format 
> that the table was created with. We have now tested by creating the table as 
> Parquet as well as ORC. In both cases the same error occurred.
>  
> -
> 2018-05-29 12:25:07,433 ERROR [main] ERROR - Exception occurred while saving 
> table in spark.
>  org.apache.spark.sql.AnalysisException: Multiple sources found for parquet 
> (org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat, 
> org.apache.spark.sql.execution.datasources.parquet.DefaultSource), please 
> specify the fully qualified class name.;
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:584)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:111)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:75)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) 
> ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:75)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:71)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  ~[scala-library-2.11.8.jar:?]
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  ~[scala-library-2.11.8.jar:?]
>  at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48) 
> ~[scala-library-2.11.8.jar:?]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at scala.collection.immutable.List.foreach(List.scala:381) 
> ~[scala-library-2.11.8.jar:?]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> 

[jira] [Resolved] (SPARK-24427) Spark 2.2 - Exception occurred while saving table in spark. Multiple sources found for parquet

2018-05-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24427.
--
Resolution: Invalid

Here, https://spark.apache.org/community.html

Let's leave this closed for now until it's clear that it's an issue.

>  Spark 2.2 - Exception occurred while saving table in spark. Multiple sources 
> found for parquet 
> 
>
> Key: SPARK-24427
> URL: https://issues.apache.org/jira/browse/SPARK-24427
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Ashok Rai
>Priority: Major
>
> We are getting the below error while loading into a Hive table. In our code, we use 
> "saveAsTable", which as per the documentation automatically chooses the format 
> that the table was created with. We have now tested by creating the table as 
> Parquet as well as ORC. In both cases the same error occurred.
>  
> -
> 2018-05-29 12:25:07,433 ERROR [main] ERROR - Exception occurred while saving 
> table in spark.
>  org.apache.spark.sql.AnalysisException: Multiple sources found for parquet 
> (org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat, 
> org.apache.spark.sql.execution.datasources.parquet.DefaultSource), please 
> specify the fully qualified class name.;
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:584)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:111)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:75)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) 
> ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:75)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:71)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  ~[scala-library-2.11.8.jar:?]
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  ~[scala-library-2.11.8.jar:?]
>  at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48) 
> ~[scala-library-2.11.8.jar:?]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at scala.collection.immutable.List.foreach(List.scala:381) 
> ~[scala-library-2.11.8.jar:?]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
>  

[jira] [Created] (SPARK-24440) When using a constant as a column we may get a wrong answer versus Impala

2018-05-31 Thread zhoukang (JIRA)
zhoukang created SPARK-24440:


 Summary: When using a constant as a column we may get a wrong answer 
versus Impala
 Key: SPARK-24440
 URL: https://issues.apache.org/jira/browse/SPARK-24440
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.1.0
Reporter: zhoukang


For the query below:

{code:java}
select `date`, 100 as platform, count(distinct deviceid) as new_user from 
tv.clean_new_user where `date`=20180528 group by `date`, platform
{code}
We intended to group by 100 and get the distinct deviceid count.
With Spark SQL, we get:
{code}
+---+---+---+--+
|   date| platform  | new_user  |
+---+---+---+--+
| 20180528  | 100   | 521   |
| 20180528  | 100   | 82|
| 20180528  | 100   | 3 |
| 20180528  | 100   | 2 |
| 20180528  | 100   | 7 |
| 20180528  | 100   | 870   |
| 20180528  | 100   | 3 |
| 20180528  | 100   | 8 |
| 20180528  | 100   | 3 |
| 20180528  | 100   | 2204  |
| 20180528  | 100   | 1123  |
| 20180528  | 100   | 1 |
| 20180528  | 100   | 54|
| 20180528  | 100   | 440   |
| 20180528  | 100   | 4 |
| 20180528  | 100   | 478   |
| 20180528  | 100   | 34|
| 20180528  | 100   | 195   |
| 20180528  | 100   | 17|
| 20180528  | 100   | 18|
| 20180528  | 100   | 2 |
| 20180528  | 100   | 2 |
| 20180528  | 100   | 84|
| 20180528  | 100   | 1616  |
| 20180528  | 100   | 15|
| 20180528  | 100   | 7 |
| 20180528  | 100   | 479   |
| 20180528  | 100   | 50|
| 20180528  | 100   | 376   |
| 20180528  | 100   | 21|
| 20180528  | 100   | 842   |
| 20180528  | 100   | 444   |
| 20180528  | 100   | 538   |
| 20180528  | 100   | 1 |
| 20180528  | 100   | 2 |
| 20180528  | 100   | 7 |
| 20180528  | 100   | 17|
| 20180528  | 100   | 133   |
| 20180528  | 100   | 7 |
| 20180528  | 100   | 415   |
| 20180528  | 100   | 2 |
| 20180528  | 100   | 318   |
| 20180528  | 100   | 5 |
| 20180528  | 100   | 1 |
| 20180528  | 100   | 2060  |
| 20180528  | 100   | 1217  |
| 20180528  | 100   | 2 |
| 20180528  | 100   | 60|
| 20180528  | 100   | 22|
| 20180528  | 100   | 4 |
+---+---+---+--+
{code}
The actual sum of the deviceid counts is below:
{code}
0: jdbc:hive2://xxx/> select sum(t1.new_user) from (select `date`, 100 as 
platform, count(distinct deviceid) as new_user from tv.clean_new_user where 
`date`=20180528 group by `date`, platform)t1; 
++--+
| sum(new_user)  |
++--+
| 14816  |
++--+
1 row selected (4.934 seconds)
{code}
And the real distinct deviceid count is below:
{code}
0: jdbc:hive2://xxx/> select 100 as platform, count(distinct deviceid) as 
new_user from tv.clean_new_user where `date`=20180528;
+---+---+--+
| platform  | new_user  |
+---+---+--+
| 100   | 14773 |
+---+---+--+
1 row selected (2.846 seconds)
{code}

In Impala, with the first query we get the result below:
{code}
[xxx] > select `date`, 100 as platform, count(distinct deviceid) as new_user 
from tv.clean_new_user where `date`=20180528 group by `date`, platform;Query: 
select `date`, 100 as platform, count(distinct deviceid) as new_user from 
tv.clean_new_user where `date`=20180528 group by `date`, platform
+--+--+--+
| date | platform | new_user |
+--+--+--+
| 20180528 | 100  | 14773|
+--+--+--+
Fetched 1 row(s) in 1.00s
{code}
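Until the discrepancy itself is resolved, one way to express the intended aggregation is to group 
only by the real column and attach the constant afterwards. A sketch using the DataFrame API, with 
the table and column names from this report and assuming the table is reachable through the Hive 
catalog:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct, lit}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Group only by `date`, then add the constant column after aggregating,
// so the literal cannot interfere with the grouping.
val newUsers = spark.table("tv.clean_new_user")
  .where(col("date") === 20180528)
  .groupBy(col("date"))
  .agg(countDistinct(col("deviceid")).as("new_user"))
  .withColumn("platform", lit(100))

newUsers.show()
{code}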



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24427) Spark 2.2 - Exception occurred while saving table in spark. Multiple sources found for parquet

2018-05-31 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496359#comment-16496359
 ] 

Hyukjin Kwon commented on SPARK-24427:
--

The log messages basically mean it detected multiple datasources having the 
same name, parquet. It's more likely a user mistake. Shall we ask a question on 
the mailing list before filing an issue here?
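If the classpath really does carry two datasources registered under the short name parquet (for 
example a stray extra spark-sql jar), one workaround the error message itself points at is to 
spell out the fully qualified class; a sketch with placeholder data and table name, not taken 
from this report:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val df = spark.range(10).toDF("id")  // placeholder DataFrame

// Use the fully qualified datasource class instead of the ambiguous short name.
df.write
  .format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat")
  .saveAsTable("my_db.my_table")
{code}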

>  Spark 2.2 - Exception occurred while saving table in spark. Multiple sources 
> found for parquet 
> 
>
> Key: SPARK-24427
> URL: https://issues.apache.org/jira/browse/SPARK-24427
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Ashok Rai
>Priority: Major
>
> We are getting the below error while loading into a Hive table. In our code, we use 
> "saveAsTable", which as per the documentation automatically chooses the format 
> that the table was created with. We have now tested by creating the table as 
> Parquet as well as ORC. In both cases the same error occurred.
>  
> -
> 2018-05-29 12:25:07,433 ERROR [main] ERROR - Exception occurred while saving 
> table in spark.
>  org.apache.spark.sql.AnalysisException: Multiple sources found for parquet 
> (org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat, 
> org.apache.spark.sql.execution.datasources.parquet.DefaultSource), please 
> specify the fully qualified class name.;
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:584)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:111)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:75)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) 
> ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:75)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:71)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  ~[scala-library-2.11.8.jar:?]
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  ~[scala-library-2.11.8.jar:?]
>  at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48) 
> ~[scala-library-2.11.8.jar:?]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at scala.collection.immutable.List.foreach(List.scala:381) 
> ~[scala-library-2.11.8.jar:?]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> 

[jira] [Commented] (SPARK-24427) Spark 2.2 - Exception occurred while saving table in spark. Multiple sources found for parquet

2018-05-31 Thread Ashok Rai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496358#comment-16496358
 ] 

Ashok Rai commented on SPARK-24427:
---

I have not specified any version in spark-submit.
I am using "export SPARK_MAJOR_VERSION=2" before spark-submit command.



>  Spark 2.2 - Exception occurred while saving table in spark. Multiple sources 
> found for parquet 
> 
>
> Key: SPARK-24427
> URL: https://issues.apache.org/jira/browse/SPARK-24427
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Ashok Rai
>Priority: Major
>
> We are getting the below error while loading into a Hive table. In our code, we use 
> "saveAsTable", which as per the documentation automatically chooses the format 
> that the table was created with. We have now tested by creating the table as 
> Parquet as well as ORC. In both cases the same error occurred.
>  
> -
> 2018-05-29 12:25:07,433 ERROR [main] ERROR - Exception occurred while saving 
> table in spark.
>  org.apache.spark.sql.AnalysisException: Multiple sources found for parquet 
> (org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat, 
> org.apache.spark.sql.execution.datasources.parquet.DefaultSource), please 
> specify the fully qualified class name.;
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:584)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:111)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:75)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) 
> ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:75)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:71)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  ~[scala-library-2.11.8.jar:?]
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  ~[scala-library-2.11.8.jar:?]
>  at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48) 
> ~[scala-library-2.11.8.jar:?]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at scala.collection.immutable.List.foreach(List.scala:381) 
> ~[scala-library-2.11.8.jar:?]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
>  

[jira] [Commented] (SPARK-24437) Memory leak in UnsafeHashedRelation

2018-05-31 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496331#comment-16496331
 ] 

Marco Gaido commented on SPARK-24437:
-

I remember another JIRA about this. Anyway, this is indeed a problem.

> Memory leak in UnsafeHashedRelation
> ---
>
> Key: SPARK-24437
> URL: https://issues.apache.org/jira/browse/SPARK-24437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: gagan taneja
>Priority: Critical
> Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png, Screen Shot 
> 2018-05-30 at 2.07.22 PM.png
>
>
> There seems to be a memory leak with 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation.
> We have a long-running instance of STS (Spark Thrift Server).
> With each query execution requiring a Broadcast Join, UnsafeHashedRelation is 
> getting added for cleanup in ContextCleaner. This reference to 
> UnsafeHashedRelation is being held by some other Collection and does not become 
> eligible for GC, so ContextCleaner is not able to clean it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23266) Matrix Inversion on BlockMatrix

2018-05-31 Thread Chandan Misra (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496306#comment-16496306
 ] 

Chandan Misra commented on SPARK-23266:
---

I want to add this feature in one of the coming versions. Kindly let me know 
how this can be done.

> Matrix Inversion on BlockMatrix
> ---
>
> Key: SPARK-23266
> URL: https://issues.apache.org/jira/browse/SPARK-23266
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Chandan Misra
>Priority: Minor
>
> Matrix inversion is a basic building block for many other algorithms like 
> regression, classification, and geostatistical analysis using ordinary kriging. 
> A simple, efficient, distributed divide-and-conquer algorithm based on Spark 
> BlockMatrix can be implemented using only *6* 
> multiplications in each recursion level of the algorithm. The reference paper 
> can be found at
> [https://arxiv.org/abs/1801.04723]
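For readers new to the approach, the recursion in such divide-and-conquer schemes typically rests 
on the standard 2x2 block inversion identity (written here in LaTeX, assuming both A and the 
Schur complement S are invertible):

{code}
M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}, \qquad S = D - C A^{-1} B

M^{-1} = \begin{pmatrix}
  A^{-1} + A^{-1} B S^{-1} C A^{-1} & -A^{-1} B S^{-1} \\
  -S^{-1} C A^{-1}                  &  S^{-1}
\end{pmatrix}
{code}

Each level recursively inverts A and S and otherwise needs only the block products A^{-1}B, 
CA^{-1}, C(A^{-1}B), S^{-1}(CA^{-1}), (A^{-1}B)S^{-1}, and (A^{-1}B)(S^{-1}CA^{-1}), which is 
consistent with the six multiplications per recursion level mentioned above.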



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-05-31 Thread Ruben Berenguel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496290#comment-16496290
 ] 

Ruben Berenguel commented on SPARK-23904:
-

Thanks [~igreenfi], still at it then :)

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I created a question on 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark creates the text representation of the query in any case, even if I don't 
> need it.
> That causes many garbage objects and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23257) Implement Kerberos Support in Kubernetes resource manager

2018-05-31 Thread Rob Vesse (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496284#comment-16496284
 ] 

Rob Vesse commented on SPARK-23257:
---

[~ifilonenko] Any updates on this?

We're currently using the fork as Kerberos support is a must-have for our 
customers and would love to get this into upstream and get ourselves back onto 
an official Spark release.

We can likely help out with testing, review and/or implementation as needed

> Implement Kerberos Support in Kubernetes resource manager
> -
>
> Key: SPARK-23257
> URL: https://issues.apache.org/jira/browse/SPARK-23257
> Project: Spark
>  Issue Type: Wish
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Rob Keevil
>Priority: Major
>
> On the forked k8s branch of Spark at 
> [https://github.com/apache-spark-on-k8s/spark/pull/540] , Kerberos support 
> has been added to the Kubernetes resource manager.  The Kubernetes code 
> between these two repositories appears to have diverged, so this commit 
> cannot be merged in easily.  Are there any plans to re-implement this work on 
> the main Spark repository?
>  
> [ifilonenko|https://github.com/ifilonenko] [~liyinan926] I am happy to help 
> with the development and testing of this, but I wanted to confirm that this 
> isn't already in progress - I could not find any discussion about this 
> specific topic online.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23904) Big execution plan cause OOM

2018-05-31 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496272#comment-16496272
 ] 

Izek Greenfield edited comment on SPARK-23904 at 5/31/18 8:43 AM:
--

[~RBerenguel] 
Class: SQLExecution
Method: withNewExecutionId
Line: 73

{code:scala}
sparkSession.sparkContext.listenerBus.post(SparkListenerSQLExecutionStart(
executionId, callSite.shortForm, callSite.longForm, 
queryExecution.toString,
SparkPlanInfo.fromSparkPlan(queryExecution.executedPlan), 
System.currentTimeMillis()))
{code}
 


was (Author: igreenfi):
[~RBerenguel] 
Class: SQLExecution
Method: withNewExecutionId
Line: 73

{code:scala}
sparkSession.sparkContext.listenerBus.post(SparkListenerSQLExecutionStart(
executionId, callSite.shortForm, callSite.longForm, 
"queryExecution.toString",
SparkPlanInfo.fromSparkPlan(queryExecution.executedPlan), 
System.currentTimeMillis()))
{code}
 

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I created a question on 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark creates the text representation of the query in any case, even if I don't 
> need it.
> That causes many garbage objects and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


