[jira] [Updated] (SPARK-17129) Support statistics collection and cardinality estimation for partitioned tables

2017-11-17 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-17129:
-
Description: Support statistics collection and cardinality estimation for 
partitioned tables.  (was: I upgrade this JIRA, because there are many tasks 
found and needed to be done here.)

> Support statistics collection and cardinality estimation for partitioned 
> tables
> ---
>
> Key: SPARK-17129
> URL: https://issues.apache.org/jira/browse/SPARK-17129
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> Support statistics collection and cardinality estimation for partitioned 
> tables.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22550) 64KB JVM bytecode limit problem with elt

2017-11-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22550:


Assignee: Apache Spark

> 64KB JVM bytecode limit problem with elt
> 
>
> Key: SPARK-22550
> URL: https://issues.apache.org/jira/browse/SPARK-22550
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> {{elt}} can throw an exception due to the 64KB JVM bytecode limit when it is
> used with many arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22550) 64KB JVM bytecode limit problem with elt

2017-11-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257901#comment-16257901
 ] 

Apache Spark commented on SPARK-22550:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19778

> 64KB JVM bytecode limit problem with elt
> 
>
> Key: SPARK-22550
> URL: https://issues.apache.org/jira/browse/SPARK-22550
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> {{elt}} can throw an exception due to the 64KB JVM bytecode limit when it is
> used with many arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22550) 64KB JVM bytecode limit problem with elt

2017-11-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22550:


Assignee: (was: Apache Spark)

> 64KB JVM bytecode limit problem with elt
> 
>
> Key: SPARK-22550
> URL: https://issues.apache.org/jira/browse/SPARK-22550
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> {{elt}} can throw an exception due to the 64KB JVM bytecode limit when it is
> used with many arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22549) 64KB JVM bytecode limit problem with concat_ws

2017-11-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257898#comment-16257898
 ] 

Apache Spark commented on SPARK-22549:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19777

> 64KB JVM bytecode limit problem with concat_ws
> --
>
> Key: SPARK-22549
> URL: https://issues.apache.org/jira/browse/SPARK-22549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> {{concat_ws}} can throw an exception due to the 64KB JVM bytecode limit when
> it is used with many arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22549) 64KB JVM bytecode limit problem with concat_ws

2017-11-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22549:


Assignee: (was: Apache Spark)

> 64KB JVM bytecode limit problem with concat_ws
> --
>
> Key: SPARK-22549
> URL: https://issues.apache.org/jira/browse/SPARK-22549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> {{concat_ws}} can throw an exception due to the 64KB JVM bytecode limit when
> it is used with many arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22549) 64KB JVM bytecode limit problem with concat_ws

2017-11-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22549:


Assignee: Apache Spark

> 64KB JVM bytecode limit problem with concat_ws
> --
>
> Key: SPARK-22549
> URL: https://issues.apache.org/jira/browse/SPARK-22549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> {{concat_ws}} can throw an exception due to the 64KB JVM bytecode limit when
> it is used with many arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22550) 64KB JVM bytecode limit problem with elt

2017-11-17 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22550:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> 64KB JVM bytecode limit problem with elt
> 
>
> Key: SPARK-22550
> URL: https://issues.apache.org/jira/browse/SPARK-22550
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> {{elt}} can throw an exception due to the 64KB JVM bytecode limit when it is
> used with many arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22550) 64KB JVM bytecode limit problem with elt

2017-11-17 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-22550:


 Summary: 64KB JVM bytecode limit problem with elt
 Key: SPARK-22550
 URL: https://issues.apache.org/jira/browse/SPARK-22550
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kazuaki Ishizaki


{{elt}} can throw an exception due to the 64KB JVM bytecode limit when it is used
with many arguments.
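For context, a rough reproduction sketch (not taken from this report; the column count and session setup are illustrative assumptions) of how an {{elt}} call over many arguments can inflate the generated code:

{code:scala}
// Hypothetical reproduction sketch: the column count (2000) is arbitrary.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object EltCodegenSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("elt-64kb").getOrCreate()

    val n = 2000
    // One row with n non-constant string columns c1..cn (derived from id so they are not constant-folded).
    val df = spark.range(1).select((1 to n).map(i => (col("id") + i).cast("string").as(s"c$i")): _*)

    // elt(index, s1, ..., sN) returns the index-th string; with thousands of arguments the
    // code generated for this single expression can approach the JVM's 64KB-per-method limit.
    df.selectExpr(s"elt(1, ${(1 to n).map(i => s"c$i").mkString(", ")})").show()

    spark.stop()
  }
}
{code}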



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22549) 64KB JVM bytecode limit problem with concat_ws

2017-11-17 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22549:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> 64KB JVM bytecode limit problem with concat_ws
> --
>
> Key: SPARK-22549
> URL: https://issues.apache.org/jira/browse/SPARK-22549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> {{concat_ws}} can throw an exception due to the 64KB JVM bytecode limit when
> it is used with many arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22549) 64KB JVM bytecode limit problem with concat_ws

2017-11-17 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-22549:


 Summary: 64KB JVM bytecode limit problem with concat_ws
 Key: SPARK-22549
 URL: https://issues.apache.org/jira/browse/SPARK-22549
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kazuaki Ishizaki


{{concat_ws}} can throw an exception due to the 64KB JVM bytecode limit when
it is used with many arguments.
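Similarly, a minimal illustrative sketch for {{concat_ws}} using the DataFrame API (column count and session setup are assumptions, not from the report):

{code:scala}
// Hypothetical sketch: the column count is arbitrary; the point is the number of arguments.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws}

object ConcatWsCodegenSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("concat_ws-64kb").getOrCreate()

    val n = 2000
    // One row with n non-constant string columns c1..cn.
    val df = spark.range(1).select((1 to n).map(i => (col("id") + i).cast("string").as(s"c$i")): _*)

    // concat_ws(separator, c1, ..., cN): thousands of arguments inflate the generated evaluation code.
    df.select(concat_ws(",", (1 to n).map(i => col(s"c$i")): _*)).show()

    spark.stop()
  }
}
{code}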




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22498) 64KB JVM bytecode limit problem with concat

2017-11-17 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22498:
-
Summary: 64KB JVM bytecode limit problem with concat  (was: 64KB JVM 
bytecode limit problem with concat and concat_ws)

> 64KB JVM bytecode limit problem with concat
> ---
>
> Key: SPARK-22498
> URL: https://issues.apache.org/jira/browse/SPARK-22498
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> Both {{concat}} and {{concat_ws}} can throw an exception due to the 64KB JVM
> bytecode limit when they are used with many arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22498) 64KB JVM bytecode limit problem with concat

2017-11-17 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22498:
-
Description: {{concat}} can throw an exception due to the 64KB JVM bytecode 
limit when it is used with many arguments  (was: Both {{concat}} and 
{{concat_ws}} can throw an exception due to the 64KB JVM bytecode limit when 
they use with a lot of arguments)

> 64KB JVM bytecode limit problem with concat
> ---
>
> Key: SPARK-22498
> URL: https://issues.apache.org/jira/browse/SPARK-22498
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> {{concat}} can throw an exception due to the 64KB JVM bytecode limit when
> it is used with many arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22548) Incorrect nested AND expression pushed down to JDBC data source

2017-11-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22548:


Assignee: (was: Apache Spark)

> Incorrect nested AND expression pushed down to JDBC data source
> ---
>
> Key: SPARK-22548
> URL: https://issues.apache.org/jira/browse/SPARK-22548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jia Li
>
> Let’s say I have a JDBC data source table ‘foobar’ with 3 rows:
> NAME             THEID
> ======================
> fred             1
> mary             2
> joe 'foo' "bar"  3
> This query returns an incorrect result:
> SELECT * FROM foobar WHERE (THEID > 0 AND TRIM(NAME) = 'mary') OR (NAME =
> 'fred')
> It’s supposed to return:
> fred             1
> mary             2
> But it returns:
> fred             1
> mary             2
> joe 'foo' "bar"  3
> This is because one leg of the nested AND predicate, TRIM(NAME) = 'mary',
> cannot be pushed down but is lost during the JDBC pushdown filter translation.
> The same translation method is also called by Data Source V2. I have a fix for
> this issue and will open a PR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22548) Incorrect nested AND expression pushed down to JDBC data source

2017-11-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257839#comment-16257839
 ] 

Apache Spark commented on SPARK-22548:
--

User 'jliwork' has created a pull request for this issue:
https://github.com/apache/spark/pull/19776

> Incorrect nested AND expression pushed down to JDBC data source
> ---
>
> Key: SPARK-22548
> URL: https://issues.apache.org/jira/browse/SPARK-22548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jia Li
>
> Let’s say I have a JDBC data source table ‘foobar’ with 3 rows:
> NAME             THEID
> ======================
> fred             1
> mary             2
> joe 'foo' "bar"  3
> This query returns an incorrect result:
> SELECT * FROM foobar WHERE (THEID > 0 AND TRIM(NAME) = 'mary') OR (NAME =
> 'fred')
> It’s supposed to return:
> fred             1
> mary             2
> But it returns:
> fred             1
> mary             2
> joe 'foo' "bar"  3
> This is because one leg of the nested AND predicate, TRIM(NAME) = 'mary',
> cannot be pushed down but is lost during the JDBC pushdown filter translation.
> The same translation method is also called by Data Source V2. I have a fix for
> this issue and will open a PR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22548) Incorrect nested AND expression pushed down to JDBC data source

2017-11-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22548:


Assignee: Apache Spark

> Incorrect nested AND expression pushed down to JDBC data source
> ---
>
> Key: SPARK-22548
> URL: https://issues.apache.org/jira/browse/SPARK-22548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jia Li
>Assignee: Apache Spark
>
> Let’s say I have a JDBC data source table ‘foobar’ with 3 rows:
> NAME             THEID
> ======================
> fred             1
> mary             2
> joe 'foo' "bar"  3
> This query returns an incorrect result:
> SELECT * FROM foobar WHERE (THEID > 0 AND TRIM(NAME) = 'mary') OR (NAME =
> 'fred')
> It’s supposed to return:
> fred             1
> mary             2
> But it returns:
> fred             1
> mary             2
> joe 'foo' "bar"  3
> This is because one leg of the nested AND predicate, TRIM(NAME) = 'mary',
> cannot be pushed down but is lost during the JDBC pushdown filter translation.
> The same translation method is also called by Data Source V2. I have a fix for
> this issue and will open a PR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22548) Incorrect nested AND expression pushed down to JDBC data source

2017-11-17 Thread Jia Li (JIRA)
Jia Li created SPARK-22548:
--

 Summary: Incorrect nested AND expression pushed down to JDBC data 
source
 Key: SPARK-22548
 URL: https://issues.apache.org/jira/browse/SPARK-22548
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Jia Li


Let’s say I have a JDBC data source table ‘foobar’ with 3 rows:

NAME             THEID
======================
fred             1
mary             2
joe 'foo' "bar"  3

This query returns an incorrect result:
SELECT * FROM foobar WHERE (THEID > 0 AND TRIM(NAME) = 'mary') OR (NAME = 
'fred')

It’s supposed to return:
fred             1
mary             2

But it returns:
fred             1
mary             2
joe 'foo' "bar"  3

This is because one leg of the nested AND predicate, TRIM(NAME) = 'mary',
cannot be pushed down but is lost during the JDBC pushdown filter translation.
The same translation method is also called by Data Source V2. I have a fix for
this issue and will open a PR.
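To illustrate the reasoning, here is a minimal sketch (invented filter types and names, not Spark's actual filter-translation code) of why an AND must be translated atomically: if either leg cannot be expressed by the remote source, the whole AND has to stay in Spark instead of pushing only the translatable leg.

{code:scala}
// Minimal sketch with invented filter types; not Spark's actual translation code.
sealed trait Filter
case class GreaterThan(col: String, value: Int) extends Filter
case class EqualTo(col: String, value: String) extends Filter
case class Unsupported(description: String) extends Filter // stands in for e.g. TRIM(NAME) = 'mary'
case class And(left: Filter, right: Filter) extends Filter
case class Or(left: Filter, right: Filter) extends Filter

object PushdownSketch {
  // Returns Some(sql) only if the ENTIRE filter can be expressed by the remote source.
  def compile(f: Filter): Option[String] = f match {
    case GreaterThan(c, v) => Some(s"$c > $v")
    case EqualTo(c, v)     => Some(s"$c = '$v'")
    case Unsupported(_)    => None // cannot be pushed down
    case And(l, r) =>
      // The AND is pushed only when BOTH legs compile; pushing a single leg and
      // silently dropping the other is what produces the wrong rows described above.
      for (ls <- compile(l); rs <- compile(r)) yield s"($ls AND $rs)"
    case Or(l, r) =>
      for (ls <- compile(l); rs <- compile(r)) yield s"($ls OR $rs)"
  }

  def main(args: Array[String]): Unit = {
    val predicate = Or(And(GreaterThan("THEID", 0), Unsupported("TRIM(NAME) = 'mary'")),
                       EqualTo("NAME", "fred"))
    // Prints None: nothing is pushed down, so Spark evaluates the full predicate itself.
    println(PushdownSketch.compile(predicate))
  }
}
{code}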



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19371) Cannot spread cached partitions evenly across executors

2017-11-17 Thread Sergei Lebedev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257809#comment-16257809
 ] 

Sergei Lebedev commented on SPARK-19371:


> Usually the answer is to force a shuffle [...]

[~srowen] we are seeing exactly the same imbalance as a result of a shuffle. 
From the Executors page, it looks like a couple of executors get many more 
"reduce" tasks than the others. Does this sound like a possible scenario?



> Cannot spread cached partitions evenly across executors
> ---
>
> Key: SPARK-19371
> URL: https://issues.apache.org/jira/browse/SPARK-19371
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Thunder Stumpges
> Attachments: RDD Block Distribution on two executors.png, Unbalanced 
> RDD Blocks, and resulting task imbalance.png, Unbalanced RDD Blocks, and 
> resulting task imbalance.png, execution timeline.png
>
>
> Before running an intensive iterative job (in this case a distributed topic 
> model training), we need to load a dataset and persist it across executors. 
> After loading from HDFS and persisting, the partitions are spread unevenly 
> across executors (based on the initial scheduling of the reads, which are not 
> data-locality sensitive). The partition sizes are even, just not their 
> distribution over executors. We currently have no way to force the partitions 
> to spread evenly, and as the iterative algorithm begins, tasks are 
> distributed to executors based on this initial load, forcing some very 
> unbalanced work.
> This has been mentioned a 
> [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059]
>  of 
> [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html]
>  in 
> [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html]
>  user/dev group threads.
> None of the discussions I could find had solutions that worked for me. Here 
> are examples of things I have tried. All resulted in partitions in memory 
> that were NOT evenly distributed to executors, causing future tasks to be 
> imbalanced across executors as well.
> *Reduce Locality*
> {code}spark.shuffle.reduceLocality.enabled=false/true{code}
> *"Legacy" memory mode*
> {code}spark.memory.useLegacyMode = true/false{code}
> *Basic load and repartition*
> {code}
> val numPartitions = 48*16
> val df = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions).
> persist
> df.count
> {code}
> *Load and repartition to 2x partitions, then shuffle repartition down to 
> desired partitions*
> {code}
> val numPartitions = 48*16
> val df2 = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions*2)
> val df = df2.repartition(numPartitions).
> persist
> df.count
> {code}
> It would be great if, when persisting an RDD/DataFrame, we could request 
> that those partitions be stored evenly across executors in preparation for 
> future tasks. 
> I'm not sure if this is a more general issue (i.e. not just involving 
> persisting RDDs), but for the persisted in-memory case, it can make a HUGE 
> difference in the overall running time of the remaining work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22544) FileStreamSource should use its own hadoop conf to call globPathIfNecessary

2017-11-17 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-22544.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.2.2
   2.3.0

> FileStreamSource should use its own hadoop conf to call globPathIfNecessary
> ---
>
> Key: SPARK-22544
> URL: https://issues.apache.org/jira/browse/SPARK-22544
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.3.0, 2.2.2
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2017-11-17 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257742#comment-16257742
 ] 

Bryan Cutler commented on SPARK-21187:
--

[~icexelloss] It looks like there is a bug in older Arrow that's causing a 
problem with ArrayType, but it is fixed in the latest release. So decimal and array 
support depends on an upgrade. StructType and MapType still need to be done, but 
that will need a bit more work and discussion.

> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>
> This is to track adding the remaining type support in Arrow Converters.  
> Currently, only primitive data types are supported.
> Remaining types:
> * -*Date*-
> * -*Timestamp*-
> * *Complex*: Struct, Array, Map
> * *Decimal*
> Some things to do before closing this out:
> * Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)
> * Need to add some user docs
> * Make sure Python tests are thorough
> * Check into the complex type support mentioned in comments by [~leif]; should we 
> support multi-indexing?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22517) NullPointerException in ShuffleExternalSorter.spill()

2017-11-17 Thread Andreas Maier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257660#comment-16257660
 ] 

Andreas Maier commented on SPARK-22517:
---

Unfortunately I don't have minimal code to reproduce the problem. It 
occurred after some hours in a computation with hundreds of gigabytes of data, 
after some data was spilled from memory onto disk. But the problem looks 
similar to SPARK-21907. 

> NullPointerException in ShuffleExternalSorter.spill()
> -
>
> Key: SPARK-22517
> URL: https://issues.apache.org/jira/browse/SPARK-22517
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>
> I see a NullPointerException during sorting with the following stacktrace:
> {code}
> 17/11/13 15:02:56 ERROR Executor: Exception in task 138.0 in stage 9.0 (TID 
> 13497)
> java.lang.NullPointerException
> at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:193)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254)
> at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
> at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
> at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
> at 
> org.apache.spark.shuffle.sort.ShuffleInMemorySorter.reset(ShuffleInMemorySorter.java:100)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:256)
> at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
> at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
> at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.growPointerArrayIfNecessary(ShuffleExternalSorter.java:328)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:379)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22547) Don't include executor ID in metrics name

2017-11-17 Thread Li Haoyi (JIRA)
Li Haoyi created SPARK-22547:


 Summary: Don't include executor ID in metrics name
 Key: SPARK-22547
 URL: https://issues.apache.org/jira/browse/SPARK-22547
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Li Haoyi


Spark's metrics system prefixes all metrics collected from executors with the 
executor ID. 

* 
https://github.com/apache/spark/blob/fccb337f9d1e44a83cfcc00ce33eae1fad367695/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L136

This behavior causes two problems: 

* it's not possible to aggregate over executors (since the metric name is 
different for each host) 
* upstream metrics systems like Ganglia or Prometheus are put under high load 
because of the number of time series to store.

Removing the `executorId` from the name of the metric we register solves both of 
the above problems.
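A tiny illustrative sketch (invented helper, not Spark's actual MetricsSystem API) of the naming difference being proposed:

{code:scala}
// Illustrative only: invented helper, not Spark's MetricsSystem API.
object MetricNameSketch {
  def metricName(appId: String, executorId: String, source: String,
                 includeExecutorId: Boolean): String =
    if (includeExecutorId) s"$appId.$executorId.$source" // e.g. app-123.17.jvm.heap.used (one series per executor)
    else s"$appId.$source"                               // e.g. app-123.jvm.heap.used (aggregatable across executors)

  def main(args: Array[String]): Unit = {
    println(metricName("app-123", "17", "jvm.heap.used", includeExecutorId = true))
    println(metricName("app-123", "17", "jvm.heap.used", includeExecutorId = false))
  }
}
{code}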



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22526) Spark hangs while reading binary files from S3

2017-11-17 Thread mohamed imran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257247#comment-16257247
 ] 

mohamed imran commented on SPARK-22526:
---

[~srowen] Yes, I agree that it is an issue with the HTTP API which Spark uses 
to read files from S3. I don't see any other issues apart from the httpclient. 
But as a whole it's an issue related to the Spark framework, which uses this 
HTTP client. You need to look at whether the HTTP client successfully opens and 
closes connections while using the S3 API.

By default, Spark 2.2.0 uses httpclient-4.5.2.jar and httpcore-4.4.4.jar.

I suspect these HTTP jars have some issues while connecting to S3. Please look 
into it.

> Spark hangs while reading binary files from S3
> --
>
> Key: SPARK-22526
> URL: https://issues.apache.org/jira/browse/SPARK-22526
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: mohamed imran
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Hi,
> I am using Spark 2.2.0 (a recent version) to read binary files from S3. I use 
> sc.binaryFiles to read the files.
> It works fine for roughly the first 100 file reads, but later it hangs 
> indefinitely, from 5 up to 40 minutes, like the Avro file read issue (which 
> was fixed in later releases).
> I tried setting fs.s3a.connection.maximum to some maximum values, but that 
> didn't help.
> And finally I ended up setting the Spark speculation parameter, which again 
> didn't help much.
> One thing which I observed is that it is not closing the connection after 
> every read of binary files from S3.
> Example: sc.binaryFiles("s3a://test/test123.zip")
> Please look into this major issue!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22526) Spark hangs while reading binary files from S3

2017-11-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257230#comment-16257230
 ] 

Sean Owen commented on SPARK-22526:
---

It could be a problem with S3, or the S3 API. You haven't shown what 
connections aren't getting closed. I'm saying this is not likely an issue with 
Spark itself.

> Spark hangs while reading binary files from S3
> --
>
> Key: SPARK-22526
> URL: https://issues.apache.org/jira/browse/SPARK-22526
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: mohamed imran
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Hi,
> I am using Spark 2.2.0 (a recent version) to read binary files from S3. I use 
> sc.binaryFiles to read the files.
> It works fine for roughly the first 100 file reads, but later it hangs 
> indefinitely, from 5 up to 40 minutes, like the Avro file read issue (which 
> was fixed in later releases).
> I tried setting fs.s3a.connection.maximum to some maximum values, but that 
> didn't help.
> And finally I ended up setting the Spark speculation parameter, which again 
> didn't help much.
> One thing which I observed is that it is not closing the connection after 
> every read of binary files from S3.
> Example: sc.binaryFiles("s3a://test/test123.zip")
> Please look into this major issue!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22526) Spark hangs while reading binary files from S3

2017-11-17 Thread mohamed imran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257224#comment-16257224
 ] 

mohamed imran edited comment on SPARK-22526 at 11/17/17 5:04 PM:
-

[~srowen] I am processing inside a foreach loop, like this.
Example code:

Dataframe.collect().foreach { x =>
  val filepath = x.getAs[String]("filepath")
  val ziprdd = sc.binaryFiles(filepath)  // the filename will be, e.g., test.zip
  ziprdd.count()
}

I don't process Avro files. I am processing binary files, which are compressed 
normal CSV files, from S3.

After some 100 to 150 reads, Spark hangs while reading from S3.

I hope this info suffices to clarify the issue. Let me know if you need 
anything else.


was (Author: imranece59):
[~srowen] I am processing inside the foreach loop. like this
example code:-
Dataframe.collect.foreach{x=>


filepath = x.getAs("filepath")

ziprdd = sc.binaryfiles(s"$filepath") // filename will be test.zip(example)

ziprdd.count



}

i dont process Avro files. I am processing binary files which is compressed 
normal CSV files from S3.

After some 100th or above 150th read, spark gets hangs while reading from S3.

Hope this info is suffice to clarify the issues. Let me know if you need 
anything else.

> Spark hangs while reading binary files from S3
> --
>
> Key: SPARK-22526
> URL: https://issues.apache.org/jira/browse/SPARK-22526
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: mohamed imran
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Hi,
> I am using Spark 2.2.0 (a recent version) to read binary files from S3. I use 
> sc.binaryFiles to read the files.
> It works fine for roughly the first 100 file reads, but later it hangs 
> indefinitely, from 5 up to 40 minutes, like the Avro file read issue (which 
> was fixed in later releases).
> I tried setting fs.s3a.connection.maximum to some maximum values, but that 
> didn't help.
> And finally I ended up setting the Spark speculation parameter, which again 
> didn't help much.
> One thing which I observed is that it is not closing the connection after 
> every read of binary files from S3.
> Example: sc.binaryFiles("s3a://test/test123.zip")
> Please look into this major issue!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22526) Spark hangs while reading binary files from S3

2017-11-17 Thread mohamed imran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257224#comment-16257224
 ] 

mohamed imran commented on SPARK-22526:
---

[~srowen] I am processing inside a foreach loop, like this.
Example code:

Dataframe.collect().foreach { x =>
  val filepath = x.getAs[String]("filepath")
  val ziprdd = sc.binaryFiles(filepath)  // the filename will be, e.g., test.zip
  ziprdd.count()
}

I don't process Avro files. I am processing binary files, which are compressed 
normal CSV files, from S3.

After some 100 to 150 reads, Spark hangs while reading from S3.

I hope this info suffices to clarify the issue. Let me know if you need 
anything else.

> Spark hangs while reading binary files from S3
> --
>
> Key: SPARK-22526
> URL: https://issues.apache.org/jira/browse/SPARK-22526
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: mohamed imran
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Hi,
> I am using Spark 2.2.0 (a recent version) to read binary files from S3. I use 
> sc.binaryFiles to read the files.
> It works fine for roughly the first 100 file reads, but later it hangs 
> indefinitely, from 5 up to 40 minutes, like the Avro file read issue (which 
> was fixed in later releases).
> I tried setting fs.s3a.connection.maximum to some maximum values, but that 
> didn't help.
> And finally I ended up setting the Spark speculation parameter, which again 
> didn't help much.
> One thing which I observed is that it is not closing the connection after 
> every read of binary files from S3.
> Example: sc.binaryFiles("s3a://test/test123.zip")
> Please look into this major issue!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22526) Spark hangs while reading binary files from S3

2017-11-17 Thread mohamed imran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257224#comment-16257224
 ] 

mohamed imran edited comment on SPARK-22526 at 11/17/17 5:03 PM:
-

[~srowen] I am processing inside a foreach loop, like this.
Example code:

Dataframe.collect().foreach { x =>
  val filepath = x.getAs[String]("filepath")
  val ziprdd = sc.binaryFiles(filepath)  // the filename will be, e.g., test.zip
  ziprdd.count()
}

I don't process Avro files. I am processing binary files, which are compressed 
normal CSV files, from S3.

After some 100 to 150 reads, Spark hangs while reading from S3.

I hope this info suffices to clarify the issue. Let me know if you need 
anything else.


was (Author: imranece59):
[~srowen] I am processing inside the foreach loop. like this
example code:-
Dataframe.collect.foreach{x=>

filepath = x.getAs("filepath")
ziprdd = sc.binaryfiles(s"$filepath") // filename will be test.zip(example)
ziprdd.count

}

i dont process Avro files. I am processing binary files which is compressed 
normal CSV files from S3.

After some 100th or above 150th read, spark gets hangs while reading from S3.

Hope this info is suffice to clarify the issues. Let me know if you need 
anything else.

> Spark hangs while reading binary files from S3
> --
>
> Key: SPARK-22526
> URL: https://issues.apache.org/jira/browse/SPARK-22526
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: mohamed imran
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Hi,
> I am using Spark 2.2.0 (a recent version) to read binary files from S3. I use 
> sc.binaryFiles to read the files.
> It works fine for roughly the first 100 file reads, but later it hangs 
> indefinitely, from 5 up to 40 minutes, like the Avro file read issue (which 
> was fixed in later releases).
> I tried setting fs.s3a.connection.maximum to some maximum values, but that 
> didn't help.
> And finally I ended up setting the Spark speculation parameter, which again 
> didn't help much.
> One thing which I observed is that it is not closing the connection after 
> every read of binary files from S3.
> Example: sc.binaryFiles("s3a://test/test123.zip")
> Please look into this major issue!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22538) SQLTransformer.transform(inputDataFrame) uncaches inputDataFrame

2017-11-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22538:
---

Assignee: Liang-Chi Hsieh

> SQLTransformer.transform(inputDataFrame) uncaches inputDataFrame
> 
>
> Key: SPARK-22538
> URL: https://issues.apache.org/jira/browse/SPARK-22538
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark, SQL, Web UI
>Affects Versions: 2.2.0
>Reporter: MBA Learns to Code
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0, 2.2.2
>
>
> When running the below code on PySpark v2.2.0, the cached input DataFrame df 
> disappears from SparkUI after SQLTransformer.transform(...) is called on it.
> I don't yet know whether this is only a SparkUI bug, or the input DataFrame 
> df is indeed unpersisted from memory. If the latter is true, this can be a 
> serious bug because any new computation using new_df would have to re-do all 
> the work leading up to df.
> {code}
> import pandas
> import pyspark
> from pyspark.ml.feature import SQLTransformer
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(pandas.DataFrame(dict(x=[-1, 0, 1])))
> # after below step, SparkUI Storage shows 1 cached RDD
> df.cache(); df.count()
> # after below step, cached RDD disappears from SparkUI Storage
> new_df = SQLTransformer(statement='SELECT * FROM __THIS__').transform(df)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22538) SQLTransformer.transform(inputDataFrame) uncaches inputDataFrame

2017-11-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22538.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.2

Issue resolved by pull request 19772
[https://github.com/apache/spark/pull/19772]

> SQLTransformer.transform(inputDataFrame) uncaches inputDataFrame
> 
>
> Key: SPARK-22538
> URL: https://issues.apache.org/jira/browse/SPARK-22538
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark, SQL, Web UI
>Affects Versions: 2.2.0
>Reporter: MBA Learns to Code
> Fix For: 2.2.2, 2.3.0
>
>
> When running the below code on PySpark v2.2.0, the cached input DataFrame df 
> disappears from SparkUI after SQLTransformer.transform(...) is called on it.
> I don't yet know whether this is only a SparkUI bug, or the input DataFrame 
> df is indeed unpersisted from memory. If the latter is true, this can be a 
> serious bug because any new computation using new_df would have to re-do all 
> the work leading up to df.
> {code}
> import pandas
> import pyspark
> from pyspark.ml.feature import SQLTransformer
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(pandas.DataFrame(dict(x=[-1, 0, 1])))
> # after below step, SparkUI Storage shows 1 cached RDD
> df.cache(); df.count()
> # after below step, cached RDD disappears from SparkUI Storage
> new_df = SQLTransformer(statement='SELECT * FROM __THIS__').transform(df)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2017-11-17 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257142#comment-16257142
 ] 

Li Jin commented on SPARK-21187:


[~bryanc], the only type left is Decimal, and that depends on the Arrow 0.8 release, 
is that right?

> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>
> This is to track adding the remaining type support in Arrow Converters.  
> Currently, only primitive data types are supported.
> Remaining types:
> * -*Date*-
> * -*Timestamp*-
> * *Complex*: Struct, Array, Map
> * *Decimal*
> Some things to do before closing this out:
> * Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)
> * Need to add some user docs
> * Make sure Python tests are thorough
> * Check into the complex type support mentioned in comments by [~leif]; should we 
> support multi-indexing?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22409) Add function type argument to pandas_udf

2017-11-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22409:
---

Assignee: Li Jin

> Add function type argument to pandas_udf
> 
>
> Key: SPARK-22409
> URL: https://issues.apache.org/jira/browse/SPARK-22409
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Assignee: Li Jin
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22409) Add function type argument to pandas_udf

2017-11-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22409:

Affects Version/s: (was: 2.2.0)
   2.3.0

> Add function type argument to pandas_udf
> 
>
> Key: SPARK-22409
> URL: https://issues.apache.org/jira/browse/SPARK-22409
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Assignee: Li Jin
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22409) Add function type argument to pandas_udf

2017-11-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22409.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19630
[https://github.com/apache/spark/pull/19630]

> Add function type argument to pandas_udf
> 
>
> Key: SPARK-22409
> URL: https://issues.apache.org/jira/browse/SPARK-22409
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Li Jin
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22274) User-defined aggregation functions with pandas udf

2017-11-17 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257120#comment-16257120
 ] 

Li Jin commented on SPARK-22274:


Absolutely.

> User-defined aggregation functions with pandas udf
> --
>
> Key: SPARK-22274
> URL: https://issues.apache.org/jira/browse/SPARK-22274
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Li Jin
>
> This function doesn't implement partial aggregation and shuffles all data. A 
> UDAF that supports partial aggregation is not covered by this JIRA.
> Example:
> {code:python}
> @pandas_udf(DoubleType())
> def mean(v):
>     return v.mean()
> df.groupby('id').apply(mean(df.v1), mean(df.v2))
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-17 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257088#comment-16257088
 ] 

Marco Gaido commented on SPARK-22516:
-

I am not sure why, but this is caused by the fact that your file contains "CR LF" as 
the line separator instead of only LF.

> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as 
> last line's first character
> -
>
> Key: SPARK-22516
> URL: https://issues.apache.org/jira/browse/SPARK-22516
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>Priority: Minor
>  Labels: csvparser
> Attachments: testCommentChar.csv
>
>
> Try to read the attached CSV file with the following parse properties:
> scala> *val csvFile = spark.read.option("header","true").option("inferSchema", "true").option("parserLib", "univocity").option("comment", "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]
>
> scala> csvFile.show
> +---+---+
> |  a|  b|
> +---+---+
> +---+---+
> {color:#8eb021}*Noticed that it works fine.*{color}
> If we add the option "multiLine" = "true", it fails with the exception below. This 
> happens only if the "comment" character we pass equals the first character of the 
> input dataset's last line.
> scala> val csvFile = 
> *spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
>  "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End 
> of input reached
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> Header extraction enabled=null
> Headers=null
> Ignore leading whitespaces=false
> Ignore trailing whitespaces=false
> Input buffer size=128
> Input reading on separate thread=false
> Keep escape sequences=false
> Keep quotes=false
> Length of content displayed on error=-1
> Line separator detection enabled=false
> Maximum number of characters per column=-1
> Maximum number of columns=20480
> Normalize escaped line separators=true
> Null value=
> Number of records to read=all
> Processor=none
> Restricting data in exceptions=false
> RowProcessor error handler=null
> Selected fields=none
> Skip empty lines=true
> Unescaped quote handling=STOP_AT_DELIMITER
> Format configuration:
> CsvFormat:
> Comment character=c
> Field delimiter=,
> Line separator (normalized)=\n
> Line separator sequence=\r\n
> Quote character="
> Quote escape character=\
> Quote escape escape character=null
> Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
> at 
> 

[jira] [Commented] (SPARK-22532) Spark SQL function 'drop_duplicates' throws error when passing in a column that is an element of a struct

2017-11-17 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257020#comment-16257020
 ] 

Marco Gaido commented on SPARK-22532:
-

The reason is that `header.eventId.lo` is not a column name; it is an 
`Expression`. It is as if you were using any kind of expression to transform a 
column (e.g. `a + b` or `coalesce(a, 1)`), which is not supported by the 
`dropDuplicates` operation.

> Spark SQL function 'drop_duplicates' throws error when passing in a column 
> that is an element of a struct
> -
>
> Key: SPARK-22532
> URL: https://issues.apache.org/jira/browse/SPARK-22532
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
> Environment: Attempted on the following versions:
> * Spark 2.1 (CDH 5.9.2 w/ SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904)
> * Spark 2.1 (installed via homebrew)
> * Spark 2.2 (installed via homebrew)
> Also tried on Spark 1.6 that comes with CDH 5.9.2 and it works correctly; 
> this appears to be a regression.
>Reporter: Nicholas Hakobian
>
> When attempting to use drop_duplicates with a subset of columns that exist 
> within a struct, the following error is raised:
> {noformat}
> AnalysisException: u'Cannot resolve column name "header.eventId.lo" among 
> (header);'
> {noformat}
> A complete example (using old sqlContext syntax so the same code can be run 
> with Spark 1.x as well):
> {noformat}
> from pyspark.sql import Row
> from pyspark.sql.functions import *
> data = [
> Row(header=Row(eventId=Row(lo=0, hi=1))),
> Row(header=Row(eventId=Row(lo=0, hi=1))),
> Row(header=Row(eventId=Row(lo=1, hi=2))),
> Row(header=Row(eventId=Row(lo=2, hi=3))),
> ]
> df = sqlContext.createDataFrame(data)
> df.drop_duplicates(['header.eventId.lo', 'header.eventId.hi']).show()
> {noformat}
> produces the following stack trace:
> {noformat}
> ---
> AnalysisException Traceback (most recent call last)
>  in ()
>  11 df = sqlContext.createDataFrame(data)
>  12
> ---> 13 df.drop_duplicates(['header.eventId.lo', 'header.eventId.hi']).show()
> /usr/local/Cellar/apache-spark/2.2.0/libexec/python/pyspark/sql/dataframe.py 
> in dropDuplicates(self, subset)
>1243 jdf = self._jdf.dropDuplicates()
>1244 else:
> -> 1245 jdf = self._jdf.dropDuplicates(self._jseq(subset))
>1246 return DataFrame(jdf, self.sql_ctx)
>1247
> /usr/local/Cellar/apache-spark/2.2.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134
>1135 for temp_arg in temp_args:
> /usr/local/Cellar/apache-spark/2.2.0/libexec/python/pyspark/sql/utils.py in 
> deco(*a, **kw)
>  67  
> e.java_exception.getStackTrace()))
>  68 if s.startswith('org.apache.spark.sql.AnalysisException: 
> '):
> ---> 69 raise AnalysisException(s.split(': ', 1)[1], 
> stackTrace)
>  70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
>  71 raise AnalysisException(s.split(': ', 1)[1], 
> stackTrace)
> AnalysisException: u'Cannot resolve column name "header.eventId.lo" among 
> (header);'
> {noformat}
> This works _correctly_ in Spark 1.6, but fails in 2.1 (via homebrew and CDH) 
> and 2.2 (via homebrew)
> An inconvenient workaround (but it works) is the following:
> {noformat}
> (
> df
> .withColumn('lo', col('header.eventId.lo'))
> .withColumn('hi', col('header.eventId.hi'))
> .drop_duplicates(['lo', 'hi'])
> .drop('lo')
> .drop('hi')
> .show()
> )
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22541) Dataframes: applying multiple filters one after another using udfs and accumulators results in faulty accumulators

2017-11-17 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257004#comment-16257004
 ] 

Eric Maynard commented on SPARK-22541:
--

Yeah, this is a common problem when you have side effects in your 
transformations. If you need to enforce a specific order on your 
transformations or otherwise split them up (rather than letting Spark combine 
them), you can try putting actions -- like repartitions -- between the 
transformation functions.

> Dataframes: applying multiple filters one after another using udfs and 
> accumulators results in faulty accumulators
> --
>
> Key: SPARK-22541
> URL: https://issues.apache.org/jira/browse/SPARK-22541
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: pyspark 2.2.0, ubuntu
>Reporter: Janne K. Olesen
>
> I'm using UDF filters and accumulators to keep track of filtered rows in 
> DataFrames.
> If I apply multiple filters one after the other, they seem to be 
> executed in parallel, not in sequence, which messes with the accumulators I'm 
> using to keep track of filtered data. 
> {code:title=example.py|borderStyle=solid}
> from pyspark.sql.functions import udf, col
> from pyspark.sql.types import BooleanType
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> sc = spark.sparkContext
> df = spark.createDataFrame([("a", 1, 1), ("b", 2, 2), ("c", 3, 3)], ["key", 
> "val1", "val2"])
> def __myfilter(val, acc):
> if val < 2:
> return True
> else:
> acc.add(1)
> return False
> acc1 = sc.accumulator(0)
> acc2 = sc.accumulator(0)
> def myfilter1(val):
> return __myfilter(val, acc1)
> def myfilter2(val):
> return __myfilter(val, acc2)
> my_udf1 = udf(myfilter1, BooleanType())
> my_udf2 = udf(myfilter2, BooleanType())
> df.show()
> # +---+++
> # |key|val1|val2|
> # +---+++
> # |  a|   1|   1|
> # |  b|   2|   2|
> # |  c|   3|   3|
> # +---+++
> df = df.filter(my_udf1(col("val1")))
> # df.show()
> # +---+++
> # |key|val1|val2|
> # +---+++
> # |  a|   1|   1|
> # +---+++
> # expected acc1: 2
> # expected acc2: 0
> df = df.filter(my_udf2(col("val2")))
> # df.show()
> # +---+++
> # |key|val1|val2|
> # +---+++
> # |  a|   1|   1|
> # +---+++
> # expected acc1: 2
> # expected acc2: 0
> df.show()
> print("acc1: %s" % acc1.value)  # expected 2, is 2 OK
> print("acc2: %s" % acc2.value)  # expected 0, is 2 !!!
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22343) Add support for publishing Spark metrics into Prometheus

2017-11-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256994#comment-16256994
 ] 

Apache Spark commented on SPARK-22343:
--

User 'matyix' has created a pull request for this issue:
https://github.com/apache/spark/pull/19775

> Add support for publishing Spark metrics into Prometheus
> 
>
> Key: SPARK-22343
> URL: https://issues.apache.org/jira/browse/SPARK-22343
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Janos Matyas
>
> I've created a PR (https://github.com/apache-spark-on-k8s/spark/pull/531) to 
> support publishing Spark metrics to Prometheus in the 
> https://github.com/apache-spark-on-k8s/spark fork (Spark on Kubernetes). 
> According to the maintainers of the project I should create a ticket here as 
> well, so the work can be tracked upstream. The original text of the PR is below: 
> _
> Publishing Spark metrics into Prometheus - as discussed earlier in 
> https://github.com/apache-spark-on-k8s/spark/pull/384. 
> Implemented a metrics sink that publishes Spark metrics into Prometheus via the 
> [Prometheus Pushgateway](https://prometheus.io/docs/instrumenting/pushing/). 
> Metrics data published by Spark is based on 
> [Dropwizard](http://metrics.dropwizard.io/). The format of Spark metrics is 
> not natively supported by Prometheus, so they are converted using 
> [DropwizardExports](https://prometheus.io/client_java/io/prometheus/client/dropwizard/DropwizardExports.html)
>  prior to pushing them to the Pushgateway.
> Also, the default Prometheus Pushgateway client API implementation does not 
> support metric timestamps, so the client API has been enhanced to enrich 
> metrics data with timestamps. _
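
For readers unfamiliar with the push model the PR builds on, here is a minimal 
sketch using the generic Python Prometheus client (this is not the Java/Dropwizard 
sink implemented in the PR; the gateway address, job name and metric are made up):

{code:title=pushgateway_sketch.py|borderStyle=solid}
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
gauge = Gauge('spark_job_last_success_unixtime',
              'Last time a Spark batch job finished successfully',
              registry=registry)
gauge.set_to_current_time()

# Push the collected metrics to a (hypothetical) Pushgateway instance;
# Prometheus then scrapes the gateway instead of the short-lived job.
push_to_gateway('localhost:9091', job='spark_batch', registry=registry)
{code}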



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22343) Add support for publishing Spark metrics into Prometheus

2017-11-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22343:


Assignee: (was: Apache Spark)

> Add support for publishing Spark metrics into Prometheus
> 
>
> Key: SPARK-22343
> URL: https://issues.apache.org/jira/browse/SPARK-22343
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Janos Matyas
>
> I've created a PR (https://github.com/apache-spark-on-k8s/spark/pull/531) to 
> support publishing Spark metrics to Prometheus in the 
> https://github.com/apache-spark-on-k8s/spark fork (Spark on Kubernetes). 
> According to the maintainers of the project I should create a ticket here as 
> well, so the work can be tracked upstream. The original text of the PR is below: 
> _
> Publishing Spark metrics into Prometheus - as discussed earlier in 
> https://github.com/apache-spark-on-k8s/spark/pull/384. 
> Implemented a metrics sink that publishes Spark metrics into Prometheus via the 
> [Prometheus Pushgateway](https://prometheus.io/docs/instrumenting/pushing/). 
> Metrics data published by Spark is based on 
> [Dropwizard](http://metrics.dropwizard.io/). The format of Spark metrics is 
> not natively supported by Prometheus, so they are converted using 
> [DropwizardExports](https://prometheus.io/client_java/io/prometheus/client/dropwizard/DropwizardExports.html)
>  prior to pushing them to the Pushgateway.
> Also, the default Prometheus Pushgateway client API implementation does not 
> support metric timestamps, so the client API has been enhanced to enrich 
> metrics data with timestamps. _



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22343) Add support for publishing Spark metrics into Prometheus

2017-11-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22343:


Assignee: Apache Spark

> Add support for publishing Spark metrics into Prometheus
> 
>
> Key: SPARK-22343
> URL: https://issues.apache.org/jira/browse/SPARK-22343
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Janos Matyas
>Assignee: Apache Spark
>
> I've created a PR (https://github.com/apache-spark-on-k8s/spark/pull/531) to 
> support publishing Spark metrics to Prometheus in the 
> https://github.com/apache-spark-on-k8s/spark fork (Spark on Kubernetes). 
> According to the maintainers of the project I should create a ticket here as 
> well, so the work can be tracked upstream. The original text of the PR is below: 
> _
> Publishing Spark metrics into Prometheus - as discussed earlier in 
> https://github.com/apache-spark-on-k8s/spark/pull/384. 
> Implemented a metrics sink that publishes Spark metrics into Prometheus via the 
> [Prometheus Pushgateway](https://prometheus.io/docs/instrumenting/pushing/). 
> Metrics data published by Spark is based on 
> [Dropwizard](http://metrics.dropwizard.io/). The format of Spark metrics is 
> not natively supported by Prometheus, so they are converted using 
> [DropwizardExports](https://prometheus.io/client_java/io/prometheus/client/dropwizard/DropwizardExports.html)
>  prior to pushing them to the Pushgateway.
> Also, the default Prometheus Pushgateway client API implementation does not 
> support metric timestamps, so the client API has been enhanced to enrich 
> metrics data with timestamps. _



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22540) HighlyCompressedMapStatus's avgSize is incorrect

2017-11-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22540.
---
   Resolution: Fixed
Fix Version/s: 2.2.2
   2.3.0

> HighlyCompressedMapStatus's avgSize is incorrect
> 
>
> Key: SPARK-22540
> URL: https://issues.apache.org/jira/browse/SPARK-22540
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: yucai
>Assignee: yucai
> Fix For: 2.3.0, 2.2.2
>
>
> The calculation of HighlyCompressedMapStatus's avgSize is incorrect. 
> Currently it is computed as "sum of small block sizes / count of all non-empty 
> blocks", but the count of all non-empty blocks includes the huge blocks as 
> well; the denominator should be the count of small blocks only.
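
A small numeric illustration of the arithmetic (plain Python with made-up block 
sizes and cutoff; not the Spark implementation itself):

{code:title=avg_size_sketch.py|borderStyle=solid}
sizes = [100, 200, 300, 0, 50_000_000]   # hypothetical shuffle block sizes
threshold = 1_000_000                    # hypothetical "huge block" cutoff

small = [s for s in sizes if 0 < s < threshold]
non_empty = [s for s in sizes if s > 0]

buggy_avg = sum(small) / len(non_empty)  # 600 / 4 = 150, underestimates
fixed_avg = sum(small) / len(small)      # 600 / 3 = 200
{code}

Since the huge blocks are tracked separately, counting them in the denominator 
only deflates the estimated size of every small block.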



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22540) HighlyCompressedMapStatus's avgSize is incorrect

2017-11-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22540:
-

Assignee: yucai

> HighlyCompressedMapStatus's avgSize is incorrect
> 
>
> Key: SPARK-22540
> URL: https://issues.apache.org/jira/browse/SPARK-22540
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: yucai
>Assignee: yucai
>
> The calculation of HighlyCompressedMapStatus's avgSize is incorrect. 
> Currently it is computed as "sum of small block sizes / count of all non-empty 
> blocks", but the count of all non-empty blocks includes the huge blocks as 
> well; the denominator should be the count of small blocks only.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22528) History service and non-HDFS filesystems

2017-11-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22528.
---
Resolution: Invalid

Move to the mailing list for now; not obviously something due to Spark.

> History service and non-HDFS filesystems
> 
>
> Key: SPARK-22528
> URL: https://issues.apache.org/jira/browse/SPARK-22528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: paul mackles
>Priority: Minor
>
> We are using Azure Data Lake (ADL) to store our event logs. This worked fine 
> in 2.1.x but in 2.2.0, the event logs are no longer visible to the history 
> server. I tracked it down to the call to:
> {code}
> SparkHadoopUtil.get.checkAccessPermission()
> {code}
> which was added to "FSHistoryProvider" in 2.2.0.
> I was able to workaround it by:
> * setting the files on ADL to world readable
> * setting HADOOP_PROXY to the Azure objectId of the service principal that 
> owns the file
> Neither of those workarounds is particularly desirable in our environment. 
> That said, I am not sure how this should be addressed:
> * Is this an issue with the Azure/Hadoop bindings not setting up the user 
> context correctly so that the "checkAccessPermission()" call succeeds w/out 
> having to use the username under which the process is running?
> * Is this an issue with "checkAccessPermission()" not really accounting for 
> all of the possible FileSystem implementations? If so, I would imagine that 
> there are similar issues when using S3.
> In spite of this check, I know the files are accessible through the 
> underlying FileSystem object so it feels like the latter but I don't think 
> that the FileSystem object alone could be used to implement this check.
> Any thoughts [~jerryshao]?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22526) Spark hangs while reading binary files from S3

2017-11-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256986#comment-16256986
 ] 

Sean Owen commented on SPARK-22526:
---

You're only showing a read of one file, a .zip file. I don't see any Avro 
files, or 100 files here. I'd have to close this.

> Spark hangs while reading binary files from S3
> --
>
> Key: SPARK-22526
> URL: https://issues.apache.org/jira/browse/SPARK-22526
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: mohamed imran
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Hi,
> I am using Spark 2.2.0 (a recent version) to read binary files from S3, using 
> sc.binaryFiles.
> It works fine for roughly the first 100 files, but then it hangs indefinitely 
> for anywhere from 5 to 40 minutes, much like the Avro file read issue (which 
> was fixed in later releases).
> I tried setting fs.s3a.connection.maximum to some larger values, but that 
> didn't help.
> Finally I ended up enabling Spark speculation, which again didn't help much. 
> One thing I observed is that the connection is not closed after every read of 
> a binary file from S3.
> Example: sc.binaryFiles("s3a://test/test123.zip")
> Please look into this major issue!
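
For reference, a minimal sketch of how such S3A settings are typically passed 
through Spark (the value, bucket and key are illustrative only, and this alone 
does not fix the reported hang):

{code:title=s3a_conf_sketch.py|borderStyle=solid}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("binary-files-from-s3")
         # spark.hadoop.* entries are forwarded to the Hadoop configuration
         .config("spark.hadoop.fs.s3a.connection.maximum", "200")
         .getOrCreate())

rdd = spark.sparkContext.binaryFiles("s3a://test/test123.zip")
print(rdd.keys().collect())   # list the paths that were read
{code}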



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22518) Make default cache storage level configurable

2017-11-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22518.
---
Resolution: Won't Fix

> Make default cache storage level configurable
> -
>
> Key: SPARK-22518
> URL: https://issues.apache.org/jira/browse/SPARK-22518
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Rares Mirica
>Priority: Minor
>
> Caching defaults to the hard-coded storage level MEMORY_ONLY, and since most 
> users call the convenient .cache() method, this value is not configurable in a 
> global way. Please make it configurable through a Spark config option.
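
Until such an option exists, the usual workaround is to replace .cache() with an 
explicit .persist(); a minimal sketch (df is any DataFrame):

{code:title=persist_sketch.py|borderStyle=solid}
from pyspark import StorageLevel

# Same effect as .cache(), but with an explicitly chosen storage level.
df = df.persist(StorageLevel.MEMORY_AND_DISK)
{code}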



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22513) Provide build profile for hadoop 2.8

2017-11-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22513.
---
Resolution: Won't Fix

> Provide build profile for hadoop 2.8
> 
>
> Key: SPARK-22513
> URL: https://issues.apache.org/jira/browse/SPARK-22513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Christine Koppelt
>
> Hadoop 2.8 comes with a patch that is necessary to make it run on NixOS [1]. 
> Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8.
> [1] 
> https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22475) show histogram in DESC COLUMN command

2017-11-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22475:


Assignee: Apache Spark

> show histogram in DESC COLUMN command
> -
>
> Key: SPARK-22475
> URL: https://issues.apache.org/jira/browse/SPARK-22475
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22475) show histogram in DESC COLUMN command

2017-11-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22475:


Assignee: (was: Apache Spark)

> show histogram in DESC COLUMN command
> -
>
> Key: SPARK-22475
> URL: https://issues.apache.org/jira/browse/SPARK-22475
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22475) show histogram in DESC COLUMN command

2017-11-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256924#comment-16256924
 ] 

Apache Spark commented on SPARK-22475:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/19774

> show histogram in DESC COLUMN command
> -
>
> Key: SPARK-22475
> URL: https://issues.apache.org/jira/browse/SPARK-22475
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22274) User-defined aggregation functions with pandas udf

2017-11-17 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256919#comment-16256919
 ] 

holdenk commented on SPARK-22274:
-

Wonderful, do ping me on the PR then :)

> User-defined aggregation functions with pandas udf
> --
>
> Key: SPARK-22274
> URL: https://issues.apache.org/jira/browse/SPARK-22274
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Li Jin
>
> This function doesn't implement partial aggregation and shuffles all data. A 
> UDAF that supports partial aggregation is not covered by this JIRA.
> Example:
> {code:python}
> @pandas_udf(DoubleType())
> def mean(v):
>   return v.mean()
> df.groupby('id').apply(mean(df.v1), mean(df.v2))
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22042) ReorderJoinPredicates can break when child's partitioning is not decided

2017-11-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22042:

Affects Version/s: (was: 2.2.0)
   (was: 2.1.0)
   2.3.0

> ReorderJoinPredicates can break when child's partitioning is not decided
> 
>
> Key: SPARK-22042
> URL: https://issues.apache.org/jira/browse/SPARK-22042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Tejas Patil
>
> When `ReorderJoinPredicates` tries to get the `outputPartitioning` of its 
> children, the children may not be properly constructed as the child-subtree 
> has to still go through other planner rules.
> In this particular case, the child is `SortMergeJoinExec`. Since the required 
> `Exchange` operators are not in place (because `EnsureRequirements` runs 
> _after_ `ReorderJoinPredicates`), the join's children would not have 
> partitioning defined. This breaks while creating the `PartitioningCollection` 
> here: 
> https://github.com/apache/spark/blob/94439997d57875838a8283c543f9b44705d3a503/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L69
> Small repro:
> {noformat}
> context.sql("SET spark.sql.autoBroadcastJoinThreshold=0")
> val df = (0 until 50).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", 
> "k")
> df.write.format("parquet").saveAsTable("table1")
> df.write.format("parquet").saveAsTable("table2")
> df.write.format("parquet").bucketBy(8, "j", "k").saveAsTable("bucketed_table")
> sql("""
>   SELECT *
>   FROM (
> SELECT a.i, a.j, a.k
> FROM bucketed_table a
> JOIN table1 b
> ON a.i = b.i
>   ) c
>   JOIN table2
>   ON c.i = table2.i
> """).explain
> {noformat}
> This fails with :
> {noformat}
> java.lang.IllegalArgumentException: requirement failed: 
> PartitioningCollection requires all of its partitionings have the same 
> numPartitions.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.physical.PartitioningCollection.(partitioning.scala:324)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputPartitioning(SortMergeJoinExec.scala:69)
>   at 
> org.apache.spark.sql.execution.ProjectExec.outputPartitioning(basicPhysicalOperators.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:91)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:76)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:76)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:34)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:147)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
>   at org.apache.spark.sql.Dataset.explain(Dataset.scala:464)
>   at 

[jira] [Commented] (SPARK-22546) Allow users to update the dataType of a column

2017-11-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256901#comment-16256901
 ] 

Apache Spark commented on SPARK-22546:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/19773

> Allow users to update the dataType of a column
> --
>
> Key: SPARK-22546
> URL: https://issues.apache.org/jira/browse/SPARK-22546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Li Yuanjian
>
> [SPARK-17910|https://issues.apache.org/jira/browse/SPARK-17910] added support 
> for changing a column's comment, and that patch left a TODO for other column 
> metadata. This ticket implements only the dataType change; other changes, such 
> as renaming a column or changing its position, may have different 
> considerations for datasource tables and Hive tables and probably need more 
> discussion.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22546) Allow users to update the dataType of a column

2017-11-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22546:


Assignee: (was: Apache Spark)

> Allow users to update the dataType of a column
> --
>
> Key: SPARK-22546
> URL: https://issues.apache.org/jira/browse/SPARK-22546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Li Yuanjian
>
> [SPARK-17910|https://issues.apache.org/jira/browse/SPARK-17910] added support 
> for changing a column's comment, and that patch left a TODO for other column 
> metadata. This ticket implements only the dataType change; other changes, such 
> as renaming a column or changing its position, may have different 
> considerations for datasource tables and Hive tables and probably need more 
> discussion.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22546) Allow users to update the dataType of a column

2017-11-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22546:


Assignee: Apache Spark

> Allow users to update the dataType of a column
> --
>
> Key: SPARK-22546
> URL: https://issues.apache.org/jira/browse/SPARK-22546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Li Yuanjian
>Assignee: Apache Spark
>
> [SPARK-17910|https://issues.apache.org/jira/browse/SPARK-17910] added support 
> for changing a column's comment, and that patch left a TODO for other column 
> metadata. This ticket implements only the dataType change; other changes, such 
> as renaming a column or changing its position, may have different 
> considerations for datasource tables and Hive tables and probably need more 
> discussion.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22546) Allow users to update the dataType of a column

2017-11-17 Thread Li Yuanjian (JIRA)
Li Yuanjian created SPARK-22546:
---

 Summary: Allow users to update the dataType of a column
 Key: SPARK-22546
 URL: https://issues.apache.org/jira/browse/SPARK-22546
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Li Yuanjian


[SPARK-17910|https://issues.apache.org/jira/browse/SPARK-17910] added support for 
changing a column's comment, and that patch left a TODO for other column metadata. 
This ticket implements only the dataType change; other changes, such as renaming a 
column or changing its position, may have different considerations for datasource 
tables and Hive tables and probably need more discussion.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14959) ​Problem Reading partitioned ORC or Parquet files

2017-11-17 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256789#comment-16256789
 ] 

Steve Loughran commented on SPARK-14959:


Came across a reference to this while scanning for getFileBlockLocations() use.

HDFS shouldn't be throwing this. {{getFileBlockLocations(Path, offset, len)}} 
is nominally the same as {{getFileBlockLocations(getFileStatus(Path), offset, 
len)}}; the latter will return an empty array on a directory. It looks like the 
HDFS behaviour has been there for years, and people can argue that it's the 
correct behaviour, but it is the only subclass that departs from the base 
FileSystem implementation, which doesn't fail on a directory. Maybe it can be 
fixed; at the very least the behaviour needs to be specified explicitly. 

> ​Problem Reading partitioned ORC or Parquet files
> -
>
> Key: SPARK-14959
> URL: https://issues.apache.org/jira/browse/SPARK-14959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Hadoop 2.7.1.2.4.0.0-169 (HDP 2.4)
>Reporter: Sebastian YEPES FERNANDEZ
>Assignee: Xin Wu
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Hello,
> I have noticed that in the past few days there has been an issue when trying 
> to read partitioned files from HDFS.
> I am running on Spark master branch #c544356
> The write actually works but the read fails.
> {code:title=Issue Reproduction}
> case class Data(id: Int, text: String)
> val ds = spark.createDataset( Seq(Data(0, "hello"), Data(1, "hello"), Data(0, 
> "world"), Data(1, "there")) )
> scala> 
> ds.write.mode(org.apache.spark.sql.SaveMode.Overwrite).format("parquet").partitionBy("id").save("/user/spark/test.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".  
>   
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> java.io.FileNotFoundException: Path is not a file: 
> /user/spark/test.parquet/id=0
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1242)
>   at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
>   at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1285)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> 

[jira] [Commented] (SPARK-22541) Dataframes: applying multiple filters one after another using udfs and accumulators results in faulty accumulators

2017-11-17 Thread Janne K. Olesen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256676#comment-16256676
 ] 

Janne K. Olesen commented on SPARK-22541:
-

I agree, the filtered results are correct, but that is beside the point. It 
seems like query optimization does something like
{noformat}
  for row in df:
 result_a = filter1(row)
 result_b = filter2(row)
 result = result_a && result_b
{noformat}

but in my opinion it should be 
{noformat}
  for row in df:
     result_a = filter1(row)
     if result_a == True:
        return filter2(row)
     else:
        return False
{noformat}

If filter2 is executed regardless of the result of filter1, this can lead to 
strange errors. Consider the following example:
{code:title=Example.py}
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_input = spark.createDataFrame([("a", None), ("b", 2), ("c", 3)], ["key", 
"val"])

# works as expected
df = df_input.filter(col("val").isNotNull())
df = df.filter(col("val") > 2)
df.show()

# this will raise an error and fail
# TypeError: '>' not supported between instances of 'NoneType' and 'int'
isNotNone = udf(lambda x: x is not None, BooleanType())
filter2 = udf(lambda x: x > 2, BooleanType())
df = df_input.filter(isNotNone(col("val")))
df = df.filter(filter2(col("val")))
df.show()
{code}
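
One defensive workaround (it does not change the evaluation order, it just makes 
the second UDF safe to run on rows the first filter should have removed) is to 
handle the bad values inside each UDF; a sketch building on the example above:

{code:title=defensive_udf_sketch.py|borderStyle=solid}
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

# filter2 now tolerates None itself instead of relying on the null filter
# having already been applied.
filter2_safe = udf(lambda x: x is not None and x > 2, BooleanType())

df = df_input.filter(col("val").isNotNull())
df = df.filter(filter2_safe(col("val")))
df.show()
{code}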



> Dataframes: applying multiple filters one after another using udfs and 
> accumulators results in faulty accumulators
> --
>
> Key: SPARK-22541
> URL: https://issues.apache.org/jira/browse/SPARK-22541
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: pyspark 2.2.0, ubuntu
>Reporter: Janne K. Olesen
>
> I'm using udf filters and accumulators to keep track of filtered rows in 
> dataframes.
> If I'm applying multiple filters one after the other, they seem to be 
> executed in parallel, not in sequence, which messes with the accumulators I'm 
> using to keep track of filtered data. 
> {code:title=example.py|borderStyle=solid}
> from pyspark.sql.functions import udf, col
> from pyspark.sql.types import BooleanType
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> sc = spark.sparkContext
> df = spark.createDataFrame([("a", 1, 1), ("b", 2, 2), ("c", 3, 3)], ["key", 
> "val1", "val2"])
> def __myfilter(val, acc):
> if val < 2:
> return True
> else:
> acc.add(1)
> return False
> acc1 = sc.accumulator(0)
> acc2 = sc.accumulator(0)
> def myfilter1(val):
> return __myfilter(val, acc1)
> def myfilter2(val):
> return __myfilter(val, acc2)
> my_udf1 = udf(myfilter1, BooleanType())
> my_udf2 = udf(myfilter2, BooleanType())
> df.show()
> # +---+++
> # |key|val1|val2|
> # +---+++
> # |  a|   1|   1|
> # |  b|   2|   2|
> # |  c|   3|   3|
> # +---+++
> df = df.filter(my_udf1(col("val1")))
> # df.show()
> # +---+++
> # |key|val1|val2|
> # +---+++
> # |  a|   1|   1|
> # +---+++
> # expected acc1: 2
> # expected acc2: 0
> df = df.filter(my_udf2(col("val2")))
> # df.show()
> # +---+++
> # |key|val1|val2|
> # +---+++
> # |  a|   1|   1|
> # +---+++
> # expected acc1: 2
> # expected acc2: 0
> df.show()
> print("acc1: %s" % acc1.value)  # expected 2, is 2 OK
> print("acc2: %s" % acc2.value)  # expected 0, is 2 !!!
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22495) Fix setup of SPARK_HOME variable on Windows

2017-11-17 Thread Jakub Nowacki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakub Nowacki reassigned SPARK-22495:
-

Assignee: Jakub Nowacki

> Fix setup of SPARK_HOME variable on Windows
> ---
>
> Key: SPARK-22495
> URL: https://issues.apache.org/jira/browse/SPARK-22495
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Windows
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Jakub Nowacki
>Priority: Minor
>
> On Windows, pip installed pyspark is unable to find out the spark home. There 
> is already proposed change, sufficient details and discussions in 
> https://github.com/apache/spark/pull/19370 and SPARK-18136



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org