[jira] [Assigned] (SPARK-22521) VectorIndexerModel support handle unseen categories via handleInvalid: Python API

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22521:


Assignee: Apache Spark

> VectorIndexerModel support handle unseen categories via handleInvalid: Python 
> API
> -
>
> Key: SPARK-22521
> URL: https://issues.apache.org/jira/browse/SPARK-22521
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>
> VectorIndexerModel support handle unseen categories via handleInvalid: Python 
> API






[jira] [Assigned] (SPARK-22521) VectorIndexerModel support handle unseen categories via handleInvalid: Python API

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22521:


Assignee: (was: Apache Spark)

> VectorIndexerModel support handle unseen categories via handleInvalid: Python 
> API
> -
>
> Key: SPARK-22521
> URL: https://issues.apache.org/jira/browse/SPARK-22521
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>
> VectorIndexerModel support handle unseen categories via handleInvalid: Python 
> API






[jira] [Commented] (SPARK-22521) VectorIndexerModel support handle unseen categories via handleInvalid: Python API

2017-11-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252999#comment-16252999
 ] 

Apache Spark commented on SPARK-22521:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/19753

> VectorIndexerModel support handle unseen categories via handleInvalid: Python 
> API
> -
>
> Key: SPARK-22521
> URL: https://issues.apache.org/jira/browse/SPARK-22521
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>
> VectorIndexerModel support handle unseen categories via handleInvalid: Python 
> API






[jira] [Comment Edited] (SPARK-19371) Cannot spread cached partitions evenly across executors

2017-11-14 Thread Thunder Stumpges (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252953#comment-16252953
 ] 

Thunder Stumpges edited comment on SPARK-19371 at 11/15/17 5:50 AM:


In my case, it is important because that cached RDD is partitioned carefully, 
and used to join to a streaming dataset every 10 seconds to perform processing. 
The stream RDD is shuffled to align/join with this data-set, and work is then 
done. This runs for days or months at a time. If I could just once at the 
beginning move the RDD blocks so they're balanced, the benefit of that one time 
move (all in memory) would pay off many times over. 

Setting locality.wait=0 only causes this same network traffic to be ongoing (a 
task starting on one executor and its RDD block being on another executor), 
correct? BTW, I have tried my job with spark.locality.wait=0, but it seems to 
have limited effect. Tasks are still being executed on the executors with the 
RDD blocks. Only two tasks did NOT run as PROCESS_LOCAL, and they took many 
times longer to execute:

!execution timeline.png|width=800, height=344!

It all seems very clear in my head, am I missing something? Is this a really 
complicated thing to do? 

Basically the operation would be a "rebalance" that applies to cached RDDs and 
would attempt to balance the distribution of blocks across executors. 


was (Author: thunderstumpges):
In my case, it is important because that cached RDD is partitioned carefully, 
and used to join to a streaming dataset every 10 seconds to perform processing. 
The stream RDD is shuffled to align/join with this data-set, and work is then 
done. This runs for days or months at a time. If I could just once at the 
beginning move the RDD blocks so they're balanced, the benefit of that one time 
move (all in memory) would pay off many times over. 

Setting locality.wait=0 only causes this same network traffic to be ongoing (a 
task starting on one executor and its RDD block being on another executor), 
correct? BTW, I have tried my job with spark.locality.wait=0, but it seems to 
have limited effect. Tasks are still being executed on the executors with the 
RDD blocks. Only two tasks did NOT run as PROCESS_LOCAL, and they took many 
times longer to execute:

!execution timeline.png!

It all seems very clear in my head, am I missing something? Is this a really 
complicated thing to do? 

Basically the operation would be a "rebalance" that applies to cached RDDs and 
would attempt to balance the distribution of blocks across executors. 

> Cannot spread cached partitions evenly across executors
> ---
>
> Key: SPARK-19371
> URL: https://issues.apache.org/jira/browse/SPARK-19371
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Thunder Stumpges
> Attachments: RDD Block Distribution on two executors.png, Unbalanced 
> RDD Blocks, and resulting task imbalance.png, Unbalanced RDD Blocks, and 
> resulting task imbalance.png, execution timeline.png
>
>
> Before running an intensive iterative job (in this case a distributed topic 
> model training), we need to load a dataset and persist it across executors. 
> After loading from HDFS and persisting, the partitions are spread unevenly 
> across executors (based on the initial scheduling of the reads which are not 
> data locale sensitive). The partition sizes are even, just not their 
> distribution over executors. We currently have no way to force the partitions 
> to spread evenly, and as the iterative algorithm begins, tasks are 
> distributed to executors based on this initial load, forcing some very 
> unbalanced work.
> This has been mentioned a 
> [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059]
>  of 
> [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html]
>  in 
> [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html]
>  user/dev group threads.
> None of the discussions I could find had solutions that worked for me. Here 
> are examples of things I have tried. All resulted in partitions in memory 
> that were NOT evenly distributed to executors, causing future tasks to be 
> imbalanced across executors as well.
> *Reduce Locality*
> {code}spark.shuffle.reduceLocality.enabled=false/true{code}
> *"Legacy" memory mode*
> {code}spark.memory.useLegacyMode = true/false{code}
> *Basic load and repartition*
> {code}
> val numPartitions = 48*16
> val df = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions).
> persist
> df.count
> {code}
> *Load and repartition to 2x partitions, then shuffle repartition 

[jira] [Comment Edited] (SPARK-19371) Cannot spread cached partitions evenly across executors

2017-11-14 Thread Thunder Stumpges (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252969#comment-16252969
 ] 

Thunder Stumpges edited comment on SPARK-19371 at 11/15/17 5:52 AM:


Here's a view of something that happened this time around as I started the job 
(this time with locality.wait using default). This RDD somehow got all 120 
partitions on ONLY TWO executors, both on the same physical machine. From now 
on, all jobs either run on that one single machine, or (if I set 
locality.wait=0) move that data to other executors before executing, correct?
!RDD Block Distribution on two executors.png|width=800, height=306!



was (Author: thunderstumpges):
Here's a view of something that happened this time around as I started the job 
(this time with locality.wait using default). This RDD somehow got all 120 
partitions on ONLY TWO executors, both on the same physical machine. From now 
on, all jobs either run on that one single machine, or (if I set 
locality.wait=0) move that data to other executors before executing, correct?
!RDD Block Distribution on two executors.png!


> Cannot spread cached partitions evenly across executors
> ---
>
> Key: SPARK-19371
> URL: https://issues.apache.org/jira/browse/SPARK-19371
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Thunder Stumpges
> Attachments: RDD Block Distribution on two executors.png, Unbalanced 
> RDD Blocks, and resulting task imbalance.png, Unbalanced RDD Blocks, and 
> resulting task imbalance.png, execution timeline.png
>
>
> Before running an intensive iterative job (in this case a distributed topic 
> model training), we need to load a dataset and persist it across executors. 
> After loading from HDFS and persisting, the partitions are spread unevenly 
> across executors (based on the initial scheduling of the reads which are not 
> data locale sensitive). The partition sizes are even, just not their 
> distribution over executors. We currently have no way to force the partitions 
> to spread evenly, and as the iterative algorithm begins, tasks are 
> distributed to executors based on this initial load, forcing some very 
> unbalanced work.
> This has been mentioned a 
> [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059]
>  of 
> [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html]
>  in 
> [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html]
>  user/dev group threads.
> None of the discussions I could find had solutions that worked for me. Here 
> are examples of things I have tried. All resulted in partitions in memory 
> that were NOT evenly distributed to executors, causing future tasks to be 
> imbalanced across executors as well.
> *Reduce Locality*
> {code}spark.shuffle.reduceLocality.enabled=false/true{code}
> *"Legacy" memory mode*
> {code}spark.memory.useLegacyMode = true/false{code}
> *Basic load and repartition*
> {code}
> val numPartitions = 48*16
> val df = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions).
> persist
> df.count
> {code}
> *Load and repartition to 2x partitions, then shuffle repartition down to 
> desired partitions*
> {code}
> val numPartitions = 48*16
> val df2 = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions*2)
> val df = df2.repartition(numPartitions).
> persist
> df.count
> {code}
> It would be great if when persisting an RDD/DataFrame, if we could request 
> that those partitions be stored evenly across executors in preparation for 
> future tasks. 
> I'm not sure if this is a more general issue (I.E. not just involving 
> persisting RDDs), but for the persisted in-memory case, it can make a HUGE 
> difference in the over-all running time of the remaining work.






[jira] [Commented] (SPARK-19371) Cannot spread cached partitions evenly across executors

2017-11-14 Thread Thunder Stumpges (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252969#comment-16252969
 ] 

Thunder Stumpges commented on SPARK-19371:
--

Here's a view of something that happened this time around as I started the job 
(this time with locality.wait using default). This RDD somehow got all 120 
partitions on ONLY TWO executors, both on the same physical machine. From now 
on, all jobs either run on that one single machine, or (if I set 
locality.wait=0) move that data to other executors before executing, correct?
!RDD Block Distribution on two executors.png!


> Cannot spread cached partitions evenly across executors
> ---
>
> Key: SPARK-19371
> URL: https://issues.apache.org/jira/browse/SPARK-19371
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Thunder Stumpges
> Attachments: RDD Block Distribution on two executors.png, Unbalanced 
> RDD Blocks, and resulting task imbalance.png, Unbalanced RDD Blocks, and 
> resulting task imbalance.png, execution timeline.png
>
>
> Before running an intensive iterative job (in this case a distributed topic 
> model training), we need to load a dataset and persist it across executors. 
> After loading from HDFS and persisting, the partitions are spread unevenly 
> across executors (based on the initial scheduling of the reads which are not 
> data locale sensitive). The partition sizes are even, just not their 
> distribution over executors. We currently have no way to force the partitions 
> to spread evenly, and as the iterative algorithm begins, tasks are 
> distributed to executors based on this initial load, forcing some very 
> unbalanced work.
> This has been mentioned a 
> [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059]
>  of 
> [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html]
>  in 
> [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html]
>  user/dev group threads.
> None of the discussions I could find had solutions that worked for me. Here 
> are examples of things I have tried. All resulted in partitions in memory 
> that were NOT evenly distributed to executors, causing future tasks to be 
> imbalanced across executors as well.
> *Reduce Locality*
> {code}spark.shuffle.reduceLocality.enabled=false/true{code}
> *"Legacy" memory mode*
> {code}spark.memory.useLegacyMode = true/false{code}
> *Basic load and repartition*
> {code}
> val numPartitions = 48*16
> val df = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions).
> persist
> df.count
> {code}
> *Load and repartition to 2x partitions, then shuffle repartition down to 
> desired partitions*
> {code}
> val numPartitions = 48*16
> val df2 = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions*2)
> val df = df2.repartition(numPartitions).
> persist
> df.count
> {code}
> It would be great if when persisting an RDD/DataFrame, if we could request 
> that those partitions be stored evenly across executors in preparation for 
> future tasks. 
> I'm not sure if this is a more general issue (I.E. not just involving 
> persisting RDDs), but for the persisted in-memory case, it can make a HUGE 
> difference in the over-all running time of the remaining work.






[jira] [Updated] (SPARK-19371) Cannot spread cached partitions evenly across executors

2017-11-14 Thread Thunder Stumpges (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thunder Stumpges updated SPARK-19371:
-
Attachment: RDD Block Distribution on two executors.png

> Cannot spread cached partitions evenly across executors
> ---
>
> Key: SPARK-19371
> URL: https://issues.apache.org/jira/browse/SPARK-19371
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Thunder Stumpges
> Attachments: RDD Block Distribution on two executors.png, Unbalanced 
> RDD Blocks, and resulting task imbalance.png, Unbalanced RDD Blocks, and 
> resulting task imbalance.png, execution timeline.png
>
>
> Before running an intensive iterative job (in this case a distributed topic 
> model training), we need to load a dataset and persist it across executors. 
> After loading from HDFS and persisting, the partitions are spread unevenly 
> across executors (based on the initial scheduling of the reads which are not 
> data locale sensitive). The partition sizes are even, just not their 
> distribution over executors. We currently have no way to force the partitions 
> to spread evenly, and as the iterative algorithm begins, tasks are 
> distributed to executors based on this initial load, forcing some very 
> unbalanced work.
> This has been mentioned a 
> [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059]
>  of 
> [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html]
>  in 
> [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html]
>  user/dev group threads.
> None of the discussions I could find had solutions that worked for me. Here 
> are examples of things I have tried. All resulted in partitions in memory 
> that were NOT evenly distributed to executors, causing future tasks to be 
> imbalanced across executors as well.
> *Reduce Locality*
> {code}spark.shuffle.reduceLocality.enabled=false/true{code}
> *"Legacy" memory mode*
> {code}spark.memory.useLegacyMode = true/false{code}
> *Basic load and repartition*
> {code}
> val numPartitions = 48*16
> val df = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions).
> persist
> df.count
> {code}
> *Load and repartition to 2x partitions, then shuffle repartition down to 
> desired partitions*
> {code}
> val numPartitions = 48*16
> val df2 = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions*2)
> val df = df2.repartition(numPartitions).
> persist
> df.count
> {code}
> It would be great if when persisting an RDD/DataFrame, if we could request 
> that those partitions be stored evenly across executors in preparation for 
> future tasks. 
> I'm not sure if this is a more general issue (I.E. not just involving 
> persisting RDDs), but for the persisted in-memory case, it can make a HUGE 
> difference in the over-all running time of the remaining work.






[jira] [Commented] (SPARK-19371) Cannot spread cached partitions evenly across executors

2017-11-14 Thread Thunder Stumpges (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252953#comment-16252953
 ] 

Thunder Stumpges commented on SPARK-19371:
--

In my case, it is important because that cached RDD is partitioned carefully, 
and used to join to a streaming dataset every 10 seconds to perform processing. 
The stream RDD is shuffled to align/join with this data-set, and work is then 
done. This runs for days or months at a time. If I could just once at the 
beginning move the RDD blocks so they're balanced, the benefit of that one time 
move (all in memory) would pay off many times over. 

Setting locality.wait=0 only causes this same network traffic to be ongoing (a 
task starting on one executor and its RDD block being on another executor), 
correct? BTW, I have tried my job with spark.locality.wait=0, but it seems to 
have limited effect. Tasks are still being executed on the executors with the 
RDD blocks. Only two tasks did NOT run as PROCESS_LOCAL, and they took many 
times longer to execute:

!execution timeline.png!

It all seems very clear in my head, am I missing something? Is this a really 
complicated thing to do? 

Basically the operation would be a "rebalance" that applies to cached RDDs and 
would attempt to balance the distribution of blocks across executors. 
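
(For illustration only - a minimal sketch, not from the ticket, of the locality.wait 
workaround being discussed, assuming a Spark 1.6-style SparkContext setup; the app 
name is a placeholder.)

{code}
// Minimal sketch of the locality workaround discussed above; "locality-test" is a placeholder.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("locality-test")
  .set("spark.locality.wait", "0")                       // schedule tasks immediately, ignoring where RDD blocks live
  .set("spark.shuffle.reduceLocality.enabled", "false")  // another knob tried in the report below

val sc = new SparkContext(conf)
{code}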

> Cannot spread cached partitions evenly across executors
> ---
>
> Key: SPARK-19371
> URL: https://issues.apache.org/jira/browse/SPARK-19371
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Thunder Stumpges
> Attachments: Unbalanced RDD Blocks, and resulting task imbalance.png, 
> Unbalanced RDD Blocks, and resulting task imbalance.png, execution 
> timeline.png
>
>
> Before running an intensive iterative job (in this case a distributed topic 
> model training), we need to load a dataset and persist it across executors. 
> After loading from HDFS and persisting, the partitions are spread unevenly 
> across executors (based on the initial scheduling of the reads which are not 
> data locale sensitive). The partition sizes are even, just not their 
> distribution over executors. We currently have no way to force the partitions 
> to spread evenly, and as the iterative algorithm begins, tasks are 
> distributed to executors based on this initial load, forcing some very 
> unbalanced work.
> This has been mentioned a 
> [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059]
>  of 
> [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html]
>  in 
> [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html]
>  user/dev group threads.
> None of the discussions I could find had solutions that worked for me. Here 
> are examples of things I have tried. All resulted in partitions in memory 
> that were NOT evenly distributed to executors, causing future tasks to be 
> imbalanced across executors as well.
> *Reduce Locality*
> {code}spark.shuffle.reduceLocality.enabled=false/true{code}
> *"Legacy" memory mode*
> {code}spark.memory.useLegacyMode = true/false{code}
> *Basic load and repartition*
> {code}
> val numPartitions = 48*16
> val df = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions).
> persist
> df.count
> {code}
> *Load and repartition to 2x partitions, then shuffle repartition down to 
> desired partitions*
> {code}
> val numPartitions = 48*16
> val df2 = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions*2)
> val df = df2.repartition(numPartitions).
> persist
> df.count
> {code}
> It would be great if when persisting an RDD/DataFrame, if we could request 
> that those partitions be stored evenly across executors in preparation for 
> future tasks. 
> I'm not sure if this is a more general issue (I.E. not just involving 
> persisting RDDs), but for the persisted in-memory case, it can make a HUGE 
> difference in the over-all running time of the remaining work.






[jira] [Updated] (SPARK-19371) Cannot spread cached partitions evenly across executors

2017-11-14 Thread Thunder Stumpges (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thunder Stumpges updated SPARK-19371:
-
Attachment: execution timeline.png

> Cannot spread cached partitions evenly across executors
> ---
>
> Key: SPARK-19371
> URL: https://issues.apache.org/jira/browse/SPARK-19371
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Thunder Stumpges
> Attachments: Unbalanced RDD Blocks, and resulting task imbalance.png, 
> Unbalanced RDD Blocks, and resulting task imbalance.png, execution 
> timeline.png
>
>
> Before running an intensive iterative job (in this case a distributed topic 
> model training), we need to load a dataset and persist it across executors. 
> After loading from HDFS and persisting, the partitions are spread unevenly 
> across executors (based on the initial scheduling of the reads which are not 
> data locale sensitive). The partition sizes are even, just not their 
> distribution over executors. We currently have no way to force the partitions 
> to spread evenly, and as the iterative algorithm begins, tasks are 
> distributed to executors based on this initial load, forcing some very 
> unbalanced work.
> This has been mentioned a 
> [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059]
>  of 
> [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html]
>  in 
> [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html]
>  user/dev group threads.
> None of the discussions I could find had solutions that worked for me. Here 
> are examples of things I have tried. All resulted in partitions in memory 
> that were NOT evenly distributed to executors, causing future tasks to be 
> imbalanced across executors as well.
> *Reduce Locality*
> {code}spark.shuffle.reduceLocality.enabled=false/true{code}
> *"Legacy" memory mode*
> {code}spark.memory.useLegacyMode = true/false{code}
> *Basic load and repartition*
> {code}
> val numPartitions = 48*16
> val df = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions).
> persist
> df.count
> {code}
> *Load and repartition to 2x partitions, then shuffle repartition down to 
> desired partitions*
> {code}
> val numPartitions = 48*16
> val df2 = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions*2)
> val df = df2.repartition(numPartitions).
> persist
> df.count
> {code}
> It would be great if when persisting an RDD/DataFrame, if we could request 
> that those partitions be stored evenly across executors in preparation for 
> future tasks. 
> I'm not sure if this is a more general issue (I.E. not just involving 
> persisting RDDs), but for the persisted in-memory case, it can make a HUGE 
> difference in the over-all running time of the remaining work.






[jira] [Commented] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide

2017-11-14 Thread sw (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252942#comment-16252942
 ] 

sw commented on SPARK-20937:


+100. The data has already been written - so how can I fix it?
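
(A hedged sketch, not an answer from the ticket: data written without the flag 
generally has to be rewritten with it enabled before Impala/Hive can read it. The 
paths are placeholders and an existing SparkSession {{spark}} is assumed.)

{code}
// Hedged sketch; paths are placeholders and `spark` is an existing SparkSession.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

spark.read.parquet("/data/written_by_spark")
  .write
  .mode("overwrite")
  .parquet("/data/rewritten_for_impala")
{code}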

> Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, 
> DataFrames and Datasets Guide
> -
>
> Key: SPARK-20937
> URL: https://issues.apache.org/jira/browse/SPARK-20937
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> As a follow-up to SPARK-20297 (and SPARK-10400) in which 
> {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala 
> and Hive, Spark SQL docs for [Parquet 
> Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration]
>  should have it documented.
> p.s. It was asked about in [Why can't Impala read parquet files after Spark 
> SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow 
> today.
> p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance 
> Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table 
> 3-10. Parquet data source options) that gives the option some wider publicity.






[jira] [Commented] (SPARK-22522) Convert to apache-release to publish Maven artifacts to Nexus/repository.apache.org

2017-11-14 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252937#comment-16252937
 ] 

Felix Cheung commented on SPARK-22522:
--

Also, repository.apache.org has been the same place we stage releases (RCs) for 
Maven Central.

I think it gets automatically synced to repo.maven.apache.org and then to 
repo1.maven.org when we make a release.


> Convert to apache-release to publish Maven artifacts to 
> Nexus/repository.apache.org
> ---
>
> Key: SPARK-22522
> URL: https://issues.apache.org/jira/browse/SPARK-22522
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>
> see http://www.apache.org/dev/publishing-maven-artifacts.html
> ...at the very least we need to revisit all the calls to curl (and/or gpg) in 
> the release-build.sh for the publish-release path - seems like some errors 
> are ignored (running into that myself) and it would be very easy to miss 
> publishing one or more or all files.






[jira] [Updated] (SPARK-22522) Convert to apache-release to publish Maven artifacts to Nexus/repository.apache.org

2017-11-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22522:
-
Description: 
see http://www.apache.org/dev/publishing-maven-artifacts.html

...at the very least we need to revisit all the calls to curl (and/or gpg) in 
the release-build.sh for the publish-release path - seems like some errors are 
ignored (running into that myself) and it would be very easy to miss publishing 
one or more or all files.


  was:
see http://www.apache.org/dev/publishing-maven-artifacts.html

...at the very least we need to revisit all the calls to curl (and/or gpg) in 
the release-build.sh for the publish-release path - seems like some errors are 
ignored (running into that myself) and it would be very easy to miss publishing 
one or more files.



> Convert to apache-release to publish Maven artifacts to 
> Nexus/repository.apache.org
> ---
>
> Key: SPARK-22522
> URL: https://issues.apache.org/jira/browse/SPARK-22522
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>
> see http://www.apache.org/dev/publishing-maven-artifacts.html
> ...at the very least we need to revisit all the calls to curl (and/or gpg) in 
> the release-build.sh for the publish-release path - seems like some errors 
> are ignored (running into that myself) and it would be very easy to miss 
> publishing one or more or all files.






[jira] [Updated] (SPARK-22522) Convert to apache-release to publish Maven artifacts to Nexus/repository.apache.org

2017-11-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22522:
-
Description: 
see http://www.apache.org/dev/publishing-maven-artifacts.html

...at the very least we need to revisit all the calls to curl in the 
release-build.sh for the publish-release path - seems like some errors are 
ignored (running into that myself) and it would be very easy to miss publishing 
one or more files.


  was:see http://www.apache.org/dev/publishing-maven-artifacts.html


> Convert to apache-release to publish Maven artifacts to 
> Nexus/repository.apache.org
> ---
>
> Key: SPARK-22522
> URL: https://issues.apache.org/jira/browse/SPARK-22522
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>
> see http://www.apache.org/dev/publishing-maven-artifacts.html
> ...at the very least we need to revisit all the calls to curl in the 
> release-build.sh for the publish-release path - seems like some errors are 
> ignored (running into that myself) and it would be very easy to miss 
> publishing one or more files.






[jira] [Updated] (SPARK-22522) Convert to apache-release to publish Maven artifacts to Nexus/repository.apache.org

2017-11-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22522:
-
Description: 
see http://www.apache.org/dev/publishing-maven-artifacts.html

...at the very least we need to revisit all the calls to curl (and/or gpg) in 
the release-build.sh for the publish-release path - seems like some errors are 
ignored (running into that myself) and it would be very easy to miss publishing 
one or more files.


  was:
see http://www.apache.org/dev/publishing-maven-artifacts.html

...at the very least we need to revisit all the calls to curl in the 
release-build.sh for the publish-release path - seems like some errors are 
ignored (running into that myself) and it would be very easy to miss publishing 
one or more files.



> Convert to apache-release to publish Maven artifacts to 
> Nexus/repository.apache.org
> ---
>
> Key: SPARK-22522
> URL: https://issues.apache.org/jira/browse/SPARK-22522
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>
> see http://www.apache.org/dev/publishing-maven-artifacts.html
> ...at the very least we need to revisit all the calls to curl (and/or gpg) in 
> the release-build.sh for the publish-release path - seems like some errors 
> are ignored (running into that myself) and it would be very easy to miss 
> publishing one or more files.






[jira] [Commented] (SPARK-22522) Convert to apache-release to publish Maven artifacts to Nexus/repository.apache.org

2017-11-14 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252934#comment-16252934
 ] 

Felix Cheung commented on SPARK-22522:
--

The profile and associated config are inherited from the parent Apache POM - I 
haven't looked through the specifics though, but I don't think 
repository.apache.org needs to be listed in the repositories section of our POM - 
it's a destination to deploy/publish to, not to pull from.


> Convert to apache-release to publish Maven artifacts to 
> Nexus/repository.apache.org
> ---
>
> Key: SPARK-22522
> URL: https://issues.apache.org/jira/browse/SPARK-22522
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>
> see http://www.apache.org/dev/publishing-maven-artifacts.html






[jira] [Updated] (SPARK-22523) Janino throws StackOverflowError on nested structs with many fields

2017-11-14 Thread Utku Demir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Utku Demir updated SPARK-22523:
---
Description: 
When running the below application, Janino throws StackOverflow:

{code}
Exception in thread "main" java.lang.StackOverflowError
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370)
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
{code}

Problematic code:

{code:title=Example.scala|borderStyle=solid}
import org.apache.spark.sql._

case class Foo(
  f1: Int = 0,
  f2: Int = 0,
  f3: Int = 0,
  f4: Int = 0,
  f5: Int = 0,
  f6: Int = 0,
  f7: Int = 0,
  f8: Int = 0,
  f9: Int = 0,
  f10: Int = 0,
  f11: Int = 0,
  f12: Int = 0,
  f13: Int = 0,
  f14: Int = 0,
  f15: Int = 0,
  f16: Int = 0,
  f17: Int = 0,
  f18: Int = 0,
  f19: Int = 0,
  f20: Int = 0,
  f21: Int = 0,
  f22: Int = 0,
  f23: Int = 0,
  f24: Int = 0
)

case class Nest[T](
  a: T,
  b: T
)

object Nest {
  def apply[T](t: T): Nest[T] = new Nest(t, t)
}

object Main {
  def main(args: Array[String]) {
val spark: SparkSession = 
SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._

val foo = Foo(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0)

Seq.fill(10)(Nest(Nest(foo))).toDS.groupByKey(identity).count.map(s => 
s).collect
  }
}
{code}



  was:
When run the below application, Janino throws StackOverflow:

{code}
Exception in thread "main" java.lang.StackOverflowError
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370)
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
{code}

Problematic code:

{code:title=Example.scala|borderStyle=solid}
import org.apache.spark.sql._

case class Foo(
  f1: Int = 0,
  f2: Int = 0,
  f3: Int = 0,
  f4: Int = 0,
  f5: Int = 0,
  f6: Int = 0,
  f7: Int = 0,
  f8: Int = 0,
  f9: Int = 0,
  f10: Int = 0,
  f11: Int = 0,
  f12: Int = 0,
  f13: Int = 0,
  f14: Int = 0,
  f15: Int = 0,
  f16: Int = 0,
  f17: Int = 0,
  f18: Int = 0,
  f19: Int = 0,
  f20: Int = 0,
  f21: Int = 0,
  f22: Int = 0,
  f23: Int = 0,
  f24: Int = 0
)

case class Nest[T](
  a: T,
  b: T
)

object Nest {
  def apply[T](t: T): Nest[T] = new Nest(t, t)
}

object Main {
  def main(args: Array[String]) {
val spark: SparkSession = 
SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._

val foo = Foo(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0)

Seq.fill(10)(Nest(Nest(foo))).toDS.groupByKey(identity).count.map(s => 
s).collect
  }
}
{code}




> Janino throws StackOverflowError on nested structs with many fields
> ---
>
> Key: SPARK-22523
> URL: https://issues.apache.org/jira/browse/SPARK-22523
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: * Linux
> * Scala: 2.11.8
> * Spark: 2.2.0
>Reporter: Utku Demir
>Priority: Minor
>
> When running the below application, Janino throws StackOverflow:
> {code}
> Exception in thread "main" java.lang.StackOverflowError
>   at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370)
>   at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
>   at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
>   at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
>   at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
> {code}
> Problematic code:
> {code:title=Example.scala|borderStyle=solid}
> import org.apache.spark.sql._
> case class Foo(
>   f1: Int = 0,
>   f2: Int = 0,
>   f3: Int = 0,
>   f4: Int = 0,
>   f5: Int = 0,
>   f6: Int = 0,
>   f7: Int = 0,
>   f8: Int = 0,
>   f9: Int = 0,
>   f10: Int = 0,
>   f11: Int = 0,
>   f12: Int = 0,
>   f13: Int = 0,
>   f14: Int = 0,
>   f15: Int = 0,
>   f16: Int = 0,
>   f17: Int = 0,
>   f18: Int = 0,
>   f19: Int = 0,
>   f20: Int = 0,
>   f21: Int = 0,
>   f22: Int = 0,
>   f23: Int = 0,
>   f24: Int = 0
> )
> case class Nest[T](
>   a: T,
>   b: T
> )
> object Nest {
>   def apply[T](t: T): Nest[T] = new Nest(t, t)
> }
> object Main {
>   def main(args: Array[String]) {
> val spark: SparkSession = 
> 

[jira] [Created] (SPARK-22523) Janino throws StackOverflowError on nested structs with many fields

2017-11-14 Thread Utku Demir (JIRA)
Utku Demir created SPARK-22523:
--

 Summary: Janino throws StackOverflowError on nested structs with 
many fields
 Key: SPARK-22523
 URL: https://issues.apache.org/jira/browse/SPARK-22523
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.2.0
 Environment: * Linux
* Scala: 2.11.8
* Spark: 2.2.0
Reporter: Utku Demir
Priority: Minor


When run the below application, Janino throws StackOverflow:

{code}
Exception in thread "main" java.lang.StackOverflowError
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370)
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
{code}

Problematic code:

{code:title=Example.scala|borderStyle=solid}
import org.apache.spark.sql._

case class Foo(
  f1: Int = 0,
  f2: Int = 0,
  f3: Int = 0,
  f4: Int = 0,
  f5: Int = 0,
  f6: Int = 0,
  f7: Int = 0,
  f8: Int = 0,
  f9: Int = 0,
  f10: Int = 0,
  f11: Int = 0,
  f12: Int = 0,
  f13: Int = 0,
  f14: Int = 0,
  f15: Int = 0,
  f16: Int = 0,
  f17: Int = 0,
  f18: Int = 0,
  f19: Int = 0,
  f20: Int = 0,
  f21: Int = 0,
  f22: Int = 0,
  f23: Int = 0,
  f24: Int = 0
)

case class Nest[T](
  a: T,
  b: T
)

object Nest {
  def apply[T](t: T): Nest[T] = new Nest(t, t)
}

object Main {
  def main(args: Array[String]) {
val spark: SparkSession = 
SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._

val foo = Foo(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0)

Seq.fill(10)(Nest(Nest(foo))).toDS.groupByKey(identity).count.map(s => 
s).collect
  }
}
{code}








[jira] [Commented] (SPARK-22522) Convert to apache-release to publish Maven artifacts to Nexus/repository.apache.org

2017-11-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252864#comment-16252864
 ] 

Sean Owen commented on SPARK-22522:
---

Is that repo active in Maven and SBT by default? Just wondering whether publishing 
there instead would require a repo configuration to see the artifacts.

> Convert to apache-release to publish Maven artifacts to 
> Nexus/repository.apache.org
> ---
>
> Key: SPARK-22522
> URL: https://issues.apache.org/jira/browse/SPARK-22522
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>
> see http://www.apache.org/dev/publishing-maven-artifacts.html






[jira] [Created] (SPARK-22522) Convert to apache-release to publish Maven artifacts to Nexus/repository.apache.org

2017-11-14 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-22522:


 Summary: Convert to apache-release to publish Maven artifacts to 
Nexus/repository.apache.org
 Key: SPARK-22522
 URL: https://issues.apache.org/jira/browse/SPARK-22522
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.3.0
Reporter: Felix Cheung
Priority: Minor


see http://www.apache.org/dev/publishing-maven-artifacts.html






[jira] [Created] (SPARK-22521) VectorIndexerModel support handle unseen categories via handleInvalid: Python API

2017-11-14 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-22521:
--

 Summary: VectorIndexerModel support handle unseen categories via 
handleInvalid: Python API
 Key: SPARK-22521
 URL: https://issues.apache.org/jira/browse/SPARK-22521
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: Weichen Xu


VectorIndexerModel support handle unseen categories via handleInvalid: Python 
API






[jira] [Resolved] (SPARK-13846) VectorIndexer output on unknown feature should be more descriptive

2017-11-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-13846.
---
   Resolution: Fixed
 Assignee: Weichen Xu
Fix Version/s: 2.3.0

> VectorIndexer output on unknown feature should be more descriptive
> --
>
> Key: SPARK-13846
> URL: https://issues.apache.org/jira/browse/SPARK-13846
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.1
>Reporter: Dmitry Spikhalskiy
>Assignee: Weichen Xu
>Priority: Minor
> Fix For: 2.3.0
>
>
> I got the exception below, and it looks like it's related to an unknown 
> categorical variable value being passed to indexing.
> java.util.NoSuchElementException: key not found: 20.0
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.MapLike$class.apply(MapLike.scala:141)
> at scala.collection.AbstractMap.apply(Map.scala:58)
> at 
> org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:316)
> at 
> org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:315)
> at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
> at 
> org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:315)
> at 
> org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:309)
> at 
> org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:351)
> at 
> org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:351)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr2$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
> at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> VectorIndexer created like
> val featureIndexer = new VectorIndexer()
> .setInputCol(DataFrameColumns.FEATURES)
> .setOutputCol("indexedFeatures")
> .setMaxCategories(5)
> .fit(trainingDF)
> The output should be not just the default java.util.NoSuchElementException, but 
> something specific like UnknownCategoricalValue, with information that could 
> help find the source element of the vector (the element index in the vector, maybe).






[jira] [Commented] (SPARK-11373) Add metrics to the History Server and providers

2017-11-14 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252776#comment-16252776
 ] 

Nick Dimiduk commented on SPARK-11373:
--

I'm chasing a goose through the wild and have found my way here. It seems Spark 
has two independent subsystems for recording runtime information: 
history/SparkListener and Metrics. I'm startled to find a whole wealth of 
information exposed during job runtime over http/json via 
{{api/v1/applications}}, yet none of this is available to the Metrics systems 
configured with the metrics.properties file. Lovely details like the number of 
input, output, and shuffle records per task are unavailable to my Grafana 
dashboards fed by the Ganglia reporter.

Is it an objective of this ticket to report such information through Metrics? 
Is there a separate ticket tracking such an effort? Is it a "simple" matter of 
implementing a {{SparkListener}} that bridges to Metrics?
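
(For illustration only - a rough sketch of that bridge idea, assuming the Spark 2.x 
listener/TaskMetrics API and a Dropwizard MetricRegistry; the class name, metric 
names and registry wiring are made up.)

{code}
// Rough sketch of a SparkListener that copies per-task record counts into a
// Dropwizard MetricRegistry, which a Ganglia/Graphite reporter could then publish.
import com.codahale.metrics.MetricRegistry
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class TaskMetricsBridge(registry: MetricRegistry) extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {   // metrics can be missing for failed tasks
      registry.counter("tasks.input.recordsRead").inc(m.inputMetrics.recordsRead)
      registry.counter("tasks.output.recordsWritten").inc(m.outputMetrics.recordsWritten)
      registry.counter("tasks.shuffle.recordsRead").inc(m.shuffleReadMetrics.recordsRead)
    }
  }
}

// Usage, assuming an existing SparkContext `sc` and a registry already wired to a reporter:
// sc.addSparkListener(new TaskMetricsBridge(registry))
{code}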

> Add metrics to the History Server and providers
> ---
>
> Key: SPARK-11373
> URL: https://issues.apache.org/jira/browse/SPARK-11373
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Steve Loughran
>
> The History server doesn't publish metrics about JVM load or anything from 
> the history provider plugins. This means that performance problems from 
> massive job histories aren't visible to management tools, and nor are any 
> provider-generated metrics such as time to load histories, failed history 
> loads, the number of connectivity failures talking to remote services, etc.
> If the history server set up a metrics registry and offered the option to 
> publish its metrics, then management tools could view this data.
> # the metrics registry would need to be passed down to the instantiated 
> {{ApplicationHistoryProvider}}, in order for it to register its metrics.
> # if the codahale metrics servlet were registered under a path such as 
> {{/metrics}}, the values would be visible as HTML and JSON, without the need 
> for management tools.
> # Integration tests could also retrieve the JSON-formatted data and use it as 
> part of the test suites.






[jira] [Commented] (SPARK-13846) VectorIndexer output on unknown feature should be more descriptive

2017-11-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252775#comment-16252775
 ] 

Joseph K. Bradley commented on SPARK-13846:
---

Linking the JIRA for the task which solves this issue. Thanks for reporting this!

> VectorIndexer output on unknown feature should be more descriptive
> --
>
> Key: SPARK-13846
> URL: https://issues.apache.org/jira/browse/SPARK-13846
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.1
>Reporter: Dmitry Spikhalskiy
>Priority: Minor
>
> I got the exception below, and it looks like it's related to an unknown 
> categorical variable value being passed to indexing.
> java.util.NoSuchElementException: key not found: 20.0
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.MapLike$class.apply(MapLike.scala:141)
> at scala.collection.AbstractMap.apply(Map.scala:58)
> at 
> org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:316)
> at 
> org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:315)
> at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
> at 
> org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:315)
> at 
> org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:309)
> at 
> org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:351)
> at 
> org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:351)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr2$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
> at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> VectorIndexer created like
> val featureIndexer = new VectorIndexer()
> .setInputCol(DataFrameColumns.FEATURES)
> .setOutputCol("indexedFeatures")
> .setMaxCategories(5)
> .fit(trainingDF)
> The output should be not just the default java.util.NoSuchElementException, but 
> something specific like UnknownCategoricalValue, with information that could 
> help find the source element of the vector (the element index in the vector, maybe).






[jira] [Resolved] (SPARK-12375) VectorIndexer: allow unknown categories

2017-11-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-12375.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19588
[https://github.com/apache/spark/pull/19588]

> VectorIndexer: allow unknown categories
> ---
>
> Key: SPARK-12375
> URL: https://issues.apache.org/jira/browse/SPARK-12375
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
> Fix For: 2.3.0
>
>
> Add option for allowing unknown categories, probably via a parameter like 
> "allowUnknownCategories."
> If true, then handle unknown categories during transform by assigning them to 
> an extra category index.
> The API should resemble the API used for StringIndexer.
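
(A hedged sketch of the API as merged for Spark 2.3 by the pull request above; the 
Scala API ended up using setHandleInvalid with "error"/"skip"/"keep" rather than the 
"allowUnknownCategories" name proposed here. Column names and DataFrames are 
placeholders.)

{code}
// Hedged sketch; trainingDF/testDF and the column names are placeholders.
import org.apache.spark.ml.feature.VectorIndexer

val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5)
  .setHandleInvalid("keep")   // unseen categories map to an extra category index

val model = indexer.fit(trainingDF)
val indexed = model.transform(testDF)   // no longer fails on unseen category values
{code}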






[jira] [Resolved] (SPARK-21087) CrossValidator, TrainValidationSplit should collect all models when fitting: Scala API

2017-11-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21087.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19208
[https://github.com/apache/spark/pull/19208]

> CrossValidator, TrainValidationSplit should collect all models when fitting: 
> Scala API
> --
>
> Key: SPARK-21087
> URL: https://issues.apache.org/jira/browse/SPARK-21087
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
> Fix For: 2.3.0
>
>
> Add a parameter controlling whether to collect the full model list when 
> CrossValidator/TrainValidationSplit is fitting (off by default, so the change 
> does not cause OOM).
> Add a method in CrossValidatorModel/TrainValidationSplitModel that allows the 
> user to get the model list.
> CrossValidatorModelWriter adds an "option" that allows the user to control 
> whether to persist the model list to disk.
> Note: when persisting the model list, use indices as the sub-model path.
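A minimal sketch of how this could be used once it lands; the names setCollectSubModels and subModels are assumptions based on this description, and estimator, evaluator, paramGrid and trainingDF are placeholders:

{code}
import org.apache.spark.ml.tuning.CrossValidator

// Sketch only: collect one fitted model per (fold, param map) pair.
val cv = new CrossValidator()
  .setEstimator(estimator)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
  .setCollectSubModels(true)  // assumed parameter name; off by default to avoid OOM

val cvModel = cv.fit(trainingDF)
val subModels = cvModel.subModels  // assumed accessor for the collected models
{code}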



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22511) Update maven central repo address

2017-11-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22511:
-

Assignee: Sean Owen

> Update maven central repo address
> -
>
> Key: SPARK-22511
> URL: https://issues.apache.org/jira/browse/SPARK-22511
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1, 2.3.0, 2.2.2
>Reporter: Felix Cheung
>Assignee: Sean Owen
> Fix For: 2.3.0, 2.2.2
>
>
> As a part of building 2.2.1, we hit an issue with Sonatype:
> https://issues.sonatype.org/browse/MVNCENTRAL-2870
> To work around it, we switched the address to repo.maven.apache.org in branch-2.2.
> We should decide whether to keep that or revert after 2.2.1 is released.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22511) Update maven central repo address

2017-11-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22511.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.2

Issue resolved by pull request 19742
[https://github.com/apache/spark/pull/19742]

> Update maven central repo address
> -
>
> Key: SPARK-22511
> URL: https://issues.apache.org/jira/browse/SPARK-22511
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1, 2.3.0, 2.2.2
>Reporter: Felix Cheung
> Fix For: 2.2.2, 2.3.0
>
>
> As a part of building 2.2.1, we hit an issue with Sonatype:
> https://issues.sonatype.org/browse/MVNCENTRAL-2870
> To work around it, we switched the address to repo.maven.apache.org in branch-2.2.
> We should decide whether to keep that or revert after 2.2.1 is released.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22519) Remove unnecessary stagingDirPath null check in ApplicationMaster.cleanupStagingDir()

2017-11-14 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22519:
--

Assignee: Devaraj K

> Remove unnecessary stagingDirPath null check in 
> ApplicationMaster.cleanupStagingDir()
> -
>
> Key: SPARK-22519
> URL: https://issues.apache.org/jira/browse/SPARK-22519
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>Assignee: Devaraj K
>Priority: Trivial
> Fix For: 2.3.0
>
>
> In the code below, the condition checks whether stagingDirPath is null, but 
> stagingDirPath never becomes null. If the SPARK_YARN_STAGING_DIR env var is 
> null, an NPE is thrown while creating the Path instead.
> {code:title=ApplicationMaster.scala|borderStyle=solid}
> stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
> if (stagingDirPath == null) {
>   logError("Staging directory is null")
>   return
> }
> {code}
> Here we need to check whether System.getenv("SPARK_YARN_STAGING_DIR") is 
> null, not the stagingDirPath.
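A small sketch of the suggested fix (an illustration following the snippet above, not the actual patch):

{code}
// Check the env var itself before constructing the Path.
val stagingDirEnv = System.getenv("SPARK_YARN_STAGING_DIR")
if (stagingDirEnv == null) {
  logError("Staging directory is null")
  return
}
stagingDirPath = new Path(stagingDirEnv)
{code}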



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22519) Remove unnecessary stagingDirPath null check in ApplicationMaster.cleanupStagingDir()

2017-11-14 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22519.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19749
[https://github.com/apache/spark/pull/19749]

> Remove unnecessary stagingDirPath null check in 
> ApplicationMaster.cleanupStagingDir()
> -
>
> Key: SPARK-22519
> URL: https://issues.apache.org/jira/browse/SPARK-22519
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>Priority: Trivial
> Fix For: 2.3.0
>
>
> In the code below, the condition checks whether stagingDirPath is null, but 
> stagingDirPath never becomes null. If the SPARK_YARN_STAGING_DIR env var is 
> null, an NPE is thrown while creating the Path instead.
> {code:title=ApplicationMaster.scala|borderStyle=solid}
> stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
> if (stagingDirPath == null) {
>   logError("Staging directory is null")
>   return
> }
> {code}
> Here we need to check whether System.getenv("SPARK_YARN_STAGING_DIR") is 
> null, not the stagingDirPath.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2017-11-14 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252611#comment-16252611
 ] 

DB Tsai edited comment on SPARK-22231 at 11/14/17 10:45 PM:


Thanks [~jeremyrsmith] for adding more details. There are a couple of technical 
challenges, as [~jeremyrsmith] pointed out; we can agree on what the APIs should 
look like in this JIRA first, and then address those challenges in PRs. 

# [~rxin], with {{dropColumn}} support in {{Column}}, we can have some shared 
building blocks on {{Column}}, like 
{code}
def computeNewFeature: Column => Column = (col: Column) => {
  col.withColumn(col("feature") * 2 as "newFeature").dropColumn("feature")
}
{code} I agree that this can be achieved on a dataframe with {{drop}} as well, 
but sometimes it's more convenient to work with the {{Column}} API. Plus, as 
[~nkronenfeld] pointed out, it's nice to have {{Column}} and {{Dataset}} 
share some signatures. 
# In our internal implementation, we use {{Column.withField}} instead of 
{{Column.withColumn}}, as in the discussion above. Since I prefer to have both 
{{Column}} and {{Dataset}} share the same method names, in the example I wrote 
in this JIRA I use {{Column.withColumn}}. But I agree that 
{{Column.withField}} is less confusing. Any more feedback on this?



was (Author: dbtsai):
Thanks [~jeremyrsmith] for adding more details. There are a couple of technical 
challenges, as [~jeremyrsmith] pointed out; we can agree on what the APIs should 
look like in this JIRA first, and then address those challenges in PRs. 

# [~rxin], with {{dropColumn}} support in {{Column}}, we can have some shared 
building blocks on {{Column}}, like 
{code}
def computeNewFeature: Column => Column = (col: Column) => {
  col.withColumn(col("feature") * 2 as "newFeature").dropColumn("feature")
}
{code} I agree that this can be achieved on a dataframe with {{drop}} as well, 
but sometimes it's more convenient to work with the {{Column}} API. Plus, as 
[~nkronenfeld] pointed out, it's nice to have {{Column}} and {{Dataset}} 
share some signatures. 
# In our internal implementation, we use {{Column.withField}} instead of 
{{Column.withColumn}}. Since I prefer to have both {{Column}} and {{Dataset}} 
share the same method names, in the example I wrote above I use 
{{Column.withColumn}}. But I agree that {{Column.withField}} is less confusing. 
Any more feedback on this?


> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building 
> recommendation algorithms, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id at a 
> given country, our data schema can look like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some 
> functions on them. Sometimes, we're dropping or adding new columns in the 
> nested list of structs. Currently, there is no easy solution in open source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into RDD to work on the nested level of data, and then 
> reconstruct the new dataframe as a workaround. This is extremely inefficient 
> because all the optimizations like predicate pushdown in SQL cannot be 
> performed, we cannot leverage the columnar format, and the serialization 
> and deserialization cost becomes really huge even when we just want to add a 
> new column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to make it open source in Spark upstream. We would 

[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2017-11-14 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252611#comment-16252611
 ] 

DB Tsai commented on SPARK-22231:
-

Thanks [~jeremyrsmith] for adding more details. There are a couple of technical 
challenges, as [~jeremyrsmith] pointed out; we can agree on what the APIs should 
look like in this JIRA first, and then address those challenges in PRs. 

# [~rxin], with {{dropColumn}} support in {{Column}}, we can have some shared 
building blocks on {{Column}}, like 
{code}
def computeNewFeature: Column => Column = (col: Column) => {
  col.withColumn(col("feature") * 2 as "newFeature").dropColumn("feature")
}
{code} I agree that this can be achieved on a dataframe with {{drop}} as well, 
but sometimes it's more convenient to work with the {{Column}} API. Plus, as 
[~nkronenfeld] pointed out, it's nice to have {{Column}} and {{Dataset}} 
share some signatures. 
# In our internal implementation, we use {{Column.withField}} instead of 
{{Column.withColumn}}. Since I prefer to have both {{Column}} and {{Dataset}} 
share the same method names, in the example I wrote above I use 
{{Column.withColumn}}. But I agree that {{Column.withField}} is less confusing. 
Any more feedback on this?
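For context, a minimal sketch of what this takes today without the proposed API, for a plain (non-array) struct column; df and the "item" column here are illustrative only:

{code}
import org.apache.spark.sql.functions._

// Rebuild the whole struct just to derive one field -- the boilerplate the
// proposed Column.withField / dropColumn is meant to remove.
val rewritten = df.withColumn("item",
  struct(
    col("item.title_id").as("title_id"),
    (col("item.scores") * 2).as("newFeature")))
{code}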


> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building 
> recommendation algorithms, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id at a 
> given country, our data schema can look like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some 
> functions on them. Sometimes, we're dropping or adding new columns in the 
> nested list of structs. Currently, there is no easy solution in open source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into RDD to work on the nested level of data, and then 
> reconstruct the new dataframe as a workaround. This is extremely inefficient 
> because all the optimizations like predicate pushdown in SQL cannot be 
> performed, we cannot leverage the columnar format, and the serialization 
> and deserialization cost becomes really huge even when we just want to add a 
> new column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to make it open source in Spark upstream. We would like to socialize the 
> API design to see if we missed any use cases.
> The first API we added is *mapItems* on a dataframe, which takes a function 
> from *Column* to *Column* and applies it to the nested dataframe. Here 
> is an example:
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+++
> // |foo| bar|   items|
> // +---+++
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+++
> {code}
> Now, with the ability to apply a function on the nested dataframe, we can 
> add a new function, *withColumn* in *Column* to add or replace the existing 
> column that has the same name in the 

[jira] [Commented] (SPARK-20653) Add auto-cleanup of old elements to the new app state store

2017-11-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252560#comment-16252560
 ] 

Apache Spark commented on SPARK-20653:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19751

> Add auto-cleanup of old elements to the new app state store
> ---
>
> Key: SPARK-20653
> URL: https://issues.apache.org/jira/browse/SPARK-20653
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks restoring the functionality of the old UI where one could 
> set up a limit on the number of certain elements to be stored (e.g. only 
> store the most recent X tasks). This allows applications to control how much 
> UI data is kept around so that disk (or memory) usage is bounded.
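For comparison, the limits the old listener-based UI honours look like this when set programmatically (illustrative values only; the new store is expected to restore equivalent knobs):

{code}
import org.apache.spark.SparkConf

// Existing retention limits for the old UI (illustrative values).
val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "1000")
  .set("spark.ui.retainedStages", "1000")
  .set("spark.ui.retainedTasks", "100000")
{code}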



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20653) Add auto-cleanup of old elements to the new app state store

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20653:


Assignee: (was: Apache Spark)

> Add auto-cleanup of old elements to the new app state store
> ---
>
> Key: SPARK-20653
> URL: https://issues.apache.org/jira/browse/SPARK-20653
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks restoring the functionality of the old UI where one could 
> set up a limit on the number of certain elements to be stored (e.g. only 
> store the most recent X tasks). This allows applications to control how much 
> UI data is kept around so that disk (or memory) usage is bounded.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20653) Add auto-cleanup of old elements to the new app state store

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20653:


Assignee: Apache Spark

> Add auto-cleanup of old elements to the new app state store
> ---
>
> Key: SPARK-20653
> URL: https://issues.apache.org/jira/browse/SPARK-20653
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks restoring the functionality of the old UI where one could 
> set up a limit on the number of certain elements to be stored (e.g. only 
> store the most recent X tasks). This allows applications to control how much 
> UI data is kept around so that disk (or memory) usage is bounded.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22520) Support code generation also for complex CASE WHEN

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22520:


Assignee: Apache Spark

> Support code generation also for complex CASE WHEN
> --
>
> Key: SPARK-22520
> URL: https://issues.apache.org/jira/browse/SPARK-22520
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Minor
>
> Code generation is disabled for CaseWhen when the number of branches is 
> higher than {{spark.sql.codegen.maxCaseBranches}} (which defaults to 20). 
> This was done in SPARK-13242 to prevent the well known 64KB method limit 
> exception.
> This ticket proposes to support code generation in those cases as well (without 
> causing exceptions of course). As a side effect, we could get rid of the 
> {{spark.sql.codegen.maxCaseBranches}} configuration.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22520) Support code generation also for complex CASE WHEN

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22520:


Assignee: (was: Apache Spark)

> Support code generation also for complex CASE WHEN
> --
>
> Key: SPARK-22520
> URL: https://issues.apache.org/jira/browse/SPARK-22520
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>
> Code generation is disabled for CaseWhen when the number of branches is 
> higher than {{spark.sql.codegen.maxCaseBranches}} (which defaults to 20). 
> This was done in SPARK-13242 to prevent the well known 64KB method limit 
> exception.
> This ticket proposes to support code generation in those cases as well (without 
> causing exceptions of course). As a side effect, we could get rid of the 
> {{spark.sql.codegen.maxCaseBranches}} configuration.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22520) Support code generation also for complex CASE WHEN

2017-11-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252561#comment-16252561
 ] 

Apache Spark commented on SPARK-22520:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/19752

> Support code generation also for complex CASE WHEN
> --
>
> Key: SPARK-22520
> URL: https://issues.apache.org/jira/browse/SPARK-22520
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>
> Code generation is disabled for CaseWhen when the number of branches is 
> higher than {{spark.sql.codegen.maxCaseBranches}} (which defaults to 20). 
> This was done in SPARK-13242 to prevent the well known 64KB method limit 
> exception.
> This ticket proposes to support code generation in those cases as well (without 
> causing exceptions of course). As a side effect, we could get rid of the 
> {{spark.sql.codegen.maxCaseBranches}} configuration.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22520) Support code generation also for complex CASE WHEN

2017-11-14 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22520:
---

 Summary: Support code generation also for complex CASE WHEN
 Key: SPARK-22520
 URL: https://issues.apache.org/jira/browse/SPARK-22520
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido
Priority: Minor


Code generation is disabled for CaseWhen when the number of branches is higher 
than {{spark.sql.codegen.maxCaseBranches}} (which defaults to 20). This was 
done in SPARK-13242 to prevent the well known 64KB method limit exception.

This ticket proposes to support code generation in those cases as well (without 
causing exceptions of course). As a side effect, we could get rid of the 
{{spark.sql.codegen.maxCaseBranches}} configuration.
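For reference, a small illustration of the current behaviour, assuming a SparkSession named spark:

{code}
// A CASE WHEN with more branches than spark.sql.codegen.maxCaseBranches
// (default 20) is currently evaluated without code generation.
val wideCase = (1 to 30)
  .map(i => s"WHEN id = $i THEN $i")
  .mkString("CASE ", " ", " ELSE 0 END")

spark.range(100).selectExpr(s"$wideCase AS bucket").explain()
{code}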



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20650) Remove JobProgressListener (and other unneeded classes)

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20650:


Assignee: Apache Spark

> Remove JobProgressListener (and other unneeded classes)
> ---
>
> Key: SPARK-20650
> URL: https://issues.apache.org/jira/browse/SPARK-20650
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks removing JobProgressListener and other classes that will be 
> made obsolete by the other changes in this project, and making adjustments to 
> parts of the code that still rely on them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20650) Remove JobProgressListener (and other unneeded classes)

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20650:


Assignee: (was: Apache Spark)

> Remove JobProgressListener (and other unneeded classes)
> ---
>
> Key: SPARK-20650
> URL: https://issues.apache.org/jira/browse/SPARK-20650
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks removing JobProgressListener and other classes that will be 
> made obsolete by the other changes in this project, and making adjustments to 
> parts of the code that still rely on them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20650) Remove JobProgressListener (and other unneeded classes)

2017-11-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252441#comment-16252441
 ] 

Apache Spark commented on SPARK-20650:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19750

> Remove JobProgressListener (and other unneeded classes)
> ---
>
> Key: SPARK-20650
> URL: https://issues.apache.org/jira/browse/SPARK-20650
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks removing JobProgressListener and other classes that will be 
> made obsolete by the other changes in this project, and making adjustments to 
> parts of the code that still rely on them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20652) Make SQL UI use new app state store

2017-11-14 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-20652.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19681
[https://github.com/apache/spark/pull/19681]

> Make SQL UI use new app state store
> ---
>
> Key: SPARK-20652
> URL: https://issues.apache.org/jira/browse/SPARK-20652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
> Fix For: 2.3.0
>
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks modifying the SQL listener and UI code to use the new app 
> state store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20652) Make SQL UI use new app state store

2017-11-14 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-20652:


Assignee: Marcelo Vanzin

> Make SQL UI use new app state store
> ---
>
> Key: SPARK-20652
> URL: https://issues.apache.org/jira/browse/SPARK-20652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.3.0
>
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks modifying the SQL listener and UI code to use the new app 
> state store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22519) Remove unnecessary stagingDirPath null check in ApplicationMaster.cleanupStagingDir()

2017-11-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252110#comment-16252110
 ] 

Apache Spark commented on SPARK-22519:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/19749

> Remove unnecessary stagingDirPath null check in 
> ApplicationMaster.cleanupStagingDir()
> -
>
> Key: SPARK-22519
> URL: https://issues.apache.org/jira/browse/SPARK-22519
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>Priority: Trivial
>
> In the code below, the condition checks whether stagingDirPath is null, but 
> stagingDirPath never becomes null. If the SPARK_YARN_STAGING_DIR env var is 
> null, an NPE is thrown while creating the Path instead.
> {code:title=ApplicationMaster.scala|borderStyle=solid}
> stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
> if (stagingDirPath == null) {
>   logError("Staging directory is null")
>   return
> }
> {code}
> Here we need to check whether System.getenv("SPARK_YARN_STAGING_DIR") is 
> null, not the stagingDirPath.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22519) Remove unnecessary stagingDirPath null check in ApplicationMaster.cleanupStagingDir()

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22519:


Assignee: Apache Spark

> Remove unnecessary stagingDirPath null check in 
> ApplicationMaster.cleanupStagingDir()
> -
>
> Key: SPARK-22519
> URL: https://issues.apache.org/jira/browse/SPARK-22519
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>Assignee: Apache Spark
>Priority: Trivial
>
> In the code below, the condition checks whether stagingDirPath is null, but 
> stagingDirPath never becomes null. If the SPARK_YARN_STAGING_DIR env var is 
> null, an NPE is thrown while creating the Path instead.
> {code:title=ApplicationMaster.scala|borderStyle=solid}
> stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
> if (stagingDirPath == null) {
>   logError("Staging directory is null")
>   return
> }
> {code}
> Here we need to check whether System.getenv("SPARK_YARN_STAGING_DIR") is 
> null, not the stagingDirPath.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22519) Remove unnecessary stagingDirPath null check in ApplicationMaster.cleanupStagingDir()

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22519:


Assignee: (was: Apache Spark)

> Remove unnecessary stagingDirPath null check in 
> ApplicationMaster.cleanupStagingDir()
> -
>
> Key: SPARK-22519
> URL: https://issues.apache.org/jira/browse/SPARK-22519
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>Priority: Trivial
>
> In the code below, the condition checks whether stagingDirPath is null, but 
> stagingDirPath never becomes null. If the SPARK_YARN_STAGING_DIR env var is 
> null, an NPE is thrown while creating the Path instead.
> {code:title=ApplicationMaster.scala|borderStyle=solid}
> stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
> if (stagingDirPath == null) {
>   logError("Staging directory is null")
>   return
> }
> {code}
> Here we need to check whether System.getenv("SPARK_YARN_STAGING_DIR") is 
> null, not the stagingDirPath.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22519) Remove unnecessary stagingDirPath null check in ApplicationMaster.cleanupStagingDir()

2017-11-14 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K updated SPARK-22519:
--
Summary: Remove unnecessary stagingDirPath null check in 
ApplicationMaster.cleanupStagingDir()  (was: 
ApplicationMaster.cleanupStagingDir() throws NPE when SPARK_YARN_STAGING_DIR 
env var is not available)

> Remove unnecessary stagingDirPath null check in 
> ApplicationMaster.cleanupStagingDir()
> -
>
> Key: SPARK-22519
> URL: https://issues.apache.org/jira/browse/SPARK-22519
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>Priority: Minor
>
> In the code below, the condition checks whether stagingDirPath is null, but 
> stagingDirPath never becomes null. If the SPARK_YARN_STAGING_DIR env var is 
> null, an NPE is thrown while creating the Path instead.
> {code:title=ApplicationMaster.scala|borderStyle=solid}
> stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
> if (stagingDirPath == null) {
>   logError("Staging directory is null")
>   return
> }
> {code}
> Here we need to check whether System.getenv("SPARK_YARN_STAGING_DIR") is 
> null, not the stagingDirPath.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22519) Remove unnecessary stagingDirPath null check in ApplicationMaster.cleanupStagingDir()

2017-11-14 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K updated SPARK-22519:
--
Priority: Trivial  (was: Minor)

> Remove unnecessary stagingDirPath null check in 
> ApplicationMaster.cleanupStagingDir()
> -
>
> Key: SPARK-22519
> URL: https://issues.apache.org/jira/browse/SPARK-22519
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>Priority: Trivial
>
> In the code below, the condition checks whether stagingDirPath is null, but 
> stagingDirPath never becomes null. If the SPARK_YARN_STAGING_DIR env var is 
> null, an NPE is thrown while creating the Path instead.
> {code:title=ApplicationMaster.scala|borderStyle=solid}
> stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
> if (stagingDirPath == null) {
>   logError("Staging directory is null")
>   return
> }
> {code}
> Here we need to check whether System.getenv("SPARK_YARN_STAGING_DIR") is 
> null, not the stagingDirPath.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2017-11-14 Thread Aihua Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252007#comment-16252007
 ] 

Aihua Xu commented on SPARK-18673:
--

HIVE-15016 has been committed, so Hive now supports Hadoop 3. Can Spark start 
working on Hadoop 3 support?

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19371) Cannot spread cached partitions evenly across executors

2017-11-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251990#comment-16251990
 ] 

Sean Owen commented on SPARK-19371:
---

The question is whether this is something that needs a change in Spark. What do 
you propose?

I think the effect you cite relates to locality (see above). If so, that is 
working as intended, insofar as there are downsides to spending time sending 
cached partitions around and/or waiting for local execution. 

> Cannot spread cached partitions evenly across executors
> ---
>
> Key: SPARK-19371
> URL: https://issues.apache.org/jira/browse/SPARK-19371
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Thunder Stumpges
> Attachments: Unbalanced RDD Blocks, and resulting task imbalance.png, 
> Unbalanced RDD Blocks, and resulting task imbalance.png
>
>
> Before running an intensive iterative job (in this case a distributed topic 
> model training), we need to load a dataset and persist it across executors. 
> After loading from HDFS and persisting, the partitions are spread unevenly 
> across executors (based on the initial scheduling of the reads, which are not 
> data-locality sensitive). The partition sizes are even, just not their 
> distribution over executors. We currently have no way to force the partitions 
> to spread evenly, and as the iterative algorithm begins, tasks are 
> distributed to executors based on this initial load, forcing some very 
> unbalanced work.
> This has been mentioned a 
> [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059]
>  of 
> [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html]
>  in 
> [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html]
>  user/dev group threads.
> None of the discussions I could find had solutions that worked for me. Here 
> are examples of things I have tried. All resulted in partitions in memory 
> that were NOT evenly distributed to executors, causing future tasks to be 
> imbalanced across executors as well.
> *Reduce Locality*
> {code}spark.shuffle.reduceLocality.enabled=false/true{code}
> *"Legacy" memory mode*
> {code}spark.memory.useLegacyMode = true/false{code}
> *Basic load and repartition*
> {code}
> val numPartitions = 48*16
> val df = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions).
> persist
> df.count
> {code}
> *Load and repartition to 2x partitions, then shuffle repartition down to 
> desired partitions*
> {code}
> val numPartitions = 48*16
> val df2 = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions*2)
> val df = df2.repartition(numPartitions).
> persist
> df.count
> {code}
> It would be great if, when persisting an RDD/DataFrame, we could request 
> that those partitions be stored evenly across executors in preparation for 
> future tasks. 
> I'm not sure if this is a more general issue (i.e. not just involving 
> persisting RDDs), but for the persisted in-memory case, it can make a HUGE 
> difference in the overall running time of the remaining work.
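A rough way to see the imbalance described here from the driver (a sketch only; getExecutorMemoryStatus reports storage-memory headroom, so it is just a proxy for where the cached blocks ended up):

{code}
// After persisting and materializing, print per-executor memory usage.
df.count()
sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, remaining)) =>
  val usedMb = (maxMem - remaining) / (1024 * 1024)
  println(s"$executor uses ~$usedMb MB of block-manager memory")
}
{code}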



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22505) toDF() / createDataFrame() type inference doesn't work as expected

2017-11-14 Thread Ruslan Dautkhanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-22505:
--
Issue Type: Improvement  (was: Bug)

> toDF() / createDataFrame() type inference doesn't work as expected
> --
>
> Key: SPARK-22505
> URL: https://issues.apache.org/jira/browse/SPARK-22505
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>  Labels: csvparser, inference, pyspark, schema, spark-sql
>
> {code}
> df = 
> sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str'])
> df.printSchema()
> {code}
> produces
> {noformat}
> root
>  |-- should_be_int: string (nullable = true)
>  |-- should_be_str: string (nullable = true)
> {noformat}
> Notice that `should_be_int` has the `string` datatype, even though, according to the documentation:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
> {quote}
> Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the 
> datatypes. Rows are constructed by passing a list of key/value pairs as 
> kwargs to the Row class. The keys of this list define the column names of the 
> table, *and the types are inferred by sampling the whole dataset*, similar to 
> the inference that is performed on JSON files.
> {quote}
> Schema inference works as expected when reading delimited files like
> {code}
> spark.read.format('csv').option('inferSchema', True)...
> {code}
> but not when using toDF() / createDataFrame() API calls.
> Spark 2.2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22505) toDF() / createDataFrame() type inference doesn't work as expected

2017-11-14 Thread Ruslan Dautkhanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-22505:
--
Affects Version/s: 2.3.0

> toDF() / createDataFrame() type inference doesn't work as expected
> --
>
> Key: SPARK-22505
> URL: https://issues.apache.org/jira/browse/SPARK-22505
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>  Labels: csvparser, inference, pyspark, schema, spark-sql
>
> {code}
> df = 
> sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str'])
> df.printSchema()
> {code}
> produces
> {noformat}
> root
>  |-- should_be_int: string (nullable = true)
>  |-- should_be_str: string (nullable = true)
> {noformat}
> Notice that `should_be_int` has the `string` datatype, even though, according to the documentation:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
> {quote}
> Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the 
> datatypes. Rows are constructed by passing a list of key/value pairs as 
> kwargs to the Row class. The keys of this list define the column names of the 
> table, *and the types are inferred by sampling the whole dataset*, similar to 
> the inference that is performed on JSON files.
> {quote}
> Schema inference works as expected when reading delimited files like
> {code}
> spark.read.format('csv').option('inferSchema', True)...
> {code}
> but not when using toDF() / createDataFrame() API calls.
> Spark 2.2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19371) Cannot spread cached partitions evenly across executors

2017-11-14 Thread Thunder Stumpges (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251971#comment-16251971
 ] 

Thunder Stumpges edited comment on SPARK-19371 at 11/14/17 7:04 PM:


Hi [~sowen],

From my perspective, you can attach whatever label you want (bug, feature, 
story, PITA), but it is obviously causing trouble for more than just an 
isolated case. 

And to re-state the issue, this is NOT (in my case at least) related to 
partitioner / hashing. My partitions are sized evenly, the keys are unique 
words (String) in a vocabulary. The issue is related to where the RDD blocks 
are distributed to the executors in cache storage. See the image below which 
shows the problem. 

What would solve this for me is not really a "shuffle" but a "rebalance 
storage" that moves RDD blocks in memory to balance them across executors.

!Unbalanced RDD Blocks, and resulting task imbalance.png!


was (Author: thunderstumpges):
Hi Sean,

From my perspective, you can attach whatever label you want (bug, feature, 
story, PITA), but it is obviously causing trouble for more than just an 
isolated case. 

And to re-state the issue, this is NOT (in my case at least) related to 
partitioner / hashing. My partitions are sized evenly, the keys are unique 
words (String) in a vocabulary. The issue is related to where the RDD blocks 
are distributed to the executors in cache storage. See the image below which 
shows the problem. 

What would solve this for me is not really a "shuffle" but a "rebalance 
storage" that moves RDD blocks in memory to balance them across executors.

!Unbalanced RDD Blocks, and resulting task imbalance.png!

> Cannot spread cached partitions evenly across executors
> ---
>
> Key: SPARK-19371
> URL: https://issues.apache.org/jira/browse/SPARK-19371
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Thunder Stumpges
> Attachments: Unbalanced RDD Blocks, and resulting task imbalance.png, 
> Unbalanced RDD Blocks, and resulting task imbalance.png
>
>
> Before running an intensive iterative job (in this case a distributed topic 
> model training), we need to load a dataset and persist it across executors. 
> After loading from HDFS and persisting, the partitions are spread unevenly 
> across executors (based on the initial scheduling of the reads, which are not 
> data-locality sensitive). The partition sizes are even, just not their 
> distribution over executors. We currently have no way to force the partitions 
> to spread evenly, and as the iterative algorithm begins, tasks are 
> distributed to executors based on this initial load, forcing some very 
> unbalanced work.
> This has been mentioned a 
> [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059]
>  of 
> [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html]
>  in 
> [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html]
>  user/dev group threads.
> None of the discussions I could find had solutions that worked for me. Here 
> are examples of things I have tried. All resulted in partitions in memory 
> that were NOT evenly distributed to executors, causing future tasks to be 
> imbalanced across executors as well.
> *Reduce Locality*
> {code}spark.shuffle.reduceLocality.enabled=false/true{code}
> *"Legacy" memory mode*
> {code}spark.memory.useLegacyMode = true/false{code}
> *Basic load and repartition*
> {code}
> val numPartitions = 48*16
> val df = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions).
> persist
> df.count
> {code}
> *Load and repartition to 2x partitions, then shuffle repartition down to 
> desired partitions*
> {code}
> val numPartitions = 48*16
> val df2 = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions*2)
> val df = df2.repartition(numPartitions).
> persist
> df.count
> {code}
> It would be great if, when persisting an RDD/DataFrame, we could request 
> that those partitions be stored evenly across executors in preparation for 
> future tasks. 
> I'm not sure if this is a more general issue (i.e. not just involving 
> persisting RDDs), but for the persisted in-memory case, it can make a HUGE 
> difference in the overall running time of the remaining work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19371) Cannot spread cached partitions evenly across executors

2017-11-14 Thread Thunder Stumpges (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251971#comment-16251971
 ] 

Thunder Stumpges commented on SPARK-19371:
--

Hi Sean,

From my perspective, you can attach whatever label you want (bug, feature, 
story, PITA), but it is obviously causing trouble for more than just an 
isolated case. 

And to re-state the issue, this is NOT (in my case at least) related to 
partitioner / hashing. My partitions are sized evenly, the keys are unique 
words (String) in a vocabulary. The issue is related to where the RDD blocks 
are distributed to the executors in cache storage. See the image below which 
shows the problem. 

What would solve this for me is not really a "shuffle" but a "rebalance 
storage" that moves RDD blocks in memory to balance them across executors.

!Unbalanced RDD Blocks, and resulting task imbalance.png!

> Cannot spread cached partitions evenly across executors
> ---
>
> Key: SPARK-19371
> URL: https://issues.apache.org/jira/browse/SPARK-19371
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Thunder Stumpges
> Attachments: Unbalanced RDD Blocks, and resulting task imbalance.png, 
> Unbalanced RDD Blocks, and resulting task imbalance.png
>
>
> Before running an intensive iterative job (in this case a distributed topic 
> model training), we need to load a dataset and persist it across executors. 
> After loading from HDFS and persisting, the partitions are spread unevenly 
> across executors (based on the initial scheduling of the reads, which are not 
> data-locality sensitive). The partition sizes are even, just not their 
> distribution over executors. We currently have no way to force the partitions 
> to spread evenly, and as the iterative algorithm begins, tasks are 
> distributed to executors based on this initial load, forcing some very 
> unbalanced work.
> This has been mentioned a 
> [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059]
>  of 
> [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html]
>  in 
> [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html]
>  user/dev group threads.
> None of the discussions I could find had solutions that worked for me. Here 
> are examples of things I have tried. All resulted in partitions in memory 
> that were NOT evenly distributed to executors, causing future tasks to be 
> imbalanced across executors as well.
> *Reduce Locality*
> {code}spark.shuffle.reduceLocality.enabled=false/true{code}
> *"Legacy" memory mode*
> {code}spark.memory.useLegacyMode = true/false{code}
> *Basic load and repartition*
> {code}
> val numPartitions = 48*16
> val df = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions).
> persist
> df.count
> {code}
> *Load and repartition to 2x partitions, then shuffle repartition down to 
> desired partitions*
> {code}
> val numPartitions = 48*16
> val df2 = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions*2)
> val df = df2.repartition(numPartitions).
> persist
> df.count
> {code}
> It would be great if, when persisting an RDD/DataFrame, we could request 
> that those partitions be stored evenly across executors in preparation for 
> future tasks. 
> I'm not sure if this is a more general issue (i.e. not just involving 
> persisting RDDs), but for the persisted in-memory case, it can make a HUGE 
> difference in the overall running time of the remaining work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19371) Cannot spread cached partitions evenly across executors

2017-11-14 Thread Thunder Stumpges (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thunder Stumpges updated SPARK-19371:
-
Attachment: Unbalanced RDD Blocks, and resulting task imbalance.png

> Cannot spread cached partitions evenly across executors
> ---
>
> Key: SPARK-19371
> URL: https://issues.apache.org/jira/browse/SPARK-19371
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Thunder Stumpges
> Attachments: Unbalanced RDD Blocks, and resulting task imbalance.png, 
> Unbalanced RDD Blocks, and resulting task imbalance.png
>
>
> Before running an intensive iterative job (in this case a distributed topic 
> model training), we need to load a dataset and persist it across executors. 
> After loading from HDFS and persisting, the partitions are spread unevenly 
> across executors (based on the initial scheduling of the reads, which are not 
> data-locality sensitive). The partition sizes are even, just not their 
> distribution over executors. We currently have no way to force the partitions 
> to spread evenly, and as the iterative algorithm begins, tasks are 
> distributed to executors based on this initial load, forcing some very 
> unbalanced work.
> This has been mentioned a 
> [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059]
>  of 
> [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html]
>  in 
> [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html]
>  user/dev group threads.
> None of the discussions I could find had solutions that worked for me. Here 
> are examples of things I have tried. All resulted in partitions in memory 
> that were NOT evenly distributed to executors, causing future tasks to be 
> imbalanced across executors as well.
> *Reduce Locality*
> {code}spark.shuffle.reduceLocality.enabled=false/true{code}
> *"Legacy" memory mode*
> {code}spark.memory.useLegacyMode = true/false{code}
> *Basic load and repartition*
> {code}
> val numPartitions = 48*16
> val df = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions).
> persist
> df.count
> {code}
> *Load and repartition to 2x partitions, then shuffle repartition down to 
> desired partitions*
> {code}
> val numPartitions = 48*16
> val df2 = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions*2)
> val df = df2.repartition(numPartitions).
> persist
> df.count
> {code}
> It would be great if, when persisting an RDD/DataFrame, we could request 
> that those partitions be stored evenly across executors in preparation for 
> future tasks. 
> I'm not sure if this is a more general issue (i.e. not just involving 
> persisting RDDs), but for the persisted in-memory case, it can make a HUGE 
> difference in the overall running time of the remaining work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19371) Cannot spread cached partitions evenly across executors

2017-11-14 Thread Thunder Stumpges (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thunder Stumpges updated SPARK-19371:
-
Attachment: Unbalanced RDD Blocks, and resulting task imbalance.png

> Cannot spread cached partitions evenly across executors
> ---
>
> Key: SPARK-19371
> URL: https://issues.apache.org/jira/browse/SPARK-19371
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Thunder Stumpges
> Attachments: Unbalanced RDD Blocks, and resulting task imbalance.png
>
>
> Before running an intensive iterative job (in this case a distributed topic 
> model training), we need to load a dataset and persist it across executors. 
> After loading from HDFS and persisting, the partitions are spread unevenly 
> across executors (based on the initial scheduling of the reads, which are not 
> data-locality sensitive). The partition sizes are even, just not their 
> distribution over executors. We currently have no way to force the partitions 
> to spread evenly, and as the iterative algorithm begins, tasks are 
> distributed to executors based on this initial load, forcing some very 
> unbalanced work.
> This has been mentioned a 
> [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059]
>  of 
> [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html]
>  in 
> [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html]
>  user/dev group threads.
> None of the discussions I could find had solutions that worked for me. Here 
> are examples of things I have tried. All resulted in partitions in memory 
> that were NOT evenly distributed to executors, causing future tasks to be 
> imbalanced across executors as well.
> *Reduce Locality*
> {code}spark.shuffle.reduceLocality.enabled=false/true{code}
> *"Legacy" memory mode*
> {code}spark.memory.useLegacyMode = true/false{code}
> *Basic load and repartition*
> {code}
> val numPartitions = 48*16
> val df = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions).
> persist
> df.count
> {code}
> *Load and repartition to 2x partitions, then shuffle repartition down to 
> desired partitions*
> {code}
> val numPartitions = 48*16
> val df2 = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions*2)
> val df = df2.repartition(numPartitions).
> persist
> df.count
> {code}
> It would be great if, when persisting an RDD/DataFrame, we could request 
> that those partitions be stored evenly across executors in preparation for 
> future tasks. 
> I'm not sure if this is a more general issue (i.e. not just involving 
> persisting RDDs), but for the persisted in-memory case, it can make a HUGE 
> difference in the overall running time of the remaining work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22519) ApplicationMaster.cleanupStagingDir() throws NPE when SPARK_YARN_STAGING_DIR env var is not available

2017-11-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251929#comment-16251929
 ] 

Marcelo Vanzin commented on SPARK-22519:


Yeah, that check seems bogus, but {{SPARK_YARN_STAGING_DIR}} should never 
really be unset.

> ApplicationMaster.cleanupStagingDir() throws NPE when SPARK_YARN_STAGING_DIR 
> env var is not available
> -
>
> Key: SPARK-22519
> URL: https://issues.apache.org/jira/browse/SPARK-22519
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>Priority: Minor
>
> In the code below, the condition checks whether stagingDirPath is null, but 
> stagingDirPath can never be null. If the SPARK_YARN_STAGING_DIR env var is null, 
> then creating the Path throws an NPE.
> {code:title=ApplicationMaster.scala|borderStyle=solid}
> stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
> if (stagingDirPath == null) {
>   logError("Staging directory is null")
>   return
> }
> {code}
> Here we need to check whether System.getenv("SPARK_YARN_STAGING_DIR") is null, 
> not whether stagingDirPath is.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22519) ApplicationMaster.cleanupStagingDir() throws NPE when SPARK_YARN_STAGING_DIR env var is not available

2017-11-14 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251925#comment-16251925
 ] 

Devaraj K commented on SPARK-22519:
---

It is not a usual case; I saw this NPE while working on SPARK-22404, when the 
SPARK_YARN_STAGING_DIR env var doesn't exist. If you don't think a null check on 
the env var is needed, then at least note that the *if (stagingDirPath == null) {* 
check is dead code.
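
For illustration only, a minimal sketch of that alternative (not an actual patch; in the real code stagingDirPath is a var field on ApplicationMaster):
{code}
// Validate the env var before building the Path, so a missing
// SPARK_YARN_STAGING_DIR is logged instead of surfacing as an NPE.
val stagingDirEnv = System.getenv("SPARK_YARN_STAGING_DIR")
if (stagingDirEnv == null) {
  logError("Staging directory is null")
  return
}
stagingDirPath = new Path(stagingDirEnv)
{code}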

> ApplicationMaster.cleanupStagingDir() throws NPE when SPARK_YARN_STAGING_DIR 
> env var is not available
> -
>
> Key: SPARK-22519
> URL: https://issues.apache.org/jira/browse/SPARK-22519
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>Priority: Minor
>
> In the code below, the condition checks whether stagingDirPath is null, but 
> stagingDirPath can never be null. If the SPARK_YARN_STAGING_DIR env var is null, 
> then creating the Path throws an NPE.
> {code:title=ApplicationMaster.scala|borderStyle=solid}
> stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
> if (stagingDirPath == null) {
>   logError("Staging directory is null")
>   return
> }
> {code}
> Here we need to check whether System.getenv("SPARK_YARN_STAGING_DIR") is null, 
> not whether stagingDirPath is.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22519) ApplicationMaster.cleanupStagingDir() throws NPE when SPARK_YARN_STAGING_DIR env var is not available

2017-11-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251866#comment-16251866
 ] 

Marcelo Vanzin commented on SPARK-22519:


Question is, when would it be null since it's set by Client.scala?

{code}
  private def setupLaunchEnv(
  stagingDirPath: Path,
  pySparkArchives: Seq[String]): HashMap[String, String] = {
...
env("SPARK_YARN_STAGING_DIR") = stagingDirPath.toString
{code}

> ApplicationMaster.cleanupStagingDir() throws NPE when SPARK_YARN_STAGING_DIR 
> env var is not available
> -
>
> Key: SPARK-22519
> URL: https://issues.apache.org/jira/browse/SPARK-22519
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>Priority: Minor
>
> In the code below, the condition checks whether stagingDirPath is null, but 
> stagingDirPath can never be null. If the SPARK_YARN_STAGING_DIR env var is null, 
> then creating the Path throws an NPE.
> {code:title=ApplicationMaster.scala|borderStyle=solid}
> stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
> if (stagingDirPath == null) {
>   logError("Staging directory is null")
>   return
> }
> {code}
> Here we need to check whether System.getenv("SPARK_YARN_STAGING_DIR") is null, 
> not whether stagingDirPath is.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22519) ApplicationMaster.cleanupStagingDir() throws NPE when SPARK_YARN_STAGING_DIR env var is not available

2017-11-14 Thread Devaraj K (JIRA)
Devaraj K created SPARK-22519:
-

 Summary: ApplicationMaster.cleanupStagingDir() throws NPE when 
SPARK_YARN_STAGING_DIR env var is not available
 Key: SPARK-22519
 URL: https://issues.apache.org/jira/browse/SPARK-22519
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.2.0
Reporter: Devaraj K
Priority: Minor


In the code below, the condition checks whether stagingDirPath is null, but 
stagingDirPath can never be null. If the SPARK_YARN_STAGING_DIR env var is null, 
then creating the Path throws an NPE.

{code:title=ApplicationMaster.scala|borderStyle=solid}
stagingDirPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"))
if (stagingDirPath == null) {
  logError("Staging directory is null")
  return
}
{code}


Here we need to check whether System.getenv("SPARK_YARN_STAGING_DIR") is null, 
not whether stagingDirPath is.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15428) Disable support for multiple streaming aggregations

2017-11-14 Thread Shashidhar Reddy Sudi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251836#comment-16251836
 ] 

Shashidhar Reddy Sudi commented on SPARK-15428:
---

Is there any plan to make this available in an upcoming release, please?

> Disable support for multiple streaming aggregations
> ---
>
> Key: SPARK-15428
> URL: https://issues.apache.org/jira/browse/SPARK-15428
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.0
>
>
> Incrementalizing plans with multiple streaming aggregations is tricky, and 
> we don't have the necessary support for "delta" to implement it correctly, so 
> we are disabling support for multiple streaming aggregations.
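
For reference, a minimal sketch (not from the original ticket; names are illustrative) of the kind of plan this change rejects — two streaming aggregations stacked on one another:
{code}
// Assumes a SparkSession `spark`; the "rate" source (a built-in test source in
// recent Spark versions) just generates rows with `timestamp` and `value` columns.
import spark.implicits._

val events = spark.readStream.format("rate").load()

val perBucket = events.groupBy($"value" % 10).count()   // first streaming aggregation
val total     = perBucket.groupBy().sum("count")        // second aggregation on top

// Starting this query fails analysis with an error stating that multiple
// streaming aggregations are not supported on streaming DataFrames/Datasets.
val query = total.writeStream.outputMode("complete").format("console").start()
{code}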



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20649) Simplify REST API class hierarchy

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20649:


Assignee: (was: Apache Spark)

> Simplify REST API class hierarchy
> -
>
> Key: SPARK-20649
> URL: https://issues.apache.org/jira/browse/SPARK-20649
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks simplifying the REST API hierarchy. Because a lot of the 
> code used for the current version of the REST API will be deleted as part of 
> the UI work (see SPARK-20648, SPARK-20647, SPARK-20646, SPARK-20645), we can 
> simplify the code by merging a bunch of the existing resource classes in the 
> REST API (and also getting rid of some duplication in the process).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20649) Simplify REST API class hierarchy

2017-11-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251825#comment-16251825
 ] 

Apache Spark commented on SPARK-20649:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19748

> Simplify REST API class hierarchy
> -
>
> Key: SPARK-20649
> URL: https://issues.apache.org/jira/browse/SPARK-20649
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks simplifying the REST API hierarchy. Because a lot of the 
> code used for the current version of the REST API will be deleted as part of 
> the UI work (see SPARK-20648, SPARK-20647, SPARK-20646, SPARK-20645), we can 
> simplify the code by merging a bunch of the existing resource classes in the 
> REST API (and also getting rid of some duplication in the process).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20649) Simplify REST API class hierarchy

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20649:


Assignee: Apache Spark

> Simplify REST API class hierarchy
> -
>
> Key: SPARK-20649
> URL: https://issues.apache.org/jira/browse/SPARK-20649
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks simplifying the REST API hierarchy. Because a lot of the 
> code used for the current version of the REST API will be deleted as part of 
> the UI work (see SPARK-20648, SPARK-20647, SPARK-20646, SPARK-20645), we can 
> simplify the code by merging a bunch of the existing resource classes in the 
> REST API (and also getting rid of some duplication in the process).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22267) Spark SQL incorrectly reads ORC file when column order is different

2017-11-14 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251788#comment-16251788
 ] 

Dongjoon Hyun edited comment on SPARK-22267 at 11/14/17 5:36 PM:
-

[~cloud_fan]. This issue comes from the old Hive reader path.
I'll check the PR and email.


was (Author: dongjoon):
[~cloud_fan]. This issue comes from old Hive reader path.

> Spark SQL incorrectly reads ORC file when column order is different
> ---
>
> Key: SPARK-22267
> URL: https://issues.apache.org/jira/browse/SPARK-22267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>
> For a long time, Apache Spark SQL has returned incorrect results when the ORC 
> file schema differs from the metastore schema in column order.
> {code}
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("parquet").mode("overwrite").save("/tmp/p")
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save("/tmp/o")
> scala> sql("CREATE EXTERNAL TABLE p(c2 INT, c1 INT) STORED AS parquet 
> LOCATION '/tmp/p'")
> scala> sql("CREATE EXTERNAL TABLE o(c2 INT, c1 INT) STORED AS orc LOCATION 
> '/tmp/o'")
> scala> spark.table("p").show  // Parquet is good.
> +---+---+
> | c2| c1|
> +---+---+
> |  2|  1|
> +---+---+
> scala> spark.table("o").show  // This is wrong.
> +---+---+
> | c2| c1|
> +---+---+
> |  1|  2|
> +---+---+
> scala> spark.read.orc("/tmp/o").show  // This is correct.
> +---+---+
> | c1| c2|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> *TESTCASE*
> {code}
>   test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order 
> is different") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
>   Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save(path)
>   checkAnswer(spark.read.orc(path), Row(1, 2))
>   Seq("true", "false").foreach { value =>
> withTable("t") {
>   withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
> sql(s"CREATE EXTERNAL TABLE t(c2 INT, c1 INT) STORED AS ORC 
> LOCATION '$path'")
> checkAnswer(spark.table("t"), Row(2, 1))
>   }
> }
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22267) Spark SQL incorrectly reads ORC file when column order is different

2017-11-14 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251788#comment-16251788
 ] 

Dongjoon Hyun commented on SPARK-22267:
---

[~cloud_fan]. This issue comes from the old Hive reader path.

> Spark SQL incorrectly reads ORC file when column order is different
> ---
>
> Key: SPARK-22267
> URL: https://issues.apache.org/jira/browse/SPARK-22267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>
> For a long time, Apache Spark SQL has returned incorrect results when the ORC 
> file schema differs from the metastore schema in column order.
> {code}
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("parquet").mode("overwrite").save("/tmp/p")
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save("/tmp/o")
> scala> sql("CREATE EXTERNAL TABLE p(c2 INT, c1 INT) STORED AS parquet 
> LOCATION '/tmp/p'")
> scala> sql("CREATE EXTERNAL TABLE o(c2 INT, c1 INT) STORED AS orc LOCATION 
> '/tmp/o'")
> scala> spark.table("p").show  // Parquet is good.
> +---+---+
> | c2| c1|
> +---+---+
> |  2|  1|
> +---+---+
> scala> spark.table("o").show  // This is wrong.
> +---+---+
> | c2| c1|
> +---+---+
> |  1|  2|
> +---+---+
> scala> spark.read.orc("/tmp/o").show  // This is correct.
> +---+---+
> | c1| c2|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> *TESTCASE*
> {code}
>   test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order 
> is different") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
>   Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save(path)
>   checkAnswer(spark.read.orc(path), Row(1, 2))
>   Seq("true", "false").foreach { value =>
> withTable("t") {
>   withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
> sql(s"CREATE EXTERNAL TABLE t(c2 INT, c1 INT) STORED AS ORC 
> LOCATION '$path'")
> checkAnswer(spark.table("t"), Row(2, 1))
>   }
> }
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22431) Creating Permanent view with illegal type

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22431:


Assignee: Apache Spark

> Creating Permanent view with illegal type
> -
>
> Key: SPARK-22431
> URL: https://issues.apache.org/jira/browse/SPARK-22431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>
> It is possible in Spark SQL to create a permanent view that uses a nested 
> field with an illegal name.
> For example if we create the following view:
> {noformat}
> create view x as select struct('a' as `$q`, 1 as b) q
> {noformat}
> A simple select fails with the following exception:
> {noformat}
> select * from x;
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct<$q:string,b:int>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
> ...
> {noformat}
> Dropping the view isn't possible either:
> {noformat}
> drop view x;
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct<$q:string,b:int>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22431) Creating Permanent view with illegal type

2017-11-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251739#comment-16251739
 ] 

Apache Spark commented on SPARK-22431:
--

User 'skambha' has created a pull request for this issue:
https://github.com/apache/spark/pull/19747

> Creating Permanent view with illegal type
> -
>
> Key: SPARK-22431
> URL: https://issues.apache.org/jira/browse/SPARK-22431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>
> It is possible in Spark SQL to create a permanent view that uses a nested 
> field with an illegal name.
> For example if we create the following view:
> {noformat}
> create view x as select struct('a' as `$q`, 1 as b) q
> {noformat}
> A simple select fails with the following exception:
> {noformat}
> select * from x;
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct<$q:string,b:int>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
> ...
> {noformat}
> Dropping the view isn't possible either:
> {noformat}
> drop view x;
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct<$q:string,b:int>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22431) Creating Permanent view with illegal type

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22431:


Assignee: (was: Apache Spark)

> Creating Permanent view with illegal type
> -
>
> Key: SPARK-22431
> URL: https://issues.apache.org/jira/browse/SPARK-22431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>
> It is possible in Spark SQL to create a permanent view that uses a nested 
> field with an illegal name.
> For example if we create the following view:
> {noformat}
> create view x as select struct('a' as `$q`, 1 as b) q
> {noformat}
> A simple select fails with the following exception:
> {noformat}
> select * from x;
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct<$q:string,b:int>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
> ...
> {noformat}
> Dropping the view isn't possible either:
> {noformat}
> drop view x;
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct<$q:string,b:int>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20648) Make Jobs and Stages pages use the new app state store

2017-11-14 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-20648:


Assignee: Marcelo Vanzin

> Make Jobs and Stages pages use the new app state store
> --
>
> Key: SPARK-20648
> URL: https://issues.apache.org/jira/browse/SPARK-20648
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.3.0
>
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks making both the Jobs and Stages pages use the new app state 
> store. Because these two pages are very tightly coupled, it's easier to 
> modify both in one go.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20648) Make Jobs and Stages pages use the new app state store

2017-11-14 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-20648.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19698
[https://github.com/apache/spark/pull/19698]

> Make Jobs and Stages pages use the new app state store
> --
>
> Key: SPARK-20648
> URL: https://issues.apache.org/jira/browse/SPARK-20648
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
> Fix For: 2.3.0
>
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks making both the Jobs and Stages pages use the new app state 
> store. Because these two pages are very tightly coupled, it's easier to 
> modify both in one go.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2489) Unsupported parquet datatype optional fixed_len_byte_array

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2489:
---

Assignee: Apache Spark

> Unsupported parquet datatype optional fixed_len_byte_array
> --
>
> Key: SPARK-2489
> URL: https://issues.apache.org/jira/browse/SPARK-2489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 2.2.0
>Reporter: Pei-Lun Lee
>Assignee: Apache Spark
>
> tested against commit 9fe693b5
> {noformat}
> scala> sqlContext.parquetFile("/tmp/foo")
> java.lang.RuntimeException: Unsupported parquet datatype optional 
> fixed_len_byte_array(4) b
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:58)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:109)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:282)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:279)
> {noformat}
> example avro schema
> {noformat}
> protocol Test {
> fixed Bytes4(4);
> record Foo {
> union {null, Bytes4} b;
> }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2489) Unsupported parquet datatype optional fixed_len_byte_array

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2489:
---

Assignee: (was: Apache Spark)

> Unsupported parquet datatype optional fixed_len_byte_array
> --
>
> Key: SPARK-2489
> URL: https://issues.apache.org/jira/browse/SPARK-2489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 2.2.0
>Reporter: Pei-Lun Lee
>
> tested against commit 9fe693b5
> {noformat}
> scala> sqlContext.parquetFile("/tmp/foo")
> java.lang.RuntimeException: Unsupported parquet datatype optional 
> fixed_len_byte_array(4) b
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:58)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:109)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:282)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:279)
> {noformat}
> example avro schema
> {noformat}
> protocol Test {
> fixed Bytes4(4);
> record Foo {
> union {null, Bytes4} b;
> }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17074) generate equi-height histogram for column

2017-11-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-17074:
---

Assignee: Zhenhua Wang

> generate equi-height histogram for column
> -
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.3.0
>Reporter: Ron Hu
>Assignee: Zhenhua Wang
> Fix For: 2.3.0
>
>
> An equi-height histogram is effective in handling skewed data distributions.
> For an equi-height histogram, the heights of all bins (intervals) are the same. 
> The default number of bins we use is 254.
> Now we use a two-step method to generate an equi-height histogram:
> 1. use percentile_approx to get percentiles (end points of the equi-height 
> bin intervals);
> 2. use a new aggregate function to get distinct counts in each of these bins.
> Note that this method takes two table scans. In the future we may provide 
> other algorithms which need only one table scan.
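
A rough sketch of the two-step idea above (illustrative only; {{df}} and column {{c}} are made-up names, a SparkSession {{spark}} is assumed, and the real implementation computes step 2 with a single internal aggregate rather than one query per bin):
{code}
import org.apache.spark.sql.functions._
import spark.implicits._

// Step 1: percentile_approx returns the end points of the equi-height bins
// (254 bins in the real code; 4 here to keep the example readable).
val Array(row) = df.selectExpr(
  "percentile_approx(c, array(0.25, 0.5, 0.75, 1.0)) AS ends").collect()
val ends = row.getSeq[Double](0)

// Step 2 (conceptually): a second scan counts distinct values in each bin (lo, hi].
val lows = Double.MinValue +: ends.init
val ndvPerBin = lows.zip(ends).map { case (lo, hi) =>
  df.filter($"c" > lo && $"c" <= hi).agg(countDistinct($"c")).first().getLong(0)
}
{code}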



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17074) generate equi-height histogram for column

2017-11-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17074.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19479
[https://github.com/apache/spark/pull/19479]

> generate equi-height histogram for column
> -
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.3.0
>Reporter: Ron Hu
> Fix For: 2.3.0
>
>
> An equi-height histogram is effective in handling skewed data distributions.
> For an equi-height histogram, the heights of all bins (intervals) are the same. 
> The default number of bins we use is 254.
> Now we use a two-step method to generate an equi-height histogram:
> 1. use percentile_approx to get percentiles (end points of the equi-height 
> bin intervals);
> 2. use a new aggregate function to get distinct counts in each of these bins.
> Note that this method takes two table scans. In the future we may provide 
> other algorithms which need only one table scan.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22346) Update VectorAssembler to work with Structured Streaming

2017-11-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251518#comment-16251518
 ] 

Apache Spark commented on SPARK-22346:
--

User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/19746

> Update VectorAssembler to work with Structured Streaming
> 
>
> Key: SPARK-22346
> URL: https://issues.apache.org/jira/browse/SPARK-22346
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> The issue
> In batch mode, VectorAssembler can take multiple columns of VectorType and 
> assemble and output a new column of VectorType containing the concatenated 
> vectors. In streaming mode, this transformation can fail because 
> VectorAssembler does not have enough information to produce metadata 
> (AttributeGroup) for the new column. Because VectorAssembler is such a 
> ubiquitous part of MLlib pipelines, this issue effectively means Spark 
> Structured Streaming does not support prediction using MLlib pipelines.
> I've created this ticket so we can discuss ways to potentially improve 
> VectorAssembler. Please let me know if there are any issues I have not 
> considered or potential fixes I haven't outlined. I'm happy to submit a patch 
> once I know which strategy is the best approach.
> Potential fixes
> 1) Replace VectorAssembler with an estimator/model pair like was recently 
> done with OneHotEncoder, 
> [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The 
> Estimator can "learn" the size of the inputs vectors during training and save 
> it to use during prediction.
> Pros:
> * Possibly simplest of the potential fixes
> Cons:
> * We'll need to deprecate current VectorAssembler
> 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty 
> major change, but it could be done in stages. We could first ensure that 
> metadata is not used during prediction and allow the VectorAssembler to drop 
> metadata for streaming dataframes. Going forward, it would be important to 
> not use any metadata on Vector columns for any prediction tasks.
> Pros:
> * Potentially, easy short term fix for VectorAssembler
> (drop metadata for vector columns in streaming).
> * Current Attributes implementation is also causing other issues, eg 
> [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].
> Cons:
> * To fully remove ML Attributes would be a major refactor of MLlib and would 
> most likely require breaking changes.
> * A partial removal of ML attributes (eg: ensure ML attributes are not used 
> during transform, only during fit) might be tricky. This would require 
> testing or other enforcement mechanism to prevent regressions.
> 3) Require Vector columns to have fixed length vectors. Most mllib 
> transformers that produce vectors already include the size of the vector in 
> the column metadata. This change would be to deprecate APIs that allow 
> creating a vector column of unknown length and replace those APIs with 
> equivalents that enforce a fixed size.
> Pros:
> * We already treat vectors as fixed size; for example, VectorAssembler assumes 
> the input and output cols are fixed-size vectors and creates metadata 
> accordingly. In the spirit of explicit is better than implicit, we would be 
> codifying something we already assume.
> * This could potentially enable performance optimizations that are only 
> possible if the Vector size of a column is fixed & known.
> Cons:
> * This would require breaking changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22346) Update VectorAssembler to work with Structured Streaming

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22346:


Assignee: (was: Apache Spark)

> Update VectorAssembler to work with Structured Streaming
> 
>
> Key: SPARK-22346
> URL: https://issues.apache.org/jira/browse/SPARK-22346
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> The issue
> In batch mode, VectorAssembler can take multiple columns of VectorType and 
> assemble and output a new column of VectorType containing the concatenated 
> vectors. In streaming mode, this transformation can fail because 
> VectorAssembler does not have enough information to produce metadata 
> (AttributeGroup) for the new column. Because VectorAssembler is such a 
> ubiquitous part of MLlib pipelines, this issue effectively means Spark 
> Structured Streaming does not support prediction using MLlib pipelines.
> I've created this ticket so we can discuss ways to potentially improve 
> VectorAssembler. Please let me know if there are any issues I have not 
> considered or potential fixes I haven't outlined. I'm happy to submit a patch 
> once I know which strategy is the best approach.
> Potential fixes
> 1) Replace VectorAssembler with an estimator/model pair like was recently 
> done with OneHotEncoder, 
> [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The 
> Estimator can "learn" the size of the inputs vectors during training and save 
> it to use during prediction.
> Pros:
> * Possibly simplest of the potential fixes
> Cons:
> * We'll need to deprecate current VectorAssembler
> 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty 
> major change, but it could be done in stages. We could first ensure that 
> metadata is not used during prediction and allow the VectorAssembler to drop 
> metadata for streaming dataframes. Going forward, it would be important to 
> not use any metadata on Vector columns for any prediction tasks.
> Pros:
> * Potentially, easy short term fix for VectorAssembler
> (drop metadata for vector columns in streaming).
> * Current Attributes implementation is also causing other issues, eg 
> [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].
> Cons:
> * To fully remove ML Attributes would be a major refactor of MLlib and would 
> most likely require breaking changes.
> * A partial removal of ML attributes (eg: ensure ML attributes are not used 
> during transform, only during fit) might be tricky. This would require 
> testing or other enforcement mechanism to prevent regressions.
> 3) Require Vector columns to have fixed length vectors. Most mllib 
> transformers that produce vectors already include the size of the vector in 
> the column metadata. This change would be to deprecate APIs that allow 
> creating a vector column of unknown length and replace those APIs with 
> equivalents that enforce a fixed size.
> Pros:
> * We already treat vectors as fixed size; for example, VectorAssembler assumes 
> the input and output cols are fixed-size vectors and creates metadata 
> accordingly. In the spirit of explicit is better than implicit, we would be 
> codifying something we already assume.
> * This could potentially enable performance optimizations that are only 
> possible if the Vector size of a column is fixed & known.
> Cons:
> * This would require breaking changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22346) Update VectorAssembler to work with Structured Streaming

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22346:


Assignee: Apache Spark

> Update VectorAssembler to work with Structured Streaming
> 
>
> Key: SPARK-22346
> URL: https://issues.apache.org/jira/browse/SPARK-22346
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Assignee: Apache Spark
>Priority: Critical
>
> The issue
> In batch mode, VectorAssembler can take multiple columns of VectorType and 
> assemble and output a new column of VectorType containing the concatenated 
> vectors. In streaming mode, this transformation can fail because 
> VectorAssembler does not have enough information to produce metadata 
> (AttributeGroup) for the new column. Because VectorAssembler is such a 
> ubiquitous part of MLlib pipelines, this issue effectively means Spark 
> Structured Streaming does not support prediction using MLlib pipelines.
> I've created this ticket so we can discuss ways to potentially improve 
> VectorAssembler. Please let me know if there are any issues I have not 
> considered or potential fixes I haven't outlined. I'm happy to submit a patch 
> once I know which strategy is the best approach.
> Potential fixes
> 1) Replace VectorAssembler with an estimator/model pair like was recently 
> done with OneHotEncoder, 
> [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The 
> Estimator can "learn" the size of the inputs vectors during training and save 
> it to use during prediction.
> Pros:
> * Possibly simplest of the potential fixes
> Cons:
> * We'll need to deprecate current VectorAssembler
> 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty 
> major change, but it could be done in stages. We could first ensure that 
> metadata is not used during prediction and allow the VectorAssembler to drop 
> metadata for streaming dataframes. Going forward, it would be important to 
> not use any metadata on Vector columns for any prediction tasks.
> Pros:
> * Potentially, easy short term fix for VectorAssembler
> (drop metadata for vector columns in streaming).
> * Current Attributes implementation is also causing other issues, eg 
> [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].
> Cons:
> * To fully remove ML Attributes would be a major refactor of MLlib and would 
> most likely require breaking changes.
> * A partial removal of ML attributes (eg: ensure ML attributes are not used 
> during transform, only during fit) might be tricky. This would require 
> testing or other enforcement mechanism to prevent regressions.
> 3) Require Vector columns to have fixed length vectors. Most mllib 
> transformers that produce vectors already include the size of the vector in 
> the column metadata. This change would be to deprecate APIs that allow 
> creating a vector column of unknown length and replace those APIs with 
> equivalents that enforce a fixed size.
> Pros:
> * We already treat vectors as fixed size; for example, VectorAssembler assumes 
> the input and output cols are fixed-size vectors and creates metadata 
> accordingly. In the spirit of explicit is better than implicit, we would be 
> codifying something we already assume.
> * This could potentially enable performance optimizations that are only 
> possible if the Vector size of a column is fixed & known.
> Cons:
> * This would require breaking changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2017-11-14 Thread Li Yuanjian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251457#comment-16251457
 ] 

Li Yuanjian commented on SPARK-2926:


I'm just giving a preview PR above; I'll collect more suggestions about this and 
maybe raise an SPIP vote later.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf, SortBasedShuffleReader on 
> Spark 2.x.pdf, Spark Shuffle Test Report(contd).pdf, Spark Shuffle Test 
> Report.pdf
>
>
> Currently Spark has already integrated sort-based shuffle write, which 
> greatly improves the IO performance and reduces the memory consumption when 
> the reducer number is very large. But on the reducer side, it still adopts the 
> hash-based shuffle reader implementation, which neglects the ordering 
> attributes of map output data in some situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted later, 
> once some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.
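
For readers new to the idea, a tiny self-contained sketch (plain Scala, not the actual patch) of the core merge step: each map output is already sorted by key, so the reader only needs to k-way merge the sorted streams instead of hashing or re-sorting them.
{code}
import scala.collection.mutable

// Minimal k-way merge of already-sorted iterators, the heart of a
// sort-merge style shuffle read. Illustration only.
def mergeSorted[K: Ordering, V](streams: Seq[Iterator[(K, V)]]): Iterator[(K, V)] = {
  val ord = implicitly[Ordering[K]]
  // Priority queue of buffered iterators, ordered so the smallest head key is dequeued first.
  val heap = mutable.PriorityQueue.empty[BufferedIterator[(K, V)]](
    Ordering.by[BufferedIterator[(K, V)], K](_.head._1)(ord.reverse))
  streams.map(_.buffered).filter(_.hasNext).foreach(it => heap.enqueue(it))

  new Iterator[(K, V)] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): (K, V) = {
      val it = heap.dequeue()
      val kv = it.next()
      if (it.hasNext) heap.enqueue(it)   // re-insert with its new head key
      kv
    }
  }
}

// Example: three "map outputs", each sorted by key.
val merged = mergeSorted(Seq(
  Iterator(1 -> "a", 4 -> "d"),
  Iterator(2 -> "b", 3 -> "c"),
  Iterator(5 -> "e"))).toList
// merged == List(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d", 5 -> "e")
{code}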



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2017-11-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251452#comment-16251452
 ] 

Apache Spark commented on SPARK-2926:
-

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/19745

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf, SortBasedShuffleReader on 
> Spark 2.x.pdf, Spark Shuffle Test Report(contd).pdf, Spark Shuffle Test 
> Report.pdf
>
>
> Currently Spark has already integrated sort-based shuffle write, which 
> greatly improves the IO performance and reduces the memory consumption when 
> the reducer number is very large. But on the reducer side, it still adopts the 
> hash-based shuffle reader implementation, which neglects the ordering 
> attributes of map output data in some situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted later, 
> once some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15428) Disable support for multiple streaming aggregations

2017-11-14 Thread Hristo Angelov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251407#comment-16251407
 ] 

Hristo Angelov commented on SPARK-15428:


[~tdas] Is enabling multiple aggregations in Structured Streaming something 
that is planned for future releases?

> Disable support for multiple streaming aggregations
> ---
>
> Key: SPARK-15428
> URL: https://issues.apache.org/jira/browse/SPARK-15428
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.0
>
>
> Incrementalizing plans with multiple streaming aggregations is tricky, and 
> we don't have the necessary support for "delta" to implement it correctly, so 
> we are disabling support for multiple streaming aggregations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2017-11-14 Thread Li Yuanjian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Yuanjian updated SPARK-2926:
---
Comment: was deleted

(was: The follow up work for SortShuffleReader in current master branch, detail 
test cases and benchmark description.)

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf, SortBasedShuffleReader on 
> Spark 2.x.pdf, Spark Shuffle Test Report(contd).pdf, Spark Shuffle Test 
> Report.pdf
>
>
> Currently Spark has already integrated sort-based shuffle write, which 
> greatly improves the IO performance and reduces the memory consumption when 
> the reducer number is very large. But on the reducer side, it still adopts the 
> hash-based shuffle reader implementation, which neglects the ordering 
> attributes of map output data in some situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted later, 
> once some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2017-11-14 Thread Li Yuanjian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Yuanjian updated SPARK-2926:
---
Attachment: SortBasedShuffleReader on Spark 2.x.pdf

The follow-up work for SortShuffleReader on the current master branch, with detailed 
test cases and a benchmark description.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf, SortBasedShuffleReader on 
> Spark 2.x.pdf, Spark Shuffle Test Report(contd).pdf, Spark Shuffle Test 
> Report.pdf
>
>
> Currently Spark has already integrated sort-based shuffle write, which 
> greatly improves the IO performance and reduces the memory consumption when 
> the reducer number is very large. But on the reducer side, it still adopts the 
> hash-based shuffle reader implementation, which neglects the ordering 
> attributes of map output data in some situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted later, 
> once some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2017-11-14 Thread Li Yuanjian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251398#comment-16251398
 ] 

Li Yuanjian edited comment on SPARK-2926 at 11/14/17 1:54 PM:
--

While migrating some old Hadoop jobs to Spark, I noticed this JIRA 
and the code based on Spark 1.x.

I re-implemented the old PR on top of Spark 2.1 and the current master branch. After 
reproducing some scenarios and running some benchmark tests, I found that this shuffle 
mode can bring {color:red}a 12x~30x speedup in task duration and reduce peak 
execution memory to 1/12 ~ 1/50{color} vs the current master version (see detailed 
screenshots and test data in the attached pdf). The memory reduction is especially 
notable: in this shuffle mode Spark can handle more data with less memory. The 
detailed doc attached to this jira is named "SortShuffleReader on Spark 2.x.pdf".

I know that the Dataset API will have better optimization and performance, but the RDD 
API may still be useful for flexible control and old Spark/Hadoop jobs. For the 
better performance in ordering cases and the more cost-effective memory usage, 
this PR may still be worth merging into master.

I'll clean up the current code base and open a PR soon. Any comments and any trying out 
would be greatly appreciated.


was (Author: xuanyuan):
During our work of migrating some old Hadoop job to Spark, I noticed this JIRA 
and the code based on spark 1.x.

I re-implemented the old PR based on Spark 2.1 and current master branch. After 
produced some scenario and ran some benchmark tests, I found that this shuffle 
mode can bring {color:red}12x~30x boosting in task duration and reduce peak 
execution memory to 1/12 ~ 1/50{color}(see detail screenshot and test data in 
attatched pdf) vs current master version, especially the memory reducing, in 
this shuffle mode Spark can support more data size in less memory usage. The 
detail doc attached in this jira named "SortShuffleReader on Spark 2.x.pdf".

I know that DataSet API will have better optimization and performance, but RDD 
API may still useful for flexible control and old Spark/Hadoop jobs. For the 
better performance in ordering cases and more cost-effective memory usage, 
maybe this PR is still worth to merge in to master.

I'll sort out current code base and give a PR soon. Any comments and trying out 
would be greatly appreciated.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
> Report(contd).pdf, Spark Shuffle Test Report.pdf
>
>
> Currently Spark has already integrated sort-based shuffle write, which 
> greatly improves IO performance and reduces memory consumption when the 
> number of reducers is very large. But the reducer side still adopts a 
> hash-based shuffle reader implementation, which neglects the ordering 
> attributes of the map output data in some situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted later 
> once some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2017-11-14 Thread Li Yuanjian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251398#comment-16251398
 ] 

Li Yuanjian edited comment on SPARK-2926 at 11/14/17 1:53 PM:
--

While migrating some old Hadoop jobs to Spark, I noticed this JIRA and the code 
based on Spark 1.x.

I re-implemented the old PR on top of Spark 2.1 and the current master branch. 
After reproducing some scenarios and running benchmark tests, I found that this 
shuffle mode can bring a {color:red}12x~30x speedup in task duration and reduce 
peak execution memory to 1/12 ~ 1/50{color} (see detailed screenshots and test 
data in the attached pdf) vs the current master version. The memory reduction is 
especially notable: in this shuffle mode Spark can handle more data with less 
memory. The detailed doc attached to this JIRA is named "SortShuffleReader on 
Spark 2.x.pdf".

I know the DataSet API has better optimization and performance, but the RDD API 
may still be useful for flexible control and for old Spark/Hadoop jobs. For 
better performance in ordering cases and more cost-effective memory usage, this 
PR may still be worth merging into master.

I'll sort out the current code base and open a PR soon. Any comments and trial 
runs would be greatly appreciated.


was (Author: xuanyuan):
While migrating some old Hadoop jobs to Spark, I noticed this JIRA and the code 
based on Spark 1.x.

I re-implemented the old PR on top of Spark 2.1 and the current master branch. 
After reproducing some scenarios and running benchmark tests, I found that this 
shuffle mode can bring a {color:red}12x~30x speedup in task duration and reduce 
peak execution memory to 1/12 ~ 1/50{color} vs the current master version. The 
memory reduction is especially notable: in this shuffle mode Spark can handle 
more data with less memory. The detailed doc attached to this JIRA is named 
"SortShuffleReader on Spark 2.x".

I know the DataSet API has better optimization and performance, but the RDD API 
may still be useful for flexible control and for old Spark/Hadoop jobs. For 
better performance in ordering cases and more cost-effective memory usage, this 
PR may still be worth merging into master.

I'll sort out the current code base and open a PR soon. Any comments and trial 
runs would be greatly appreciated.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
> Report(contd).pdf, Spark Shuffle Test Report.pdf
>
>
> Currently Spark has already integrated sort-based shuffle write, which 
> greatly improves IO performance and reduces memory consumption when the 
> number of reducers is very large. But the reducer side still adopts a 
> hash-based shuffle reader implementation, which neglects the ordering 
> attributes of the map output data in some situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted later 
> once some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2017-11-14 Thread Li Yuanjian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251398#comment-16251398
 ] 

Li Yuanjian commented on SPARK-2926:


While migrating some old Hadoop jobs to Spark, I noticed this JIRA and the code 
based on Spark 1.x.

I re-implemented the old PR on top of Spark 2.1 and the current master branch. 
After reproducing some scenarios and running benchmark tests, I found that this 
shuffle mode can bring a {color:red}12x~30x speedup in task duration and reduce 
peak execution memory to 1/12 ~ 1/50{color} vs the current master version. The 
memory reduction is especially notable: in this shuffle mode Spark can handle 
more data with less memory. The detailed doc attached to this JIRA is named 
"SortShuffleReader on Spark 2.x".

I know the DataSet API has better optimization and performance, but the RDD API 
may still be useful for flexible control and for old Spark/Hadoop jobs. For 
better performance in ordering cases and more cost-effective memory usage, this 
PR may still be worth merging into master.

I'll sort out the current code base and open a PR soon. Any comments and trial 
runs would be greatly appreciated.
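
For readers unfamiliar with the proposal, here is a minimal sketch of the kind of 
workload it targets, using only the public RDD API (the job shape and sizes are 
illustrative assumptions, not code from the PR):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// A sortByKey job whose reduce side must produce key-ordered output. Today the
// reducer fetches map outputs with the hash-based reader and re-sorts them all;
// the proposal is to sort map outputs by key on the map side and have the reader
// merge the sorted streams, MapReduce-style.
object SortByKeyWorkload {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sortbykey-workload"))
    val pairs = sc.parallelize(1 to 1000000, numSlices = 64).map(i => (i % 10000, i))
    val sorted = pairs.sortByKey(ascending = true, numPartitions = 32)
    println(sorted.take(5).mkString(", "))
    sc.stop()
  }
}
{code}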

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
> Report(contd).pdf, Spark Shuffle Test Report.pdf
>
>
> Currently Spark has already integrated sort-based shuffle write, which 
> greatly improves IO performance and reduces memory consumption when the 
> number of reducers is very large. But the reducer side still adopts a 
> hash-based shuffle reader implementation, which neglects the ordering 
> attributes of the map output data in some situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted later 
> once some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2017-11-14 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-21337:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
> Fix For: 2.1.1
>
> Attachments: test.JPG, test1.JPG, test2.JPG
>
>
> When there are large 'case when' expressions in Spark SQL, the CodeGenerator 
> fails to compile them.
> The error message is followed by a huge dump of generated source code, and 
> compilation ultimately fails.
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> apply_9$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V
>  of class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  grows beyond 64 KB.
> It seems that SPARK-13242 solved this problem in spark-1.6.2; however, it 
> appears in spark-2.1.1 again. 
> https://issues.apache.org/jira/browse/SPARK-13242.
> Is there something wrong?
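
For context, a hedged sketch (the branch count and column name are assumptions, 
not taken from the report) of the kind of query that can push a single generated 
method past the 64 KB limit:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

// Build a single CASE WHEN with thousands of branches; the generated
// SpecificUnsafeProjection method can then exceed the JVM's 64 KB method limit.
object LargeCaseWhenRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("large-case-when").getOrCreate()
    import spark.implicits._

    val df = spark.range(0, 100).toDF("id")
    val bigCaseWhen = (1 to 3000)
      .foldLeft(when($"id" === 0, 0)) { (expr, i) => expr.when($"id" === i, i) }
      .otherwise(-1)

    df.select(bigCaseWhen.as("bucket")).show(5)
    spark.stop()
  }
}
{code}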



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4502) Spark SQL reads unneccesary nested fields from Parquet

2017-11-14 Thread Damian Momot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251354#comment-16251354
 ] 

Damian Momot commented on SPARK-4502:
-

Well this PR is ready: https://github.com/apache/spark/pull/16578 but it still 
awaits acceptance

> Spark SQL reads unneccesary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Priority: Critical
>
> When reading a field of a nested column from Parquet, SparkSQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 
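
For illustration, the same access pattern via the DataFrame API (the path and 
schema are assumptions): without nested-field pruning, the whole User struct is 
read and assembled even though only one leaf field is needed.

{code}
import org.apache.spark.sql.SparkSession

// Only User.contributors_enabled is needed, but without nested-field pruning
// Spark reads and assembles every field of the User struct from Parquet.
object NestedFieldRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nested-field-read").getOrCreate()
    val tweets = spark.read.parquet("/data/tweets.parquet") // assumed path
    tweets.select("User.contributors_enabled").show(5)
    spark.stop()
  }
}
{code}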



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22516:
--
Priority: Minor  (was: Major)

> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as 
> last line's first character
> -
>
> Key: SPARK-22516
> URL: https://issues.apache.org/jira/browse/SPARK-22516
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>Priority: Minor
>  Labels: csvparser
> Attachments: testCommentChar.csv
>
>
> Try to read the attached CSV file with the following parse properties:
> scala> *val csvFile = spark.read.option("header","true").option("inferSchema", 
> "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]
> scala> csvFile.show
> +---+---+
> |  a|  b|
> +---+---+
> +---+---+
> {color:#8eb021}*Noticed that it works fine.*{color}
> If we add the option "multiLine" = "true", it fails with the exception below. 
> This happens only if the "comment" character equals the first character of the 
> input dataset's last line.
> scala> val csvFile = 
> *spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
>  "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End 
> of input reached
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> Header extraction enabled=null
> Headers=null
> Ignore leading whitespaces=false
> Ignore trailing whitespaces=false
> Input buffer size=128
> Input reading on separate thread=false
> Keep escape sequences=false
> Keep quotes=false
> Length of content displayed on error=-1
> Line separator detection enabled=false
> Maximum number of characters per column=-1
> Maximum number of columns=20480
> Normalize escaped line separators=true
> Null value=
> Number of records to read=all
> Processor=none
> Restricting data in exceptions=false
> RowProcessor error handler=null
> Selected fields=none
> Skip empty lines=true
> Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
> CsvFormat:
> Comment character=c
> Field delimiter=,
> Line separator (normalized)=\n
> Line separator sequence=\r\n
> Quote character="
> Quote escape character=\
> Quote escape escape character=null
> Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
> at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
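
Until this is resolved, one hedged workaround sketch (it assumes the records 
never actually span physical lines, so multiLine is not needed): strip the 
comment lines up front and parse without the comment option, sidestepping its 
interaction with multiLine.

{code}
import org.apache.spark.sql.SparkSession

// Workaround sketch: drop lines starting with the report's comment character
// ("c") ourselves, then parse the remaining lines as CSV without the "comment"
// option. Assumes the header and data rows do not begin with that character.
object CsvCommentWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-comment-workaround").getOrCreate()

    val path = "hdfs://localhost:8020/testCommentChar.csv"
    val lines = spark.read.textFile(path).filter(line => !line.startsWith("c"))
    val csv = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(lines)
    csv.show()
    spark.stop()
  }
}
{code}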
> 

[jira] [Updated] (SPARK-22518) Make default cache storage level configurable

2017-11-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22518:
--
Issue Type: Improvement  (was: Bug)

You can choose the storage level with persist(). You can of course just write 
your app to use the same level in every call you make to persist(). I don't see 
additional value in making yet another config option that alters the default 
globally. Where the storage level matters, it's typically not the case that every 
call to cache() in the app, in Spark itself, and everywhere else should use the 
same level.
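
For illustration, a minimal sketch of choosing the level explicitly per call 
rather than relying on a global default (names and sizes are assumptions):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ExplicitStorageLevel {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("explicit-storage-level"))

    val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
    // rdd.cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
    // pick a different level explicitly at each call site where it matters:
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
    println(rdd.count())

    rdd.unpersist()
    sc.stop()
  }
}
{code}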

> Make default cache storage level configurable
> -
>
> Key: SPARK-22518
> URL: https://issues.apache.org/jira/browse/SPARK-22518
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Rares Mirica
>Priority: Minor
>
> Caching defaults to the hard-coded value MEMORY_ONLY, and as most users call 
> the convenient .cache() method this value is not configurable in a global 
> way. Please make this configurable through a spark config option.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22267) Spark SQL incorrectly reads ORC file when column order is different

2017-11-14 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251305#comment-16251305
 ] 

Wenchen Fan commented on SPARK-22267:
-

[~dongjoon] will this be fixed by the new orc reader?

> Spark SQL incorrectly reads ORC file when column order is different
> ---
>
> Key: SPARK-22267
> URL: https://issues.apache.org/jira/browse/SPARK-22267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>
> For a long time, Apache Spark SQL has returned incorrect results when the ORC 
> file schema differs from the metastore schema order.
> {code}
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("parquet").mode("overwrite").save("/tmp/p")
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save("/tmp/o")
> scala> sql("CREATE EXTERNAL TABLE p(c2 INT, c1 INT) STORED AS parquet 
> LOCATION '/tmp/p'")
> scala> sql("CREATE EXTERNAL TABLE o(c2 INT, c1 INT) STORED AS orc LOCATION 
> '/tmp/o'")
> scala> spark.table("p").show  // Parquet is good.
> +---+---+
> | c2| c1|
> +---+---+
> |  2|  1|
> +---+---+
> scala> spark.table("o").show// This is wrong.
> +---+---+
> | c2| c1|
> +---+---+
> |  1|  2|
> +---+---+
> scala> spark.read.orc("/tmp/o").show  // This is correct.
> +---+---+
> | c1| c2|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> *TESTCASE*
> {code}
>   test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order 
> is different") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
>   Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save(path)
>   checkAnswer(spark.read.orc(path), Row(1, 2))
>   Seq("true", "false").foreach { value =>
> withTable("t") {
>   withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
> sql(s"CREATE EXTERNAL TABLE t(c2 INT, c1 INT) STORED AS ORC 
> LOCATION '$path'")
> checkAnswer(spark.table("t"), Row(2, 1))
>   }
> }
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17310) Disable Parquet's record-by-record filter in normal parquet reader and do it in Spark-side

2017-11-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-17310:
---

Assignee: Hyukjin Kwon

> Disable Parquet's record-by-record filter in normal parquet reader and do it 
> in Spark-side
> --
>
> Key: SPARK-17310
> URL: https://issues.apache.org/jira/browse/SPARK-17310
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.3.0
>
>
> Currently, we push filters down for the normal Parquet reader, which also 
> filters record-by-record.
> It seems Spark-side codegen-based row-by-row filtering might be faster than 
> Parquet's in general, due to the type boxing and virtual function calls that 
> Spark's implementation tries to avoid.
> Maybe we should run a benchmark and disable this. This ticket came from 
> https://github.com/apache/spark/pull/14671
> Please refer to the discussion in the PR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17310) Disable Parquet's record-by-record filter in normal parquet reader and do it in Spark-side

2017-11-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17310.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 15049
[https://github.com/apache/spark/pull/15049]

> Disable Parquet's record-by-record filter in normal parquet reader and do it 
> in Spark-side
> --
>
> Key: SPARK-17310
> URL: https://issues.apache.org/jira/browse/SPARK-17310
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
> Fix For: 2.3.0
>
>
> Currently, we push filters down for the normal Parquet reader, which also 
> filters record-by-record.
> It seems Spark-side codegen-based row-by-row filtering might be faster than 
> Parquet's in general, due to the type boxing and virtual function calls that 
> Spark's implementation tries to avoid.
> Maybe we should run a benchmark and disable this. This ticket came from 
> https://github.com/apache/spark/pull/14671
> Please refer to the discussion in the PR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22267) Spark SQL incorrectly reads ORC file when column order is different

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22267:


Assignee: (was: Apache Spark)

> Spark SQL incorrectly reads ORC file when column order is different
> ---
>
> Key: SPARK-22267
> URL: https://issues.apache.org/jira/browse/SPARK-22267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>
> For a long time, Apache Spark SQL has returned incorrect results when the ORC 
> file schema differs from the metastore schema order.
> {code}
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("parquet").mode("overwrite").save("/tmp/p")
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save("/tmp/o")
> scala> sql("CREATE EXTERNAL TABLE p(c2 INT, c1 INT) STORED AS parquet 
> LOCATION '/tmp/p'")
> scala> sql("CREATE EXTERNAL TABLE o(c2 INT, c1 INT) STORED AS orc LOCATION 
> '/tmp/o'")
> scala> spark.table("p").show  // Parquet is good.
> +---+---+
> | c2| c1|
> +---+---+
> |  2|  1|
> +---+---+
> scala> spark.table("o").show// This is wrong.
> +---+---+
> | c2| c1|
> +---+---+
> |  1|  2|
> +---+---+
> scala> spark.read.orc("/tmp/o").show  // This is correct.
> +---+---+
> | c1| c2|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> *TESTCASE*
> {code}
>   test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order 
> is different") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
>   Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save(path)
>   checkAnswer(spark.read.orc(path), Row(1, 2))
>   Seq("true", "false").foreach { value =>
> withTable("t") {
>   withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
> sql(s"CREATE EXTERNAL TABLE t(c2 INT, c1 INT) STORED AS ORC 
> LOCATION '$path'")
> checkAnswer(spark.table("t"), Row(2, 1))
>   }
> }
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22267) Spark SQL incorrectly reads ORC file when column order is different

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22267:


Assignee: Apache Spark

> Spark SQL incorrectly reads ORC file when column order is different
> ---
>
> Key: SPARK-22267
> URL: https://issues.apache.org/jira/browse/SPARK-22267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> For a long time, Apache Spark SQL has returned incorrect results when the ORC 
> file schema differs from the metastore schema order.
> {code}
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("parquet").mode("overwrite").save("/tmp/p")
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save("/tmp/o")
> scala> sql("CREATE EXTERNAL TABLE p(c2 INT, c1 INT) STORED AS parquet 
> LOCATION '/tmp/p'")
> scala> sql("CREATE EXTERNAL TABLE o(c2 INT, c1 INT) STORED AS orc LOCATION 
> '/tmp/o'")
> scala> spark.table("p").show  // Parquet is good.
> +---+---+
> | c2| c1|
> +---+---+
> |  2|  1|
> +---+---+
> scala> spark.table("o").show// This is wrong.
> +---+---+
> | c2| c1|
> +---+---+
> |  1|  2|
> +---+---+
> scala> spark.read.orc("/tmp/o").show  // This is correct.
> +---+---+
> | c1| c2|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> *TESTCASE*
> {code}
>   test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order 
> is different") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
>   Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save(path)
>   checkAnswer(spark.read.orc(path), Row(1, 2))
>   Seq("true", "false").foreach { value =>
> withTable("t") {
>   withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
> sql(s"CREATE EXTERNAL TABLE t(c2 INT, c1 INT) STORED AS ORC 
> LOCATION '$path'")
> checkAnswer(spark.table("t"), Row(2, 1))
>   }
> }
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22267) Spark SQL incorrectly reads ORC file when column order is different

2017-11-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251250#comment-16251250
 ] 

Apache Spark commented on SPARK-22267:
--

User 'mpetruska' has created a pull request for this issue:
https://github.com/apache/spark/pull/19744

> Spark SQL incorrectly reads ORC file when column order is different
> ---
>
> Key: SPARK-22267
> URL: https://issues.apache.org/jira/browse/SPARK-22267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>
> For a long time, Apache Spark SQL has returned incorrect results when the ORC 
> file schema differs from the metastore schema order.
> {code}
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("parquet").mode("overwrite").save("/tmp/p")
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save("/tmp/o")
> scala> sql("CREATE EXTERNAL TABLE p(c2 INT, c1 INT) STORED AS parquet 
> LOCATION '/tmp/p'")
> scala> sql("CREATE EXTERNAL TABLE o(c2 INT, c1 INT) STORED AS orc LOCATION 
> '/tmp/o'")
> scala> spark.table("p").show  // Parquet is good.
> +---+---+
> | c2| c1|
> +---+---+
> |  2|  1|
> +---+---+
> scala> spark.table("o").show// This is wrong.
> +---+---+
> | c2| c1|
> +---+---+
> |  1|  2|
> +---+---+
> scala> spark.read.orc("/tmp/o").show  // This is correct.
> +---+---+
> | c1| c2|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> *TESTCASE*
> {code}
>   test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order 
> is different") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
>   Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save(path)
>   checkAnswer(spark.read.orc(path), Row(1, 2))
>   Seq("true", "false").foreach { value =>
> withTable("t") {
>   withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
> sql(s"CREATE EXTERNAL TABLE t(c2 INT, c1 INT) STORED AS ORC 
> LOCATION '$path'")
> checkAnswer(spark.table("t"), Row(2, 1))
>   }
> }
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22491) union all can't execute parallel with group by

2017-11-14 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251194#comment-16251194
 ] 

Liang-Chi Hsieh commented on SPARK-22491:
-

If the aggregation is removed, there is actually no shuffle exchange, so the 
exchange coordinator won't intervene even if the config is enabled.
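
One hedged way to see this is to compare the physical plans of the two query 
shapes (table name as in the report; this is an illustrative sketch, assuming an 
active `spark` session):

{code}
// With GROUP BY each branch plans a shuffle Exchange, which is what the adaptive
// exchange coordinator hooks into; without the aggregation there is no Exchange.
spark.sql("SET spark.sql.adaptive.enabled=true")

spark.sql(
  """SELECT city_name, sum(amount) FROM temp.table_01 GROUP BY city_name
    |UNION ALL
    |SELECT city_name, sum(amount) FROM temp.table_01 GROUP BY city_name""".stripMargin
).explain()

spark.sql(
  """SELECT city_name FROM temp.table_01
    |UNION ALL
    |SELECT city_name FROM temp.table_01""".stripMargin
).explain()
{code}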

> union all can't execute parallel with group by 
> ---
>
> Key: SPARK-22491
> URL: https://issues.apache.org/jira/browse/SPARK-22491
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.2
>Reporter: cen yuhai
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> test sql
> set spark.sql.adaptive.enabled=true;
> {code}
> create table temp.test_yuhai_group 
> as 
> select 
>city_name
>,sum(amount)
>  from  temp.table_01 
>  group by city_name
>  union all
>  select 
>city_name
>,sum(amount)
>  from  temp.table_01  
>  group by city_name
>  union all
>  select city_name
>,sum(amount)
>  from  temp.table_01
>  group by city_name
> {code}
> If I remove the group by, it is OK:
> {code}
> create table temp.test_yuhai_group 
> as 
> select 
>city_name
>  from  temp.table_01 
>  union all
>  select 
>city_name
>  from  temp.table_01  
>  union all
>  select city_name
>  from  temp.table_01
> {code}
> In the snapshot, the first time I execute this SQL, it runs 3 jobs one by one 
> (200 tasks per job).
> If I remove the group by, it runs just one job (600 tasks per job).
> workaround:
> set spark.sql.adaptive.enabled=false;



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22518) Make default cache storage level configurable

2017-11-14 Thread Rares Mirica (JIRA)
Rares Mirica created SPARK-22518:


 Summary: Make default cache storage level configurable
 Key: SPARK-22518
 URL: https://issues.apache.org/jira/browse/SPARK-22518
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Rares Mirica
Priority: Minor


Caching defaults to the hard-coded value MEMORY_ONLY, and as most users call 
the convenient .cache() method this value is not configurable in a global way. 
Please make this configurable through a spark config option.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22517) NullPointerException in ShuffleExternalSorter.spill()

2017-11-14 Thread Andreas Maier (JIRA)
Andreas Maier created SPARK-22517:
-

 Summary: NullPointerException in ShuffleExternalSorter.spill()
 Key: SPARK-22517
 URL: https://issues.apache.org/jira/browse/SPARK-22517
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Andreas Maier


I see a NullPointerException during sorting with the following stacktrace:
{code}
17/11/13 15:02:56 ERROR Executor: Exception in task 138.0 in stage 9.0 (TID 
13497)
java.lang.NullPointerException
at 
org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:193)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
at 
org.apache.spark.shuffle.sort.ShuffleInMemorySorter.reset(ShuffleInMemorySorter.java:100)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:256)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.growPointerArrayIfNecessary(ShuffleExternalSorter.java:328)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:379)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22515) Estimation relation size based on numRows * rowSize

2017-11-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22515:


Assignee: Apache Spark

> Estimation relation size based on numRows * rowSize
> ---
>
> Key: SPARK-22515
> URL: https://issues.apache.org/jira/browse/SPARK-22515
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>
> Currently, relation size is computed as the sum of file sizes, which is 
> error-prone because storage formats like Parquet may have a much smaller file 
> size than the in-memory size. When we choose a broadcast join based on file 
> size, there is a risk of OOM. But if the number of rows is available in the 
> statistics, we can get a better estimate from `numRows * rowSize`, which 
> helps to alleviate this problem.
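
As a hedged illustration of the proposed estimate (the helper name and numbers 
are illustrative, not Spark internals):

{code}
// Prefer numRows * rowSize when a row count statistic is available;
// otherwise fall back to the on-disk file size.
def estimateRelationSize(fileSizeInBytes: BigInt,
                         numRows: Option[BigInt],
                         avgRowSizeInBytes: BigInt): BigInt =
  numRows.map(_ * avgRowSizeInBytes).getOrElse(fileSizeInBytes)

// e.g. a compressed Parquet relation: 100 MB on disk, but 10M rows * 200 B is
// roughly 2 GB in memory, pushing it past a broadcast-join threshold that the
// file-size estimate alone would let it pass.
val estimated = estimateRelationSize(BigInt(100L << 20), Some(BigInt(10000000L)), BigInt(200))
{code}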



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


