[jira] [Commented] (SPARK-20496) KafkaWriter Uses Unanalyzed Logical Plan

2017-04-28 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988953#comment-15988953
 ] 

Bill Chambers commented on SPARK-20496:
---

This should probably be backported too.

> KafkaWriter Uses Unanalyzed Logical Plan
> 
>
> Key: SPARK-20496
> URL: https://issues.apache.org/jira/browse/SPARK-20496
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Bill Chambers
>
> Right now we use the unanalyzed logical plan for writing to Kafka; we should 
> use the analyzed plan.
> https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L50
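
The difference is easy to see from a Dataset's QueryExecution. A minimal, hedged sketch (assuming a SparkSession named `spark`; this is an illustration, not the actual patch):

{code}
// Attributes are only resolved during analysis, so validating the Kafka columns
// (topic/key/value) against the raw logical plan can miss or misreport them.
val df = spark.range(10).selectExpr("CAST(id AS STRING) AS value")
println(df.queryExecution.logical.treeString)   // unanalyzed plan: expressions may be unresolved
println(df.queryExecution.analyzed.treeString)  // analyzed plan: what the writer should validate against
{code}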



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20496) KafkaWriter Uses Unanalyzed Logical Plan

2017-04-28 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-20496:
--
Affects Version/s: 2.1.0

> KafkaWriter Uses Unanalyzed Logical Plan
> 
>
> Key: SPARK-20496
> URL: https://issues.apache.org/jira/browse/SPARK-20496
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Bill Chambers
>
> Right now we use the unanalyzed logical plan for writing to Kafka; we should 
> use the analyzed plan.
> https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L50



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20496) KafkaWriter Uses Unanalyzed Logical Plan

2017-04-27 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-20496:
--
Description: 
Right now we use the unanalyzed logical plan for writing to Kafka; we should 
use the analyzed plan.

https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L50

  was:Right now we use the unanalyzed logical plan for writing to Kafka, we 
should use the analyzed plan.


> KafkaWriter Uses Unanalyzed Logical Plan
> 
>
> Key: SPARK-20496
> URL: https://issues.apache.org/jira/browse/SPARK-20496
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bill Chambers
>
> Right now we use the unanalyzed logical plan for writing to Kafka; we should 
> use the analyzed plan.
> https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L50



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20496) KafkaWriter Uses Unanalyzed Logical Plan

2017-04-27 Thread Bill Chambers (JIRA)
Bill Chambers created SPARK-20496:
-

 Summary: KafkaWriter Uses Unanalyzed Logical Plan
 Key: SPARK-20496
 URL: https://issues.apache.org/jira/browse/SPARK-20496
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Bill Chambers


Right now we use the unanalyzed logical plan for writing to Kafka; we should 
use the analyzed plan.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20400) Remove References to Third Party Vendors from Spark ASF Documentation

2017-04-19 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976102#comment-15976102
 ] 

Bill Chambers commented on SPARK-20400:
---

I'd like to see what others have to say; maybe this isn't a big deal. But it 
does seem like a fairly explicit vendor reference.

I cede the discussion to the community; I'm happy either way but wanted to 
mention it.

> Remove References to Third Party Vendors from Spark ASF Documentation
> -
>
> Key: SPARK-20400
> URL: https://issues.apache.org/jira/browse/SPARK-20400
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
> Fix For: 2.3.0
>
>
> Similar to SPARK-17445, vendors should probably not be referenced on the ASF 
> documentation.
> Related:
> https://github.com/apache/spark/commit/dc0a4c916151c795dc41b5714e9d23b4937f4636



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20400) Remove References to Third Party Vendors from Spark ASF Documentation

2017-04-19 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-20400:
--
Description: 
Similar to SPARK-17445, vendors should probably not be referenced on the ASF 
documentation.

Related:
https://github.com/apache/spark/commit/dc0a4c916151c795dc41b5714e9d23b4937f4636

> Remove References to Third Party Vendors from Spark ASF Documentation
> -
>
> Key: SPARK-20400
> URL: https://issues.apache.org/jira/browse/SPARK-20400
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
> Fix For: 2.3.0
>
>
> Similar to SPARK-17445, vendors should probably not be referenced on the ASF 
> documentation.
> Related:
> https://github.com/apache/spark/commit/dc0a4c916151c795dc41b5714e9d23b4937f4636



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20400) Remove References to Third Party Vendors from Spark ASF Documentation

2017-04-19 Thread Bill Chambers (JIRA)
Bill Chambers created SPARK-20400:
-

 Summary: Remove References to Third Party Vendors from Spark ASF 
Documentation
 Key: SPARK-20400
 URL: https://issues.apache.org/jira/browse/SPARK-20400
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.1.0
Reporter: Bill Chambers
 Fix For: 2.3.0






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-27 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886109#comment-15886109
 ] 

Bill Chambers commented on SPARK-19714:
---

Agree with your first and second paragraphs.

Regarding the third, it's certainly worth a discussion, but it's a pretty big 
departure from the current definition, which is worrisome.

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect this to handle the invalid values; however, it fails:
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-24 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883125#comment-15883125
 ] 

Bill Chambers edited comment on SPARK-19714 at 2/24/17 5:15 PM:


The thing is QuantileDiscretizer and Bucketizer do fundamentally different 
things so there are different use cases there (quantiles vs actual values). 
It's more of a nuisance than anything and an unclear parameter that seems to 
imply things that are not actually the case.

Here's where it *really* falls apart: if I have a bucket and I provide one 
split, how many buckets do I have?

In Bucketizer I have none! That makes little sense. Splits is not the correct 
word here either because they aren't splits! They're bucket boundaries. I think 
this is more than a documentation issue, even though those aren't very clear 
themselves.

> Parameter for mapping continuous features into buckets. With n+1 splits, 
> there are n buckets. A bucket defined by splits x,y holds values in the range 
> [x,y) except the last bucket, which also includes y. Splits should be of 
> length greater than or equal to 3 and strictly increasing. Values at -inf, 
> inf must be explicitly provided to cover all Double values; otherwise, values 
> outside the splits specified will be treated as errors.

I also realize I'm being a pain here :) and that this stuff is always 
difficult. I empathize with that, it's just that this method doesn't seem to 
use correct terminology or a conceptually relevant implementation for what it 
aims to do.
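
A tiny, hedged illustration of the counting rule quoted above (relying only on the documented "n+1 splits, n buckets" and "length >= 3" constraints):

{code}
import org.apache.spark.ml.feature.Bucketizer

// 3 splits -> 2 buckets: [5.0, 10.0) and [10.0, 15.0]
new Bucketizer().setSplits(Array(5.0, 10.0, 15.0))

// A single boundary would nominally give zero buckets, and is invalid anyway
// because splits must have length >= 3:
// new Bucketizer().setSplits(Array(5.0))
{code}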


was (Author: bill_chambers):
The thing is QuantileDiscretizer and Bucketizer do fundamentally different 
things so there are different use cases there (quantiles vs actual values). 
It's more of a nuisance than anything and an unclear parameter that seems to 
imply things that are not actually the case.

Here's where it *really* falls apart, if I have a bucket and I provide one 
split, how many buckets do I have?

In Bucketizer I have none! That makes no sense. Splits is not the correct word 
here either because they aren't splits! They're bounds or containers or buckets 
themselves. I think this is more than a documentation issue, even though those 
aren't very clear themselves.

> Parameter for mapping continuous features into buckets. With n+1 splits, 
> there are n buckets. A bucket defined by splits x,y holds values in the range 
> [x,y) except the last bucket, which also includes y. Splits should be of 
> length greater than or equal to 3 and strictly increasing. Values at -inf, 
> inf must be explicitly provided to cover all Double values; otherwise, values 
> outside the splits specified will be treated as errors.



> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect this to handle the invalid values; however, it fails:
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-24 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883125#comment-15883125
 ] 

Bill Chambers commented on SPARK-19714:
---

The thing is QuantileDiscretizer and Bucketizer do fundamentally different 
things so there are different use cases there (quantiles vs actual values). 
It's more of a nuisance than anything and an unclear parameter that seems to 
imply things that are not actually the case.

Here's where it *really* falls apart: if I have a bucket and I provide one 
split, how many buckets do I have?

In Bucketizer I have none! That makes no sense. Splits is not the correct word 
here either because they aren't splits! They're bounds or containers or buckets 
themselves. I think this is more than a documentation issue, even though those 
aren't very clear themselves.

> Parameter for mapping continuous features into buckets. With n+1 splits, 
> there are n buckets. A bucket defined by splits x,y holds values in the range 
> [x,y) except the last bucket, which also includes y. Splits should be of 
> length greater than or equal to 3 and strictly increasing. Values at -inf, 
> inf must be explicitly provided to cover all Double values; otherwise, values 
> outside the splits specified will be treated as errors.



> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect this to handle the invalid values; however, it fails:
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-23 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881676#comment-15881676
 ] 

Bill Chambers commented on SPARK-19714:
---

"Invalid" is a poor descriptor IMO. Invalid should be defined as "not defined 
in this range". If it's null, why isn't it just "handleNull" or something since 
it only applies to null/missing values?

A doc update would definitely help. I've got my own opinions about how this 
should work, but I'll leave it up to you. I'd be curious if anyone else has 
thoughts; maybe I'm the only one, in which case... whatever :)

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect this to handle the invalid values; however, it fails:
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-23 Thread Bill Chambers (JIRA)
Bill Chambers created SPARK-19714:
-

 Summary: Bucketizer Bug Regarding Handling Unbucketed Inputs
 Key: SPARK-19714
 URL: https://issues.apache.org/jira/browse/SPARK-19714
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 2.1.0
Reporter: Bill Chambers


{code}
contDF = spark.range(500).selectExpr("cast(id as double) as id")
import org.apache.spark.ml.feature.Bucketizer

val splits = Array(5.0, 10.0, 250.0, 500.0)

val bucketer = new Bucketizer()
  .setSplits(splits)
  .setInputCol("id")
  .setHandleInvalid("skip")

bucketer.transform(contDF).show()
{code}

You would expect this to handle the invalid values; however, it fails:
{code}
Caused by: org.apache.spark.SparkException: Feature value 0.0 out of Bucketizer 
bounds [5.0, 500.0].  Check your features, or loosen the lower/upper bound 
constraints.
{code} 
It seems strange that handleInvalid doesn't actually handle invalid inputs.

Thoughts anyone?
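
A hedged sketch of the usual workaround in the meantime: cover the whole Double range with explicit -inf/inf boundaries so out-of-range values land in the outer buckets instead of throwing. (As far as I can tell, handleInvalid only governs NaN/missing values here, not out-of-range ones; the output column name below is just illustrative, and `spark` is the usual SparkSession.)

{code}
import org.apache.spark.ml.feature.Bucketizer

val contDF = spark.range(500).selectExpr("cast(id as double) as id")
val splits = Array(Double.NegativeInfinity, 5.0, 10.0, 250.0, 500.0, Double.PositiveInfinity)

val bucketer = new Bucketizer()
  .setSplits(splits)
  .setInputCol("id")
  .setOutputCol("bucket")

// 0.0 now falls into the (-inf, 5.0) bucket instead of raising the SparkException above.
bucketer.transform(contDF).show()
{code}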



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19127) Inconsistencies in dense_rank and rank documentation

2017-01-08 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15809726#comment-15809726
 ] 

Bill Chambers commented on SPARK-19127:
---

https://github.com/apache/spark/pull/16505

> Inconsistencies in dense_rank and rank documentation
> 
>
> Key: SPARK-19127
> URL: https://issues.apache.org/jira/browse/SPARK-19127
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Priority: Minor
>
> The docs were not updated during the change from things like denseRank to 
> dense_rank.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19126) Join Documentation Improvements

2017-01-08 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-19126:
--
Priority: Minor  (was: Major)

> Join Documentation Improvements
> ---
>
> Key: SPARK-19126
> URL: https://issues.apache.org/jira/browse/SPARK-19126
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Priority: Minor
>
> - Some join types are missing (no mention of anti join)
> - Joins are labelled inconsistently both within each language and between 
> languages.
> - Update according to new join spec for `crossJoin`
> Pull request coming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19127) Inconsistencies in dense_rank and rank documentation

2017-01-08 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-19127:
--
Priority: Minor  (was: Major)

> Inconsistencies in dense_rank and rank documentation
> 
>
> Key: SPARK-19127
> URL: https://issues.apache.org/jira/browse/SPARK-19127
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Priority: Minor
>
> The docs were not updated during the change from things like denseRank to 
> dense_rank.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19127) Inconsistencies in dense_rank and rank documentation

2017-01-08 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-19127:
--
Summary: Inconsistencies in dense_rank and rank documentation  (was: Errors 
in Window Functions Documentation)

> Inconsistencies in dense_rank and rank documentation
> 
>
> Key: SPARK-19127
> URL: https://issues.apache.org/jira/browse/SPARK-19127
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>
> The docs were not updated during the change from things like denseRank to 
> dense_rank.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19127) Errors in Window Functions Documentation

2017-01-08 Thread Bill Chambers (JIRA)
Bill Chambers created SPARK-19127:
-

 Summary: Errors in Window Functions Documentation
 Key: SPARK-19127
 URL: https://issues.apache.org/jira/browse/SPARK-19127
 Project: Spark
  Issue Type: Improvement
Reporter: Bill Chambers


The docs were not updated during the change from things like denseRank to 
dense_rank.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19127) Errors in Window Functions Documentation

2017-01-08 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15809697#comment-15809697
 ] 

Bill Chambers commented on SPARK-19127:
---

PR coming


> Errors in Window Functions Documentation
> 
>
> Key: SPARK-19127
> URL: https://issues.apache.org/jira/browse/SPARK-19127
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>
> The docs were not updated during the change from things like denseRank to 
> dense_rank.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19126) Join Documentation Improvements

2017-01-08 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15809690#comment-15809690
 ] 

Bill Chambers commented on SPARK-19126:
---

PR Ready: https://github.com/apache/spark/pull/16504

> Join Documentation Improvements
> ---
>
> Key: SPARK-19126
> URL: https://issues.apache.org/jira/browse/SPARK-19126
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>
> - Some join types are missing (no mention of anti join)
> - Joins are labelled inconsistently both within each language and between 
> languages.
> - Update according to new join spec for `crossJoin`
> Pull request coming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19126) Join Documentation Improvements

2017-01-08 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-19126:
--
Description: 
- Some join types are missing (no mention of anti join)
- Joins are labelled inconsistently both within each language and between 
languages.
- Update according to new join spec for `crossJoin`


Pull request coming...
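
For reference, a hedged sketch of the two join flavours the bullets above call out (assumes a SparkSession named `spark`; df1/df2 are throwaway frames):

{code}
import spark.implicits._

val df1 = Seq((1, "a"), (2, "b")).toDF("id", "x")
val df2 = Seq((2, "c")).toDF("id", "y")

df1.join(df2, Seq("id"), "left_anti").show()  // anti join: rows of df1 with no match in df2
df1.crossJoin(df2).show()                     // explicit cross join via the crossJoin API
{code}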


  was:
- Some join types are missing (no mention of anti join)
- Joins are labelled inconsistently both within each language and between 
languages.


Pull request coming...



> Join Documentation Improvements
> ---
>
> Key: SPARK-19126
> URL: https://issues.apache.org/jira/browse/SPARK-19126
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>
> - Some join types are missing (no mention of anti join)
> - Joins are labelled inconsistently both within each language and between 
> languages.
> - Update according to new join spec for `crossJoin`
> Pull request coming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19126) Join Documentation Improvements

2017-01-08 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-19126:
--
Description: 
- Some join types are missing (no mention of anti join)
- Joins are labelled inconsistently both within each language and between 
languages.


Pull request coming...


  was:
- Some join types are missing or inconsistent (no mention of anti join)


Pull request coming...



> Join Documentation Improvements
> ---
>
> Key: SPARK-19126
> URL: https://issues.apache.org/jira/browse/SPARK-19126
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>
> - Some join types are missing (no mention of anti join)
> - Joins are labelled inconsistently both within each language and between 
> languages.
> Pull request coming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19126) Join Documentation Improvements

2017-01-08 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-19126:
--
Description: 
- Some join types are missing or inconsistent (no mention of anti join)


Pull request coming...


  was:
Pull request coming...
Some join types are missing or inconsistent.

Summary: Join Documentation Improvements  (was: Join Documentation 
Incomplete)

> Join Documentation Improvements
> ---
>
> Key: SPARK-19126
> URL: https://issues.apache.org/jira/browse/SPARK-19126
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>
> - Some join types are missing or inconsistent (no mention of anti join)
> Pull request coming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19126) Join Documentation Incomplete

2017-01-08 Thread Bill Chambers (JIRA)
Bill Chambers created SPARK-19126:
-

 Summary: Join Documentation Incomplete
 Key: SPARK-19126
 URL: https://issues.apache.org/jira/browse/SPARK-19126
 Project: Spark
  Issue Type: Improvement
Reporter: Bill Chambers


Pull request coming...
Some join types are missing or inconsistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18424) Single Function for Parsing Dates and Times with Formats

2016-11-16 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers resolved SPARK-18424.
---
Resolution: Duplicate

This is a duplicate of SPARK-16609. Work will continue there.

> Single Function for Parsing Dates and Times with Formats
> 
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Assignee: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates thus far in Spark, it can 
> be hard to reason about the timeformat and what type you're working with, for 
> instance:
> say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions.
> {code}
>   to_date(
> unix_timestamp(col("date"), dateFormat)
> .cast("timestamp"))
>.alias("date")
> {code}
> I propose simplifying this by adding a to_date function (exists) but adding 
> one that accepts a format for that date. I also propose a to_timestamp 
> function that also supports a format.
> so that you can avoid entirely the above conversion.
> It's also worth mentioning that many other databases support this. For 
> instance, mysql has the STR_TO_DATE function, netezza supports the 
> to_timestamp semantic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16609) Single function for parsing timestamps/dates

2016-11-16 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671632#comment-15671632
 ] 

Bill Chambers commented on SPARK-16609:
---

I am working on this.

> Single function for parsing timestamps/dates
> 
>
> Key: SPARK-16609
> URL: https://issues.apache.org/jira/browse/SPARK-16609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Reynold Xin
>
> Today, if you want to parse a date or timestamp, you have to use the unix 
> time function and then cast to a timestamp. It's a little odd there isn't a 
> single function that does both. I propose we add
> {code}
> to_date(<value>, <format>) / to_timestamp(<value>, <format>)
> {code}
> For reference, in other systems there are:
> MS SQL: {{convert(<type>, <value>)}}. See: 
> https://technet.microsoft.com/en-us/library/ms174450(v=sql.110).aspx
> Netezza: {{to_timestamp(<value>, <format>)}}. See: 
> https://www.ibm.com/support/knowledgecenter/SSULQD_7.0.3/com.ibm.nz.dbu.doc/r_dbuser_ntz_sql_extns_conversion_funcs.html
> Teradata has special casting functionality: {{cast(<value> as timestamp 
> format '<format>')}}
> MySql: {{STR_TO_DATE(<value>, <format>)}}. This returns a datetime when you 
> define both date and time parts. See: 
> https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18424) Single Function for Parsing Dates and Times with Formats

2016-11-16 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-18424:
--
Summary: Single Function for Parsing Dates and Times with Formats  (was: 
Single Funct)

> Single Function for Parsing Dates and Times with Formats
> 
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Assignee: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates thus far in Spark, it can 
> be hard to reason about the timeformat and what type you're working with, for 
> instance:
> say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions.
> {code}
>   to_date(
> unix_timestamp(col("date"), dateFormat)
> .cast("timestamp"))
>.alias("date")
> {code}
> I propose simplifying this by adding a to_date function (exists) but adding 
> one that accepts a format for that date. I also propose a to_timestamp 
> function that also supports a format.
> so that you can avoid entirely the above conversion.
> It's also worth mentioning that many other databases support this. For 
> instance, mysql has the STR_TO_DATE function, netezza supports the 
> to_timestamp semantic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18424) Single Funct

2016-11-16 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-18424:
--
Summary: Single Funct  (was: Improve Date Parsing Semantics & Functionality)

> Single Funct
> 
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Assignee: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates thus far in Spark, it can 
> be hard to reason about the timeformat and what type you're working with, for 
> instance:
> say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions.
> {code}
>   to_date(
> unix_timestamp(col("date"), dateFormat)
> .cast("timestamp"))
>.alias("date")
> {code}
> I propose simplifying this by adding a to_date function (exists) but adding 
> one that accepts a format for that date. I also propose a to_timestamp 
> function that also supports a format.
> so that you can avoid entirely the above conversion.
> It's also worth mentioning that many other databases support this. For 
> instance, mysql has the STR_TO_DATE function, netezza supports the 
> to_timestamp semantic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18424) Improve Date Parsing Semantics & Functionality

2016-11-15 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-18424:
--
Summary: Improve Date Parsing Semantics & Functionality  (was: Improve Date 
Parsing Functionality)

> Improve Date Parsing Semantics & Functionality
> --
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Assignee: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates thus far in Spark, it can 
> be hard to reason about the timeformat and what type you're working with, for 
> instance:
> say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions.
> {code}
>   to_date(
> unix_timestamp(col("date"), dateFormat)
> .cast("timestamp"))
>.alias("date")
> {code}
> I propose simplifying this by adding a to_date function (exists) but adding 
> one that accepts a format for that date. I also propose a to_timestamp 
> function that also supports a format.
> so that you can avoid entirely the above conversion.
> It's also worth mentioning that many other databases support this. For 
> instance, mysql has the STR_TO_DATE function, netezza supports the 
> to_timestamp semantic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18424) Improve Date Parsing Functionality

2016-11-12 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15660291#comment-15660291
 ] 

Bill Chambers edited comment on SPARK-18424 at 11/12/16 10:09 PM:
--

For the record I would like to work on this one.

Define Function here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

Register Function here:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala


Add tests:
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala


was (Author: bill_chambers):
For the record I would like to work on this one.

Define Function here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

Register Function here:
?


Add tests:
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala

> Improve Date Parsing Functionality
> --
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates thus far in Spark, it can 
> be hard to reason about the timeformat and what type you're working with, for 
> instance:
> say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions.
> {code}
>   to_date(
> unix_timestamp(col("date"), dateFormat)
> .cast("timestamp"))
>.alias("date")
> {code}
> I propose simplifying this by adding a to_date function (exists) but adding 
> one that accepts a format for that date. I also propose a to_timestamp 
> function that also supports a format.
> so that you can avoid entirely the above conversion.
> It's also worth mentioning that many other databases support this. For 
> instance, mysql has the STR_TO_DATE function, netezza supports the 
> to_timestamp semantic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18424) Improve Date Parsing Functionality

2016-11-12 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15660291#comment-15660291
 ] 

Bill Chambers edited comment on SPARK-18424 at 11/12/16 9:31 PM:
-

For the record I would like to work on this one.

Define Function here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

Register Function here:
?


Add tests:
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala


was (Author: bill_chambers):
For the record I would like to work on this one.

It seems that I will have to add some tests:
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala

> Improve Date Parsing Functionality
> --
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates thus far in Spark, it can 
> be hard to reason about the timeformat and what type you're working with, for 
> instance:
> say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions.
> {code}
>   to_date(
> unix_timestamp(col("date"), dateFormat)
> .cast("timestamp"))
>.alias("date")
> {code}
> I propose simplifying this by adding a to_date function (exists) but adding 
> one that accepts a format for that date. I also propose a to_timestamp 
> function that also supports a format.
> so that you can avoid entirely the above conversion.
> It's also worth mentioning that many other databases support this. For 
> instance, mysql has the STR_TO_DATE function, netezza supports the 
> to_timestamp semantic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18424) Improve Date Parsing Functionality

2016-11-12 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15660291#comment-15660291
 ] 

Bill Chambers edited comment on SPARK-18424 at 11/12/16 9:30 PM:
-

For the record I would like to work on this one.

It seems that I will have to add some tests:
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala


was (Author: bill_chambers):
For the record I would like to work on this one.

> Improve Date Parsing Functionality
> --
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates thus far in Spark, it can 
> be hard to reason about the timeformat and what type you're working with, for 
> instance:
> say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions.
> {code}
>   to_date(
> unix_timestamp(col("date"), dateFormat)
> .cast("timestamp"))
>.alias("date")
> {code}
> I propose simplifying this by adding a to_date function (exists) but adding 
> one that accepts a format for that date. I also propose a to_timestamp 
> function that also supports a format.
> so that you can avoid entirely the above conversion.
> It's also worth mentioning that many other databases support this. For 
> instance, mysql has the STR_TO_DATE function, netezza supports the 
> to_timestamp semantic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18424) Improve Date Parsing Functionality

2016-11-12 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15660291#comment-15660291
 ] 

Bill Chambers commented on SPARK-18424:
---

For the record I would like to work on this one.

> Improve Date Parsing Functionality
> --
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates thus far in Spark, it can 
> be hard to reason about the timeformat and what type you're working with, for 
> instance:
> say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions.
> {code}
>   to_date(
> unix_timestamp(col("date"), dateFormat)
> .cast("timestamp"))
>.alias("date")
> {code}
> I propose simplifying this by adding a to_date function (exists) but adding 
> one that accepts a format for that date. I also propose a to_timestamp 
> function that also supports a format.
> so that you can avoid entirely the above conversion.
> It's also worth mentioning that many other databases support this. For 
> instance, mysql has the STR_TO_DATE function, netezza supports the 
> to_timestamp semantic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18424) Improve Date Parsing Functionality

2016-11-12 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-18424:
--
Summary: Improve Date Parsing Functionality  (was: Cumbersome Date 
Manipulation)

> Improve Date Parsing Functionality
> --
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates thus far in Spark, it can 
> be hard to reason about the timeformat and what type you're working with, for 
> instance:
> say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions.
> {code}
>   to_date(
> unix_timestamp(col("date"), dateFormat)
> .cast("timestamp"))
>.alias("date")
> {code}
> I propose simplifying this by adding a to_date function (exists) but adding 
> one that accepts a format for that date. so that you can avoid entirely the 
> above conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18424) Improve Date Parsing Functionality

2016-11-12 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-18424:
--
Description: 
I've found it quite cumbersome to work with dates thus far in Spark; it can be 
hard to reason about the time format and what type you're working with. For 
instance, say that I have a date in the format

{code}
2017-20-12
// Y-D-M
{code}

In order to parse that into a Date, I have to perform several conversions.
{code}
to_date(
  unix_timestamp(col("date"), dateFormat)
    .cast("timestamp"))
  .alias("date")
{code}

I propose simplifying this by adding an overload of the existing to_date 
function that accepts a format for that date, and a to_timestamp function that 
also supports a format, so that you can avoid the above conversion entirely.

It's also worth mentioning that many other databases support this. For 
instance, MySQL has the STR_TO_DATE function, and Netezza supports the 
to_timestamp semantics.
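
A hedged sketch contrasting the current workaround with the proposed overloads (the pattern "yyyy-dd-MM" is just my reading of the Y-D-M example above):

{code}
import org.apache.spark.sql.functions._

val dateFormat = "yyyy-dd-MM"

// Today: unix_timestamp + cast + to_date
val parsed = to_date(unix_timestamp(col("date"), dateFormat).cast("timestamp")).alias("date")

// Proposed, roughly: format-aware overloads so the cast chain goes away
//   to_date(col("date"), dateFormat)
//   to_timestamp(col("date"), dateFormat)
{code}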

  was:
I've found it quite cumbersome to work with dates thus far in Spark, it can be 
hard to reason about the timeformat and what type you're working with, for 
instance:

say that I have a date in the format

{code}
2017-20-12
// Y-D-M
{code}

In order to parse that into a Date, I have to perform several conversions.
{code}
  to_date(
unix_timestamp(col("date"), dateFormat)
.cast("timestamp"))
   .alias("date")
{code}

I propose simplifying this by adding a to_date function (exists) but adding one 
that accepts a format for that date. so that you can avoid entirely the above 
conversion.


> Improve Date Parsing Functionality
> --
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates thus far in Spark, it can 
> be hard to reason about the timeformat and what type you're working with, for 
> instance:
> say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions.
> {code}
>   to_date(
> unix_timestamp(col("date"), dateFormat)
> .cast("timestamp"))
>.alias("date")
> {code}
> I propose simplifying this by adding a to_date function (exists) but adding 
> one that accepts a format for that date. I also propose a to_timestamp 
> function that also supports a format.
> so that you can avoid entirely the above conversion.
> It's also worth mentioning that many other databases support this. For 
> instance, mysql has the STR_TO_DATE function, netezza supports the 
> to_timestamp semantic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18424) Cumbersome Date Manipulation

2016-11-12 Thread Bill Chambers (JIRA)
Bill Chambers created SPARK-18424:
-

 Summary: Cumbersome Date Manipulation
 Key: SPARK-18424
 URL: https://issues.apache.org/jira/browse/SPARK-18424
 Project: Spark
  Issue Type: Improvement
Reporter: Bill Chambers
Priority: Minor


I've found it quite cumbersome to work with dates thus far in Spark; it can be 
hard to reason about the time format and what type you're working with. For 
instance, say that I have a date in the format

{code}
2017-20-12
// Y-D-M
{code}

In order to parse that into a Date, I have to perform several conversions.
{code}
to_date(
  unix_timestamp(col("date"), dateFormat)
    .cast("timestamp"))
  .alias("date")
{code}

I propose simplifying this by adding an overload of the existing to_date 
function that accepts a format for that date, so that you can avoid the above 
conversion entirely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18365) Improve Documentation for Sample Methods

2016-11-12 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-18365:
--
Description: The documentation for sample is a little unintuitive. It was 
difficult to understand why I wasn't getting exactly the fraction specified of 
my total DataFrame rows. The PR clarifies the documentation for  Scala, Python, 
and R to explain that that is expected behavior.  (was: The parameter 
documentation is switched.

PR coming shortly.)

> Improve Documentation for Sample Methods
> 
>
> Key: SPARK-18365
> URL: https://issues.apache.org/jira/browse/SPARK-18365
> Project: Spark
>  Issue Type: Bug
>Reporter: Bill Chambers
>
> The documentation for sample is a little unintuitive. It was difficult to 
> understand why I wasn't getting exactly the fraction specified of my total 
> DataFrame rows. The PR clarifies the documentation for  Scala, Python, and R 
> to explain that that is expected behavior.
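
A hedged illustration of the behaviour being documented (assuming a SparkSession named `spark`): sample() draws each row independently, so the result size only approximates fraction * count rather than matching it exactly.

{code}
val df = spark.range(1000).toDF("id")
// Roughly 100 rows, but rarely exactly 100.
println(df.sample(withReplacement = false, fraction = 0.1, seed = 7L).count())
{code}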



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18365) Improve Documentation for Sample Methods

2016-11-12 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-18365:
--
Summary: Improve Documentation for Sample Methods  (was: Improve 
Documentation for Sample Method)

> Improve Documentation for Sample Methods
> 
>
> Key: SPARK-18365
> URL: https://issues.apache.org/jira/browse/SPARK-18365
> Project: Spark
>  Issue Type: Bug
>Reporter: Bill Chambers
>
> The parameter documentation is switched.
> PR coming shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18365) Improve Documentation for Sample Method

2016-11-08 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-18365:
--
Summary: Improve Documentation for Sample Method  (was: Documentation for 
Sampling is Incorrect)

> Improve Documentation for Sample Method
> ---
>
> Key: SPARK-18365
> URL: https://issues.apache.org/jira/browse/SPARK-18365
> Project: Spark
>  Issue Type: Bug
>Reporter: Bill Chambers
>
> The parameter documentation is switched.
> PR coming shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18365) Documentation for Sampling is Incorrect

2016-11-08 Thread Bill Chambers (JIRA)
Bill Chambers created SPARK-18365:
-

 Summary: Documentation for Sampling is Incorrect
 Key: SPARK-18365
 URL: https://issues.apache.org/jira/browse/SPARK-18365
 Project: Spark
  Issue Type: Bug
Reporter: Bill Chambers


The parameter documentation is switched.

PR coming shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16234) Speculative Task may not be able to overwrite file

2016-06-27 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-16234:
--
Description: resolved...  (was: given spark.speculative set to true, I'm 
running a large spark job with parquet and savemode overwrite.

Spark will speculatively try to create a task to deal with a straggler. 
However, doing this comes with risk because EVEN THOUGH savemode overwrite is 
selected, if the straggler completes before the original task or the original 
task completes before the straggler then the job will fail due to the file 
already existing.

java.io.IOException: 
/...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet 
already exists)

> Speculative Task may not be able to overwrite file
> --
>
> Key: SPARK-16234
> URL: https://issues.apache.org/jira/browse/SPARK-16234
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Bill Chambers
>
> resolved...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16234) Speculative Task may not be able to overwrite file

2016-06-27 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers closed SPARK-16234.
-
Resolution: Resolved

> Speculative Task may not be able to overwrite file
> --
>
> Key: SPARK-16234
> URL: https://issues.apache.org/jira/browse/SPARK-16234
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Bill Chambers
>
> given spark.speculation set to true, I'm running a large Spark job with 
> Parquet and SaveMode.Overwrite.
> Spark will speculatively try to create a task to deal with a straggler. 
> However, doing this comes with risk because EVEN THOUGH savemode overwrite is 
> selected, if the straggler completes before the original task or the original 
> task completes before the straggler then the job will fail due to the file 
> already existing.
> java.io.IOException: 
> /...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet 
> already exists



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16234) Speculative Task may not be able to overwrite file

2016-06-27 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-16234:
--
Description: 
With spark.speculation set to true, I'm running a large Spark job with Parquet 
output and SaveMode.Overwrite.

Spark will speculatively launch a duplicate task to deal with a straggler. 
However, this is risky: even though SaveMode.Overwrite is selected, whichever 
of the two attempts finishes second fails because the file already exists, so 
the job can fail either way.

java.io.IOException: 
/...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet 
already exists

  was:
With spark.speculation set to true, I'm running a large Spark job with Parquet 
output and SaveMode.Overwrite.

Spark will speculatively launch a duplicate task to deal with this straggler. 
However, this is risky: even though SaveMode.Overwrite is selected, whichever 
of the two attempts finishes second fails because the file already exists, so 
the job can fail either way.

java.io.IOException: 
/...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet 
already exists


> Speculative Task may not be able to overwrite file
> --
>
> Key: SPARK-16234
> URL: https://issues.apache.org/jira/browse/SPARK-16234
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Bill Chambers
>
> With spark.speculation set to true, I'm running a large Spark job with 
> Parquet output and SaveMode.Overwrite.
> Spark will speculatively launch a duplicate task to deal with a straggler. 
> However, this is risky: even though SaveMode.Overwrite is selected, whichever 
> of the two attempts finishes second fails because the file already exists, so 
> the job can fail either way.
> java.io.IOException: 
> /...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet 
> already exists



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16234) Speculative Task may not be able to overwrite file

2016-06-27 Thread Bill Chambers (JIRA)
Bill Chambers created SPARK-16234:
-

 Summary: Speculative Task may not be able to overwrite file
 Key: SPARK-16234
 URL: https://issues.apache.org/jira/browse/SPARK-16234
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Bill Chambers


With spark.speculation set to true, I'm running a large Spark job with Parquet 
output and SaveMode.Overwrite.

Spark will speculatively launch a duplicate task to deal with this straggler. 
However, this is risky: even though SaveMode.Overwrite is selected, whichever 
of the two attempts finishes second fails because the file already exists, so 
the job can fail either way.

java.io.IOException: 
/...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet 
already exists
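
A minimal sketch of the setup described above; the paths, data, and the way 
speculation is enabled are illustrative, not taken from the original job:

// Hedged sketch of the described configuration and write mode. Speculation is
// typically set at submit time (e.g. spark-submit --conf spark.speculation=true).
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("speculative-overwrite-example")
  .config("spark.speculation", "true")
  .getOrCreate()

val df = spark.read.parquet("/path/to/input")   // hypothetical input
df.write
  .mode(SaveMode.Overwrite)                     // overwrite is selected...
  .parquet("/path/to/output")                   // ...yet the second of two attempts on a
                                                // partition can still hit "already exists"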



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16220) Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality

2016-06-27 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351686#comment-15351686
 ] 

Bill Chambers edited comment on SPARK-16220 at 6/27/16 7:53 PM:


Happy to take a look when it's all done :)


was (Author: bill_chambers):
[~hvanhovell] I imagine I should just resolve this as well?

> Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality
> --
>
> Key: SPARK-16220
> URL: https://issues.apache.org/jira/browse/SPARK-16220
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Bill Chambers
>
> After discussing this with [~marmbrus] and [~rxin], we've decided to revert 
> SPARK-15663. After doing some research it seems like this is an unnecessary 
> departure from 1.x functionality and does not have a reasonable substitute 
> that provides the same behavior.
> The first step is to revert the change. After doing that, there are a couple 
> of different approaches to getting at user-defined functions:
> 1. SHOW FUNCTIONS (shows all of them) + SHOW USER FUNCTIONS (Snowflake does 
> this)
> 2. SHOW FUNCTIONS + SHOW USER FUNCTIONS + SHOW ALL FUNCTIONS
> 3. SHOW FUNCTIONS + SHOW SYSTEM FUNCTIONS (or something similar)
> 4. SHOW FUNCTIONS + some column designating whether a function is 
> system-defined or user-defined.
> 1. This aligns with previous functionality and then supplements it with 
> something a bit more specific.
> 2. Is unclear because "all" is ambiguous; it's not obvious why the default 
> should refer only to user-defined functions. This doesn't seem like the right 
> approach.
> 3. Same kind of issue: I'm not sure why user functions should be the default 
> over system functions. That doesn't seem like the correct approach.
> 4. This one seems nice because it largely achieves #1, keeps existing 
> functionality, and then supplements it with more. It also allows you, for 
> example, to create your own set of date functions and then search them all in 
> one go, as opposed to searching system and then user functions. It would have 
> to return two columns, though, which could potentially be an issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16220) Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality

2016-06-27 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351686#comment-15351686
 ] 

Bill Chambers commented on SPARK-16220:
---

[~hvanhovell] I imagine I should just resolve this as well?

> Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality
> --
>
> Key: SPARK-16220
> URL: https://issues.apache.org/jira/browse/SPARK-16220
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Bill Chambers
>
> After discussing this with [~marmbrus] and [~rxin], we've decided to revert 
> SPARK-15663. After doing some research it seems like this is an unnecessary 
> departure from 1.x functionality and does not have a reasonable substitute 
> that provides the same behavior.
> The first step is to revert the change. After doing that, there are a couple 
> of different approaches to getting at user-defined functions:
> 1. SHOW FUNCTIONS (shows all of them) + SHOW USER FUNCTIONS (Snowflake does 
> this)
> 2. SHOW FUNCTIONS + SHOW USER FUNCTIONS + SHOW ALL FUNCTIONS
> 3. SHOW FUNCTIONS + SHOW SYSTEM FUNCTIONS (or something similar)
> 4. SHOW FUNCTIONS + some column designating whether a function is 
> system-defined or user-defined.
> 1. This aligns with previous functionality and then supplements it with 
> something a bit more specific.
> 2. Is unclear because "all" is ambiguous; it's not obvious why the default 
> should refer only to user-defined functions. This doesn't seem like the right 
> approach.
> 3. Same kind of issue: I'm not sure why user functions should be the default 
> over system functions. That doesn't seem like the correct approach.
> 4. This one seems nice because it largely achieves #1, keeps existing 
> functionality, and then supplements it with more. It also allows you, for 
> example, to create your own set of date functions and then search them all in 
> one go, as opposed to searching system and then user functions. It would have 
> to return two columns, though, which could potentially be an issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16220) Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality

2016-06-26 Thread Bill Chambers (JIRA)
Bill Chambers created SPARK-16220:
-

 Summary: Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 
Functionality
 Key: SPARK-16220
 URL: https://issues.apache.org/jira/browse/SPARK-16220
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.0.0, 2.0.1, 2.1.0
Reporter: Bill Chambers


After discussing this with [~marmbrus] and [~rxin], we've decided to revert 
SPARK-15663. After doing some research it seems like this is an unnecessary 
departure from 1.x functionality and does not have a reasonable substitute that 
provides the same behavior.

The first step is to revert the change. After doing that, there are a couple of 
different approaches to getting at user-defined functions:
1. SHOW FUNCTIONS (shows all of them) + SHOW USER FUNCTIONS (Snowflake does 
this)
2. SHOW FUNCTIONS + SHOW USER FUNCTIONS + SHOW ALL FUNCTIONS
3. SHOW FUNCTIONS + SHOW SYSTEM FUNCTIONS (or something similar)
4. SHOW FUNCTIONS + some column designating whether a function is 
system-defined or user-defined.

1. This aligns with previous functionality and then supplements it with 
something a bit more specific.
2. Is unclear because "all" is ambiguous; it's not obvious why the default 
should refer only to user-defined functions. This doesn't seem like the right 
approach.
3. Same kind of issue: I'm not sure why user functions should be the default 
over system functions. That doesn't seem like the correct approach.
4. This one seems nice because it largely achieves #1, keeps existing 
functionality, and then supplements it with more. It also allows you, for 
example, to create your own set of date functions and then search them all in 
one go, as opposed to searching system and then user functions. It would have 
to return two columns, though, which could potentially be an issue?
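
For reference, a rough sketch of how option 1 would look from Scala; SHOW 
FUNCTIONS exists today, SHOW USER FUNCTIONS is only the proposal above, and the 
registered function name and class are hypothetical:

// Hedged sketch of option 1, assuming a SparkSession named `spark`.
spark.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpper'")

// 1.6-style behavior: list every function, built-in and user-defined.
spark.sql("SHOW FUNCTIONS").show(100, truncate = false)

// Proposed supplement: list only user-defined functions.
spark.sql("SHOW USER FUNCTIONS").show(100, truncate = false)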



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16077) Python UDF may fail because of six

2016-06-20 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15340459#comment-15340459
 ] 

Bill Chambers commented on SPARK-16077:
---

[~davies]

Was this the one that I had reported?

> Python UDF may fail because of six
> --
>
> Key: SPARK-16077
> URL: https://issues.apache.org/jira/browse/SPARK-16077
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Davies Liu
>
> six or other package may break pickle.whichmodule() in pickle:
> https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15264) Spark 2.0 CSV Reader: Error on Blank Column Names

2016-05-10 Thread Bill Chambers (JIRA)
Bill Chambers created SPARK-15264:
-

 Summary: Spark 2.0 CSV Reader: Error on Blank Column Names
 Key: SPARK-15264
 URL: https://issues.apache.org/jira/browse/SPARK-15264
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Bill Chambers


When you read in a CSV file that starts with blank column names, the read fails 
when you specify that you want a header.

Pull request coming shortly.
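
A minimal reproduction sketch; the file path and contents below are made up for 
illustration:

// Hedged sketch: reading a CSV whose header row begins with a blank column name.
// Suppose /tmp/blank-header.csv contains:
//   ,name,age
//   1,alice,30
val df = spark.read
  .option("header", "true")
  .csv("/tmp/blank-header.csv")   // hypothetical path; fails as described above
df.printSchema()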



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14708) Repl Serialization Issue

2016-04-18 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246230#comment-15246230
 ] 

Bill Chambers commented on SPARK-14708:
---

cc:[~joshrosen]

> Repl Serialization Issue
> 
>
> Key: SPARK-14708
> URL: https://issues.apache.org/jira/browse/SPARK-14708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Bill Chambers
>Priority: Critical
>
> Run this code 6 times with the :paste command in Spark. You'll see 
> exponential slowdowns.
> class IntWrapper(val i: Int) extends Serializable {  }
> var pairs = sc.parallelize(Array((0, new IntWrapper(0))))
> for (_ <- 0 until 3) {
>   val wrapper = pairs.values.reduce((x,_) => x)
>   pairs = pairs.mapValues(_ => wrapper)
> }
> val result = pairs.collect()
> https://forums.databricks.com/questions/7729/delays-when-running-program-multiple-times-in-note.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14708) Repl Serialization Issue

2016-04-18 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-14708:
--
Description: 
Run this code 6 times with the :paste command in Spark. You'll see exponential 
slowdowns.

class IntWrapper(val i: Int) extends Serializable {  }
var pairs = sc.parallelize(Array((0, new IntWrapper(0))))
for (_ <- 0 until 3) {
  val wrapper = pairs.values.reduce((x,_) => x)
  pairs = pairs.mapValues(_ => wrapper)
}
val result = pairs.collect()

https://forums.databricks.com/questions/7729/delays-when-running-program-multiple-times-in-note.html

  was:
Run this code 6 times with the :paste command in Spark. You'll see exponential 
slowdowns.

class IntWrapper(val i: Int) extends Serializable {  }
 
var pairs = sc.parallelize(Array((0, new IntWrapper(0))))
 
for (_ <- 0 until 3) {
  val wrapper = pairs.values.reduce((x,_) => x)
  pairs = pairs.mapValues(_ => wrapper)
}
val result = pairs.collect()


https://forums.databricks.com/questions/7729/delays-when-running-program-multiple-times-in-note.html


> Repl Serialization Issue
> 
>
> Key: SPARK-14708
> URL: https://issues.apache.org/jira/browse/SPARK-14708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Bill Chambers
>Priority: Critical
>
> Run this code 6 times with the :paste command in Spark. You'll see 
> exponential slowdowns.
> class IntWrapper(val i: Int) extends Serializable {  }
> var pairs = sc.parallelize(Array((0, new IntWrapper(0))))
> for (_ <- 0 until 3) {
>   val wrapper = pairs.values.reduce((x,_) => x)
>   pairs = pairs.mapValues(_ => wrapper)
> }
> val result = pairs.collect()
> https://forums.databricks.com/questions/7729/delays-when-running-program-multiple-times-in-note.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14708) Repl Serialization Issue

2016-04-18 Thread Bill Chambers (JIRA)
Bill Chambers created SPARK-14708:
-

 Summary: Repl Serialization Issue
 Key: SPARK-14708
 URL: https://issues.apache.org/jira/browse/SPARK-14708
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Bill Chambers
Priority: Critical


Run this code 6 times with the :paste command in Spark. You'll see exponential 
slowdowns.

class IntWrapper(val i: Int) extends Serializable {  }
 
var pairs = sc.parallelize(Array((0, new IntWrapper(0))))
 
for (_ <- 0 until 3) {
  val wrapper = pairs.values.reduce((x,_) => x)
  pairs = pairs.mapValues(_ => wrapper)
}
val result = pairs.collect()


https://forums.databricks.com/questions/7729/delays-when-running-program-multiple-times-in-note.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13214) Fix dynamic allocation docs

2016-02-05 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-13214:
--
Description: Update the docs to reflect that dynamicAllocation is available 
for all cluster managers
Summary: Fix dynamic allocation docs  (was: Update docs to reflect 
dynamicAllocation to be true)

> Fix dynamic allocation docs
> ---
>
> Key: SPARK-13214
> URL: https://issues.apache.org/jira/browse/SPARK-13214
> Project: Spark
>  Issue Type: Documentation
>Reporter: Bill Chambers
>Priority: Trivial
>
> Update the docs to reflect that dynamicAllocation is available for all 
> cluster managers
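
For reference, a sketch of the settings this documentation change concerns; the 
keys are the standard dynamic-allocation options and the values are 
illustrative:

// Hedged sketch: enabling dynamic allocation programmatically.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("dynamic-allocation-example")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")        // required by dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")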



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13214) Update docs to reflect dynamicAllocation to be true

2016-02-05 Thread Bill Chambers (JIRA)
Bill Chambers created SPARK-13214:
-

 Summary: Update docs to reflect dynamicAllocation to be true
 Key: SPARK-13214
 URL: https://issues.apache.org/jira/browse/SPARK-13214
 Project: Spark
  Issue Type: Documentation
Reporter: Bill Chambers
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11964) Create user guide section explaining export/import

2015-12-02 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035480#comment-15035480
 ] 

Bill Chambers commented on SPARK-11964:
---

Quick question: am I to assume that all pieces mentioned in this JIRA, 
https://issues.apache.org/jira/browse/SPARK-6725, are to be included in the new 
release [and the user guide]?

> Create user guide section explaining export/import
> --
>
> Key: SPARK-11964
> URL: https://issues.apache.org/jira/browse/SPARK-11964
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> I'm envisioning a single section in the main guide explaining how it works 
> with an example and noting major missing coverage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11964) Create user guide section explaining export/import

2015-12-02 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035480#comment-15035480
 ] 

Bill Chambers edited comment on SPARK-11964 at 12/2/15 8:52 AM:


Quick question: am I to assume that all pieces mentioned in this JIRA, 
https://issues.apache.org/jira/browse/SPARK-6725, are to be included, even those 
that are unresolved, in the new release [and the user guide]?


was (Author: bill_chambers):
Quick question: am I to assume that all pieces mentioned in this JIRA, 
https://issues.apache.org/jira/browse/SPARK-6725, are to be included in the new 
release [and the user guide]?

> Create user guide section explaining export/import
> --
>
> Key: SPARK-11964
> URL: https://issues.apache.org/jira/browse/SPARK-11964
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> I'm envisioning a single section in the main guide explaining how it works 
> with an example and noting major missing coverage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11964) Create user guide section explaining export/import

2015-12-01 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034178#comment-15034178
 ] 

Bill Chambers edited comment on SPARK-11964 at 12/1/15 8:27 PM:


Happy to help out with this. Should this belong in a new file or should it just 
be a part of one that already exists?
https://github.com/apache/spark/tree/master/docs

-Since 
[pmml-export|https://github.com/apache/spark/blob/master/docs/mllib-pmml-model-export.md]
 is its own file, it seems to me that in the guide it might be best to just 
have a new file, and they would follow one another in [the 
guide|https://github.com/apache/spark/blob/master/docs/mllib-guide.md]. 
However, I defer to your judgement! Let me know and I'll try to get it written 
up today.-

It seems like the best place might actually be at the bottom of the ML guide 
since all of this just refers to the ML API.

https://github.com/apache/spark/blob/master/docs/ml-guide.md


was (Author: bill_chambers):
Happy to help out with this. Should this belong in a new file or should it just 
be a part of one that already exists?
https://github.com/apache/spark/tree/master/docs

Since 
[pmml-export|https://github.com/apache/spark/blob/master/docs/mllib-pmml-model-export.md]
 is its own file, it seems to me that in the guide it might be best to just 
have a new file, and they would follow one another in [the 
guide|https://github.com/apache/spark/blob/master/docs/mllib-guide.md]. 
However, I defer to your judgement! Let me know and I'll try to get it written 
up today.

> Create user guide section explaining export/import
> --
>
> Key: SPARK-11964
> URL: https://issues.apache.org/jira/browse/SPARK-11964
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> I'm envisioning a single section in the main guide explaining how it works 
> with an example and noting major missing coverage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11964) Create user guide section explaining export/import

2015-12-01 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034178#comment-15034178
 ] 

Bill Chambers commented on SPARK-11964:
---

Happy to help out with this. Should this belong in a new file or should it just 
be a part of one that already exists?
https://github.com/apache/spark/tree/master/docs

Since 
[pmml-export|https://github.com/apache/spark/blob/master/docs/mllib-pmml-model-export.md]
 is its own file, it seems to me that in the guide it might be best to just 
have a new file, and they would follow one another in [the 
guide|https://github.com/apache/spark/blob/master/docs/mllib-guide.md]. 
However, I defer to your judgement! Let me know and I'll try to get it written 
up today.

> Create user guide section explaining export/import
> --
>
> Key: SPARK-11964
> URL: https://issues.apache.org/jira/browse/SPARK-11964
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> I'm envisioning a single section in the main guide explaining how it works 
> with an example and noting major missing coverage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7130) spark.ml RandomForest* should always do bootstrapping

2015-11-06 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994307#comment-14994307
 ] 

Bill Chambers edited comment on SPARK-7130 at 11/6/15 7:39 PM:
---

Looking at this issue, the change needs to occur within the [RandomForest 
File|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala]

Specifically around [lines 
88|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L88]
 and 91.

I'd like to submit a pull request but want to make sure that there's nothing 
else I need to be aware of! Is there anything else that needs to change?


was (Author: bill_chambers):
Looking at this issue, the change needs to occur within the [RandomForest 
File|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala]

Specifically around [lines 
88|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L88]
 and 91.

I'd like to submit a pull request but want to make sure that there's nothing 
else I need to be aware of!

> spark.ml RandomForest* should always do bootstrapping
> -
>
> Key: SPARK-7130
> URL: https://issues.apache.org/jira/browse/SPARK-7130
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Currently, spark.ml RandomForest does not do bootstrapping if numTrees = 1.  
> For consistency and a simpler API, it should always do bootstrapping.  The 
> current behavior is an artifact of the old API, in which RandomForest and 
> DecisionTree share the same implementation.  This change should happen after 
> the implementation is moved to spark.ml (which we need to do so that the 
> implementation can be generalized).
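
To make the proposed change concrete, a rough before/after sketch; the variable 
names and surrounding code are recalled from RandomForest.scala and should be 
verified against the lines linked above:

// Hedged sketch of the change under discussion; not verified against current source.
// Today (roughly, around the referenced lines):
//   val withReplacement = if (numTrees > 1) true else false
//   val baggedInput = BaggedPoint.convertToBaggedRDD(
//     treeInput, strategy.subsamplingRate, numTrees, withReplacement, seed)
//
// Proposed: always bootstrap, regardless of the number of trees.
//   val withReplacement = true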



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7130) spark.ml RandomForest* should always do bootstrapping

2015-11-06 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994307#comment-14994307
 ] 

Bill Chambers commented on SPARK-7130:
--

Looking at this issue, the change needs to occur within the [RandomForest 
File|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala]

Specifically around [lines 
88|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L88]
 and 91.

I'd like to submit a pull request but want to make sure that there's nothing 
else I need to be aware of!

> spark.ml RandomForest* should always do bootstrapping
> -
>
> Key: SPARK-7130
> URL: https://issues.apache.org/jira/browse/SPARK-7130
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Currently, spark.ml RandomForest does not do bootstrapping if numTrees = 1.  
> For consistency and a simpler API, it should always do bootstrapping.  The 
> current behavior is an artifact of the old API, in which RandomForest and 
> DecisionTree share the same implementation.  This change should happen after 
> the implementation is moved to spark.ml (which we need to do so that the 
> implementation can be generalized).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2015-09-11 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741715#comment-14741715
 ] 

Bill Chambers edited comment on SPARK-10528 at 9/11/15 11:17 PM:
-

This came up for me when I used the spark_ec2 launcher. When I tried to enter 
the Spark shell I received the same error on AWS.

Running:
ephemeral-hdfs/bin/hadoop fs -chmod 777 /tmp/hive

allowed the SQLContext to get pulled in and created correctly.

It's a workaround for now, but something that should probably be fixed in the 
future. 



was (Author: bill_chambers):
This came up for me when I used the spark_ec2 launcher. When I tried to enter 
the Spark shell I received the same error.

Running:
ephemeral-hdfs/bin/hadoop fs -chmod 777 /tmp/hive

allowed the SQLContext to get pulled in and created correctly.

It's a workaround for now, but something that should probably be fixed in the 
future. 


> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2015-09-11 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741715#comment-14741715
 ] 

Bill Chambers commented on SPARK-10528:
---

This came up for me when I used the spark_ec2 launcher. When I tried to enter 
the Spark shell I received the same error.

Running:
ephemeral-hdfs/bin/hadoop fs -chmod 777 /tmp/hive

allowed the SQLContext to get pulled in and created correctly.

It's a workaround for now, but something that should probably be fixed in the 
future. 


> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org