[jira] [Commented] (SPARK-27780) Shuffle server & client should be versioned to enable smoother upgrade

2020-06-30 Thread Jason Moore (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149083#comment-17149083
 ] 

Jason Moore commented on SPARK-27780:
-

I encountered this on 3.0.0 running with a much older shuffle service (I think 
around 2.1.1).  Is there any documentation on which shuffle service versions 
3.0.0 will work with?
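
For what it's worth, a hedged pointer rather than an answer: Spark 3.0 ships a 
compatibility setting for fetching shuffle blocks from older external shuffle 
services, but whether it reaches back as far as a 2.1.x service is exactly the 
documentation gap being asked about above.  A minimal sketch of setting it:

{code:scala}
import org.apache.spark.SparkConf

// Assumption flagged: spark.shuffle.useOldFetchProtocol exists in 3.0 for
// compatibility with older external shuffle services; how old a service it
// actually covers is the open question in this comment.
val conf = new SparkConf()
  .set("spark.shuffle.useOldFetchProtocol", "true")
{code}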

> Shuffle server & client should be versioned to enable smoother upgrade
> --
>
> Key: SPARK-27780
> URL: https://issues.apache.org/jira/browse/SPARK-27780
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Imran Rashid
>Priority: Major
>
> The external shuffle service is often upgraded at a different time than spark 
> itself.  However, this causes problems when the protocol changes between the 
> shuffle service and the spark runtime -- this forces users to upgrade 
> everything simultaneously.
> We should add versioning to the shuffle client & server, so they know what 
> messages the other will support.  This would allow better handling of mixed 
> versions, from better error msgs to allowing some mismatched versions (with 
> reduced capabilities).
> This originally came up in a discussion here: 
> https://github.com/apache/spark/pull/24565#issuecomment-493496466
> There are a few ways we could do the versioning which we still need to 
> discuss:
> 1) Version specified by config.  This allows for mixed versions across the 
> cluster and rolling upgrades.  It also will let a spark 3.0 client talk to a 
> 2.4 shuffle service.  But, may be a nuisance for users to get this right.
> 2) Auto-detection during registration with local shuffle service.  This makes 
> the versioning easy for the end user, and can even handle a 2.4 shuffle 
> service though it does not support the new versioning.  However, it will not 
> handle a rolling upgrade correctly -- if the local shuffle service has been 
> upgraded, but other nodes in the cluster have not, it will get the version 
> wrong.
> 3) Exchange versions per-connection.  When a connection is opened, the server 
> & client could first exchange messages with their versions, so they know how 
> to continue communication after that.
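
To make option 3 concrete, a rough hypothetical sketch of a per-connection 
version handshake follows; the message and object names (ShuffleVersionRequest, 
ShuffleVersionResponse, VersionNegotiation) are invented for illustration and 
are not part of any Spark API.

{code:scala}
// Hypothetical sketch only: exchange versions when a connection opens, then
// continue on the highest protocol version both sides support.
case class ShuffleVersionRequest(protocolVersion: Int)
case class ShuffleVersionResponse(protocolVersion: Int)

object VersionNegotiation {
  val supportedVersions: Range = 1 to 3  // illustrative values

  // Server side: answer with the highest version both ends understand.
  def respond(req: ShuffleVersionRequest): ShuffleVersionResponse =
    ShuffleVersionResponse(math.min(req.protocolVersion, supportedVersions.max))

  // Client side: decide how to continue, possibly with reduced capabilities.
  def negotiate(resp: ShuffleVersionResponse): Int =
    if (supportedVersions.contains(resp.protocolVersion)) resp.protocolVersion
    else sys.error(s"Unsupported shuffle protocol version ${resp.protocolVersion}")
}
{code}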



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32136) Spark producing incorrect groupBy results when key is a struct with nullable properties

2020-06-30 Thread Jason Moore (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149073#comment-17149073
 ] 

Jason Moore commented on SPARK-32136:
-

Here is a similar test, and why it's a problem for what I'm needing to do:


{noformat}
case class C(d: Double)
case class B(c: Option[C])
case class A(b: Option[B])
val df = Seq(
  A(None),
  A(Some(B(None))),
  A(Some(B(Some(C(1.0)))))
).toDF
val res = df.groupBy("b").agg(count("*"))

> res.show
+-------+--------+
|      b|count(1)|
+-------+--------+
|   [[]]|       2|
|[[1.0]]|       1|
+-------+--------+

> res.as[(Option[B], Long)].collect
java.lang.RuntimeException: Error while decoding: 
java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Double", name: "d")
- option value class: "C"
- field (class: "scala.Option", name: "c")
- option value class: "B"
- field (class: "scala.Option", name: "_1")
- root class: "scala.Tuple2"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
newInstance(class scala.Tuple2)
{noformat}

Interestingly (and potentially useful to know), using an Int instead of a 
Double above works as expected.
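
For completeness, a minimal sketch of the variant the decoder's error message 
itself suggests (a nullable leaf field); the C2/B2/A2 names are just renamed 
copies for illustration, it assumes a spark-shell session like the repro above, 
and it is untested against the grouping behaviour in question:

{code:scala}
// Sketch only: use Option[Double] for the leaf instead of scala.Double so that
// decoding the "b" key back to Option[B2] does not hit a non-nullable Double.
case class C2(d: Option[Double])
case class B2(c: Option[C2])
case class A2(b: Option[B2])

val df2 = Seq(
  A2(None),
  A2(Some(B2(None))),
  A2(Some(B2(Some(C2(Some(1.0))))))
).toDF

val res2 = df2.groupBy("b").agg(count("*"))
res2.as[(Option[B2], Long)].collect  // the step that threw above
{code}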

> Spark producing incorrect groupBy results when key is a struct with nullable 
> properties
> ---
>
> Key: SPARK-32136
> URL: https://issues.apache.org/jira/browse/SPARK-32136
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Moore
>Priority: Major
>
> I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm 
> noticing a behaviour change in a particular aggregation we're doing, and I 
> think I've tracked it down to how Spark is now treating nullable properties 
> within the column being grouped by.
>  
> Here's a simple test I've been able to set up to repro it:
>  
> {code:scala}
> case class B(c: Option[Double])
> case class A(b: Option[B])
> val df = Seq(
> A(None),
> A(Some(B(None))),
> A(Some(B(Some(1.0))))
> ).toDF
> val res = df.groupBy("b").agg(count("*"))
> {code}
> Spark 2.4.6 has the expected result:
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       1|
> | null|       1|
> |[1.0]|       1|
> +-----+--------+
> > res.collect.foreach(println)
> [[null],1]
> [null,1]
> [[1.0],1]
> {noformat}
> But Spark 3.0.0 has an unexpected result:
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       2|
> |[1.0]|       1|
> +-----+--------+
> > res.collect.foreach(println)
> [[null],2]
> [[1.0],1]
> {noformat}
> Notice how it has keyed one of the values in b as `[null]`; that is, an 
> instance of B with a null value for the `c` property instead of a null for 
> the overall value itself.
> Is this an intended change?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32136) Spark producing incorrect groupBy results when key is a struct with nullable properties

2020-06-30 Thread Jason Moore (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Moore updated SPARK-32136:

Description: 
I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm 
noticing a behaviour change in a particular aggregation we're doing, and I 
think I've tracked it down to how Spark is now treating nullable properties 
within the column being grouped by.
 
Here's a simple test I've been able to set up to repro it:
 

{code:scala}
case class B(c: Option[Double])
case class A(b: Option[B])
val df = Seq(
A(None),
A(Some(B(None))),
A(Some(B(Some(1.0))))
).toDF
val res = df.groupBy("b").agg(count("*"))
{code}

Spark 2.4.6 has the expected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       1|
| null|       1|
|[1.0]|       1|
+-----+--------+

> res.collect.foreach(println)
[[null],1]
[null,1]
[[1.0],1]
{noformat}

But Spark 3.0.0 has an unexpected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       2|
|[1.0]|       1|
+-----+--------+

> res.collect.foreach(println)
[[null],2]
[[1.0],1]
{noformat}

Notice how it has keyed one of the values in b as `[null]`; that is, an 
instance of B with a null value for the `c` property instead of a null for the 
overall value itself.

Is this an intended change?

  was:
I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm 
noticing a behaviour change in a particular aggregation we're doing, and I 
think I've tracked it down to how Spark is now treating nullable properties 
within the column being grouped by.
 
Here's a simple test I've been able to set up to repro it:
 

{code:scala}
case class B(c: Option[Double])
case class A(b: Option[B])
val df = Seq(
A(None),
A(Some(B(None))),
A(Some(B(Some(1.0))))
).toDF
val res = df.groupBy("b").agg(count("*"))
{code}

Spark 2.4.6 has the expected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       1|
| null|       1|
|[1.0]|       1|
+-----+--------+
> res.collect.foreach(println)
[[null],1]
[null,1]
[[1.0],1]
{noformat}

But Spark 3.0.0 has an unexpected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       2|
|[1.0]|       1|
+-----+--------+

> res.collect.foreach(println)
[[null],2]
[[1.0],1]
{noformat}

Notice how it has keyed one of the values in b as `[null]`; that is, an 
instance of B with a null value for the `c` property instead of a null for the 
overall value itself.

Is this an intended change?


> Spark producing incorrect groupBy results when key is a struct with nullable 
> properties
> ---
>
> Key: SPARK-32136
> URL: https://issues.apache.org/jira/browse/SPARK-32136
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Moore
>Priority: Major
>
> I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm 
> noticing a behaviour change in a particular aggregation we're doing, and I 
> think I've tracked it down to how Spark is now treating nullable properties 
> within the column being grouped by.
>  
> Here's a simple test I've been able to set up to repro it:
>  
> {code:scala}
> case class B(c: Option[Double])
> case class A(b: Option[B])
> val df = Seq(
> A(None),
> A(Some(B(None))),
> A(Some(B(Some(1.0))))
> ).toDF
> val res = df.groupBy("b").agg(count("*"))
> {code}
> Spark 2.4.6 has the expected result:
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       1|
> | null|       1|
> |[1.0]|       1|
> +-----+--------+
> > res.collect.foreach(println)
> [[null],1]
> [null,1]
> [[1.0],1]
> {noformat}
> But Spark 3.0.0 has an unexpected result:
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       2|
> |[1.0]|       1|
> +-----+--------+
> > res.collect.foreach(println)
> [[null],2]
> [[1.0],1]
> {noformat}
> Notice how it has keyed one of the values in b as `[null]`; that is, an 
> instance of B with a null value for the `c` property instead of a null for 
> the overall value itself.
> Is this an intended change?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32136) Spark producing incorrect groupBy results when key is a struct with nullable properties

2020-06-30 Thread Jason Moore (Jira)
Jason Moore created SPARK-32136:
---

 Summary: Spark producing incorrect groupBy results when key is a 
struct with nullable properties
 Key: SPARK-32136
 URL: https://issues.apache.org/jira/browse/SPARK-32136
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Jason Moore


I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm 
noticing a behaviour change in a particular aggregation we're doing, and I 
think I've tracked it down to how Spark is now treating nullable properties 
within the column being grouped by.
 
Here's a simple test I've been able to set up to repro it:
 

{code:scala}
case class B(c: Option[Double])
case class A(b: Option[B])
val df = Seq(
A(None),
A(Some(B(None))),
A(Some(B(Some(1.0))))
).toDF
val res = df.groupBy("b").agg(count("*"))
{code}

Spark 2.4.6 has the expected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       1|
| null|       1|
|[1.0]|       1|
+-----+--------+
> res.collect.foreach(println)
[[null],1]
[null,1]
[[1.0],1]
{noformat}

But Spark 3.0.0 has an unexpected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       2|
|[1.0]|       1|
+-----+--------+

> res.collect.foreach(println)
[[null],2]
[[1.0],1]
{noformat}

Notice how it has keyed one of the values in b as `[null]`; that is, an 
instance of B with a null value for the `c` property instead of a null for the 
overall value itself.

Is this an intended change?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided

2018-11-15 Thread Jason Moore (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688957#comment-16688957
 ] 

Jason Moore edited comment on SPARK-26027 at 11/16/18 3:10 AM:
---

I was originally going to withdraw the ticket when I discovered my actual 
issue, and I'm happy for that to happen.  The main concern I was left with was 
that the build scripts download based on the default Scala version (2.11.12 on 
the v2.4.0 tag) rather than taking the profile flag into account.  If you don't 
see this as an issue to worry about, close this ticket and forget all about it.
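
For reference, a sketch of the build invocation being attempted here, roughly 
per the Spark 2.4 build documentation (the comments are my reading of the 
behaviour described above, not verified against the scripts):

{noformat}
# change-scala-version.sh rewrites scala.binary.version in the POMs, while
# ./build/mvn still fetches its own Scala based on the default scala.version.
./dev/change-scala-version.sh 2.12
./build/mvn -Pscala-2.12 -DskipTests clean package
{noformat}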


was (Author: jasonmoore2k):
I was originally going to withdraw the ticket when I discovered my actual 
issue, and happy for that to happen.  The main concern I was left with was that 
the build scripts download based on the default Scala version (2.11.12 on 
v2.4.0 tag) rather than taking the profile flag into account.  If you don't see 
this as an issue to worry about, close this ticket and forget all about it.

> Unable to build Spark for Scala 2.12 with Maven script provided
> ---
>
> Key: SPARK-26027
> URL: https://issues.apache.org/jira/browse/SPARK-26027
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Jason Moore
>Priority: Minor
>
> In ./build/mvn, the scala.version property from pom.xml is used to determine 
> which Scala library to fetch but it doesn't seem to use the value under the 
> scala-2.12 profile even if that is set.
> The result is that the maven build still uses scala-library 2.11.12 and 
> compilation fails.
> Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12 but I think 
> that only updates scala.binary.version)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided

2018-11-15 Thread Jason Moore (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688957#comment-16688957
 ] 

Jason Moore commented on SPARK-26027:
-

I was originally going to withdraw the ticket when I discovered my actual 
issue, and happy for that to happen.  The main concern I was left with was that 
the build scripts download based on the default Scala version (2.11.12 on 
v2.4.0 tag) rather than taking the profile flag into account.  If you don't see 
this as an issue to worry about, close this ticket and forget all about it.

> Unable to build Spark for Scala 2.12 with Maven script provided
> ---
>
> Key: SPARK-26027
> URL: https://issues.apache.org/jira/browse/SPARK-26027
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Jason Moore
>Priority: Minor
>
> In ./build/mvn, the scala.version property from pom.xml is used to determine 
> which Scala library to fetch but it doesn't seem to use the value under the 
> scala-2.12 profile even if that is set.
> The result is that the maven build still uses scala-library 2.11.12 and 
> compilation fails.
> Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12 but I think 
> that only updates scala.binary.version)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided

2018-11-12 Thread Jason Moore (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684567#comment-16684567
 ] 

Jason Moore edited comment on SPARK-26027 at 11/13/18 12:49 AM:


Actually, the fact that scala-library 2.11.12 is downloaded to launch the zinc 
server turned out to be a red herring; my issue was not properly setting the 
profile to scala-2.12.  I'm not sure of the repercussions of the zinc server 
started by ./build/mvn using 2.11.12 though (see 
https://github.com/apache/spark/blob/v2.4.0/build/mvn#L155), so maybe someone 
with more intimate knowledge of it can comment.


was (Author: jasonmoore2k):
Actually, the fact that scala-library 2.11.12 is downloaded to launch the zinc 
server turned out to be a red herring; my issue was not properly setting the 
profile to scala-2.12.  I'm not sure of the repercussions of the zinc server 
started by ./build/mvn using 2.11.12 though (see 
https://github.com/apache/spark/blob/master/build/mvn#L155) so maybe someone 
with more intimate knowledge of it can comment.

> Unable to build Spark for Scala 2.12 with Maven script provided
> ---
>
> Key: SPARK-26027
> URL: https://issues.apache.org/jira/browse/SPARK-26027
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Jason Moore
>Priority: Minor
>
> In ./build/mvn, the scala.version property from pom.xml is used to determine 
> which Scala library to fetch but it doesn't seem to use the value under the 
> scala-2.12 profile even if that is set.
> The result is that the maven build still uses scala-library 2.11.12 and 
> compilation fails.
> Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12 but I think 
> that only updates scala.binary.version)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided

2018-11-12 Thread Jason Moore (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684567#comment-16684567
 ] 

Jason Moore commented on SPARK-26027:
-

Actually, the fact that scala-library 2.11.12 is downloaded to launch the zinc 
server turned out to be a red herring; my issue was not properly setting the 
profile to scala-2.12.  I'm not sure of the repercussions of the zinc server 
started by ./build/mvn using 2.11.12 though (see 
https://github.com/apache/spark/blob/master/build/mvn#L155) so maybe someone 
with more intimate knowledge of it can comment.

> Unable to build Spark for Scala 2.12 with Maven script provided
> ---
>
> Key: SPARK-26027
> URL: https://issues.apache.org/jira/browse/SPARK-26027
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Jason Moore
>Priority: Minor
>
> In ./build/mvn, the scala.version property from pom.xml is used to determine 
> which Scala library to fetch but it doesn't seem to use the value under the 
> scala-2.12 profile even if that is set.
> The result is that the maven build still uses scala-library 2.11.12 and 
> compilation fails.
> Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12 but I think 
> that only updates scala.binary.version)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided

2018-11-12 Thread Jason Moore (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Moore updated SPARK-26027:

Priority: Minor  (was: Major)

> Unable to build Spark for Scala 2.12 with Maven script provided
> ---
>
> Key: SPARK-26027
> URL: https://issues.apache.org/jira/browse/SPARK-26027
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Jason Moore
>Priority: Minor
>
> In ./build/mvn, the scala.version property from pom.xml is used to determine 
> which Scala library to fetch but it doesn't seem to use the value under the 
> scala-2.12 profile even if that is set.
> The result is that the maven build still uses scala-library 2.11.12 and 
> compilation fails.
> Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12 but I think 
> that only updates scala.binary.version)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided

2018-11-12 Thread Jason Moore (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Moore updated SPARK-26027:

Description: 
In ./build/mvn, the scala.version property from pom.xml is used to determine 
which Scala library to fetch but it doesn't seem to use the value under the 
scala-2.12 profile even if that is set.

The result is that the maven build still uses scala-library 2.11.12 and 
compilation fails.

Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12 but I think 
that only updates scala.binary.version)

  was:
In ./build/mvn, the scala.version property from pom.xml is used to determine 
which Scala library to fetch but it doesn't seem to use the value under the 
scala-2.12 profile even if that is set.

The result is that the maven build still uses scala-library 2.11.12 and 
compilation fails.

Am I missing a step?


> Unable to build Spark for Scala 2.12 with Maven script provided
> ---
>
> Key: SPARK-26027
> URL: https://issues.apache.org/jira/browse/SPARK-26027
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Jason Moore
>Priority: Major
>
> In ./build/mvn, the scala.version property from pom.xml is used to determine 
> which Scala library to fetch but it doesn't seem to use the value under the 
> scala-2.12 profile even if that is set.
> The result is that the maven build still uses scala-library 2.11.12 and 
> compilation fails.
> Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12 but I think 
> that only updates scala.binary.version)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided

2018-11-12 Thread Jason Moore (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Moore updated SPARK-26027:

Description: 
In ./build/mvn, the scala.version property from pom.xml is used to determine 
which Scala library to fetch but it doesn't seem to use the value under the 
scala-2.12 profile even if that is set.

The result is that the maven build still uses scala-library 2.11.12 and 
compilation fails.

Am I missing a step?

  was:
In ./build/mvn, the scala.version property from pom.xml is used to determine 
which Scala library to fetch but it doesn't seem to use the value under the 
scala-2.12 profile even if that is set.

 

The impact is that the maven build still uses scala-library 2.11.12 and 
compilation fails.

Am I missing a step?


> Unable to build Spark for Scala 2.12 with Maven script provided
> ---
>
> Key: SPARK-26027
> URL: https://issues.apache.org/jira/browse/SPARK-26027
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Jason Moore
>Priority: Major
>
> In ./build/mvn, the scala.version property from pom.xml is used to determine 
> which Scala library to fetch but it doesn't seem to use the value under the 
> scala-2.12 profile even if that is set. 
> The result is that the maven build still uses scala-library 2.11.12 and 
> compilation fails.
> Am I missing a step?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided

2018-11-12 Thread Jason Moore (JIRA)
Jason Moore created SPARK-26027:
---

 Summary: Unable to build Spark for Scala 2.12 with Maven script 
provided
 Key: SPARK-26027
 URL: https://issues.apache.org/jira/browse/SPARK-26027
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.0
Reporter: Jason Moore


In ./build/mvn, the scala.version property from pom.xml is used to determine 
which Scala library to fetch but it doesn't seem to use the value under the 
scala-2.12 profile even if that is set.

 

The impact is that the maven build still uses scala-library 2.11.12 and 
compilation fails.

Am I missing a step?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20411) New features for expression.scalalang.typed

2017-05-02 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992517#comment-15992517
 ] 

Jason Moore commented on SPARK-20411:
-

And, ideally, anything else within org.apache.spark.sql.functions (e.g. 
countDistinct).  We're looking to replace our use of DataFrames with Datasets, 
which means finding a replacement for all the aggregation functions that we 
use.  If I end up putting together some functions myself, I'll pop back here to 
contribute them.
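
In case it helps anyone landing here before such functions exist, a rough 
sketch of one using the public Aggregator API; Token and its score field are 
taken from the example in the ticket description below, and this is an 
illustration rather than a contribution:

{code:scala}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Sketch of a typed max along the lines of the built-in typed.sum/avg/count.
class TypedMax[IN](f: IN => Double) extends Aggregator[IN, Double, Double] {
  def zero: Double = Double.NegativeInfinity
  def reduce(buffer: Double, input: IN): Double = math.max(buffer, f(input))
  def merge(b1: Double, b2: Double): Double = math.max(b1, b2)
  def finish(reduction: Double): Double = reduction
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage mirroring the ticket's example:
// dataset.groupByKey(_.productId).agg(new TypedMax[Token](_.score).toColumn)
{code}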

> New features for expression.scalalang.typed
> ---
>
> Key: SPARK-20411
> URL: https://issues.apache.org/jira/browse/SPARK-20411
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Loic Descotte
>Priority: Minor
>
> In Spark 2 it is possible to use typed expressions for aggregation methods: 
> {code}
> import org.apache.spark.sql.expressions.scalalang._ 
> dataset.groupByKey(_.productId).agg(typed.sum[Token](_.score)).toDF("productId",
>  "sum").orderBy('productId).show
> {code}
> It seems that only avg, count and sum are defined : 
> https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/expressions/scalalang/typed.html
> It is very nice to be able to use a typesafe DSL, but it would be good to 
> have more possibilities, like min and max functions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2017-02-20 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875115#comment-15875115
 ] 

Jason Moore commented on SPARK-18105:
-

I've hit the same using a very recent build from branch-2.1 
(b083ec5115f53a79ac54b85024c358510a03a459).

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> When lz4 is used to compress the shuffle files, it may fail to decompress it 
> as "stream is corrupt"
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>   at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> https://github.com/jpountz/lz4-java/issues/89



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting

2017-01-24 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15836847#comment-15836847
 ] 

Jason Moore commented on SPARK-17436:
-

Ahh, I think you are correct.  The issue on the write seems to be resolved as 
of 2.1+ (checked using parquet-tools).

On a read, some optimizations were made in 
https://github.com/apache/spark/pull/12095 that mean some files may get merged 
into a single partition (which will almost certainly not be sorted).  I 
mentioned the issue on the mailing list, but had missed the (useful) responses: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Sorting-within-partitions-is-not-maintained-in-parquet-td18618.html

If I set spark.sql.files.openCostInBytes very high, the re-read test now seems 
fine.

Thanks!
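
For anyone following along, a minimal sketch of the setting mentioned above, 
reusing the maybeSorted.parquet path from the test quoted in this thread (the 
128MB value is purely illustrative, not a recommendation):

{code:scala}
// Raise the assumed cost of opening a file so the reader is less inclined to
// coalesce many small parquet files into one partition (which loses their order).
spark.conf.set("spark.sql.files.openCostInBytes", (128L * 1024 * 1024).toString)
val reread = spark.read.parquet("maybeSorted.parquet")
{code}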

> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>Priority: Minor
>
> update
> ***
> It seems that in spark 2.1 code, the sorting issue is resolved.
> The sorter does consider inner sorting in the sorting key - but I think it 
> will be faster to just insert the rows to a list in a hash map.
> ***
> When using partition by, the data writer can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method when too many files are opened (configurable), it 
> starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows 
> again from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort if 
> you will).
> I think the best way to fix it is to stop using a sorter, and just put the 
> rows in a map using the partition key as the key and an arraylist as the 
> value, and then just walk through all the keys and write them in the original 
> order - this will probably be faster as there is no need for ordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting

2017-01-19 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830889#comment-15830889
 ] 

Jason Moore commented on SPARK-17436:
-

Hi [~ran.h...@optimalplus.com], [~srowen],

How sure are we that this is not a problem anymore?  I've been testing some of 
my application code that relies on preserving ordering (for performance) on 2.1 
(built from the head of branch-2.1) and it still seems that after sorting 
within partitions, saving to parquet, then re-reading from parquet, the rows 
are not correctly ordered.

{code}
val random = new scala.util.Random
val raw = sc.parallelize((1 to 100).map { _ => random.nextInt }).toDF("A").repartition(100)
raw.sortWithinPartitions("A").write.mode("overwrite").parquet("maybeSorted.parquet")
val maybeSorted = spark.read.parquet("maybeSorted.parquet")
val isSorted = maybeSorted.mapPartitions { rows =>
  val isSorted = rows.map(_.getInt(0)).sliding(2).forall { case List(x, y) => x <= y }
  Iterator(isSorted)
}.reduce(_ && _)
{code}

> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>Priority: Minor
>
> update
> ***
> It seems that in spark 2.1 code, the sorting issue is resolved.
> The sorter does consider inner sorting in the sorting key - but I think it 
> will be faster to just insert the rows to a list in a hash map.
> ***
> When using partition by, the data writer can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method when too many files are opened (configurable), it 
> starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows 
> again from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort if 
> you will).
> I think the best way to fix it is to stop using a sorter, and just put the 
> rows in a map using the partition key as the key and an arraylist as the 
> value, and then just walk through all the keys and write them in the original 
> order - this will probably be faster as there is no need for ordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17195) Dealing with JDBC column nullability when it is not reliable

2016-08-31 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453801#comment-15453801
 ] 

Jason Moore commented on SPARK-17195:
-

That's right, and I totally agree that's where the fix needs to be.  And I'm 
pressing them to make this fix.  I guess that means that this ticket can be 
closed, as it seems a reasonable workaround within Spark itself isn't possible. 
 Once the TD driver has been fixed I'll return here to mention the version it 
is fixed in.

> Dealing with JDBC column nullability when it is not reliable
> 
>
> Key: SPARK-17195
> URL: https://issues.apache.org/jira/browse/SPARK-17195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jason Moore
>
> Starting with Spark 2.0.0, the column "nullable" property is important to 
> have correct for the code generation to work properly.  Marking the column as 
> nullable = false used to (<2.0.0) allow null values to be operated on, but 
> now this will result in:
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> {noformat}
> I'm all for the change towards a more rigid behavior (enforcing correct 
> input).  But the problem I'm facing now is that when I used JDBC to read from 
> a Teradata server, the column nullability is often not correct (particularly 
> when sub-queries are involved).
> This is the line in question:
> https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L140
> I'm trying to work out what would be the way forward for me on this.  I know 
> that it's really the fault of the Teradata database server not returning the 
> correct schema, but I'll need to make Spark itself or my application 
> resilient to this behavior.
> One of the Teradata JDBC Driver tech leads has told me that "when the 
> rsmd.getSchemaName and rsmd.getTableName methods return an empty zero-length 
> string, then the other metadata values may not be completely accurate" - so 
> one option could be to treat the nullability (at least) the same way as the 
> "unknown" case (as nullable = true).  For reference, see the rest of our 
> discussion here: 
> http://forums.teradata.com/forum/connectivity/teradata-jdbc-driver-returns-the-wrong-schema-column-nullability
> Any other thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17195) Dealing with JDBC column nullability when it is not reliable

2016-08-23 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432314#comment-15432314
 ] 

Jason Moore commented on SPARK-17195:
-

That's right.

The JDBC API has ResultSetMetaData.isNullable returning:

* ResultSetMetaData.columnNoNulls (= 0) which means the column does not allow 
NULL values
* ResultSetMetaData.columnNullable (= 1) which means the column allows NULL 
values
* ResultSetMetaData.columnNullableUnknown (= 2) which means the nullability of 
a column's values is unknown

In Spark we take this result and do as you've described: if something is not 
non-null then it is nullable.  See the first link in the ticket description 
above.
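
In code terms, the mapping described above amounts to something like the 
following sketch; the real logic is at the JDBCRDD.scala line linked in the 
description, so treat this as a paraphrase rather than the exact source:

{code:scala}
import java.sql.ResultSetMetaData

// Paraphrased sketch: only an explicit columnNoNulls answer gives
// nullable = false; columnNullable and columnNullableUnknown both map to true.
def fieldIsNullable(rsmd: ResultSetMetaData, columnIndex: Int): Boolean =
  rsmd.isNullable(columnIndex) != ResultSetMetaData.columnNoNulls
{code}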

> Dealing with JDBC column nullability when it is not reliable
> 
>
> Key: SPARK-17195
> URL: https://issues.apache.org/jira/browse/SPARK-17195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jason Moore
>
> Starting with Spark 2.0.0, the column "nullable" property is important to 
> have correct for the code generation to work properly.  Marking the column as 
> nullable = false used to (<2.0.0) allow null values to be operated on, but 
> now this will result in:
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> {noformat}
> I'm all for the change towards a more rigid behavior (enforcing correct 
> input).  But the problem I'm facing now is that when I used JDBC to read from 
> a Teradata server, the column nullability is often not correct (particularly 
> when sub-queries are involved).
> This is the line in question:
> https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L140
> I'm trying to work out what would be the way forward for me on this.  I know 
> that it's really the fault of the Teradata database server not returning the 
> correct schema, but I'll need to make Spark itself or my application 
> resilient to this behavior.
> One of the Teradata JDBC Driver tech leads has told me that "when the 
> rsmd.getSchemaName and rsmd.getTableName methods return an empty zero-length 
> string, then the other metadata values may not be completely accurate" - so 
> one option could be to treat the nullability (at least) the same way as the 
> "unknown" case (as nullable = true).  For reference, see the rest of our 
> discussion here: 
> http://forums.teradata.com/forum/connectivity/teradata-jdbc-driver-returns-the-wrong-schema-column-nullability
> Any other thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17195) Dealing with JDBC column nullability when it is not reliable

2016-08-23 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432285#comment-15432285
 ] 

Jason Moore commented on SPARK-17195:
-

Correct; currently Spark doesn't allow us to override what the driver 
determines as the schema to apply.  It wasn't that I was able to control this 
in the past, but that everything worked fine regardless of the value of 
StructField.nullable.  But now in 2.0, it appears to me that the new code 
generation will break when processing a field (in my case a String) that is 
expected to not be null but is null.

It's definitely not Spark's fault (I hope I've made that clear enough) and it's 
either in the JDBC driver, or further downstream in the database server itself, 
that the blame lies.  Over on the Teradata forums (link in the description 
above) I'm trying to raise with their team the possibility that they could be 
marking the column as "columnNullableUnknown" which Spark will then map to 
nullable=true.  We'll see if they can manage to do that, and then I hope my 
problem will be solved.

> Dealing with JDBC column nullability when it is not reliable
> 
>
> Key: SPARK-17195
> URL: https://issues.apache.org/jira/browse/SPARK-17195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jason Moore
>
> Starting with Spark 2.0.0, the column "nullable" property is important to 
> have correct for the code generation to work properly.  Marking the column as 
> nullable = false used to (<2.0.0) allow null values to be operated on, but 
> now this will result in:
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> {noformat}
> I'm all for the change towards a more rigid behavior (enforcing correct 
> input).  But the problem I'm facing now is that when I used JDBC to read from 
> a Teradata server, the column nullability is often not correct (particularly 
> when sub-queries are involved).
> This is the line in question:
> https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L140
> I'm trying to work out what would be the way forward for me on this.  I know 
> that it's really the fault of the Teradata database server not returning the 
> correct schema, but I'll need to make Spark itself or my application 
> resilient to this behavior.
> One of the Teradata JDBC Driver tech leads has told me that "when the 
> rsmd.getSchemaName and rsmd.getTableName methods return an empty zero-length 
> string, then the other metadata values may not be completely accurate" - so 
> one option could be to treat the nullability (at least) the same way as the 
> "unknown" case (as nullable = true).  For reference, see the rest of our 
> discussion here: 
> http://forums.teradata.com/forum/connectivity/teradata-jdbc-driver-returns-the-wrong-schema-column-nullability
> Any other thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17195) Dealing with JDBC column nullability when it is not reliable

2016-08-22 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432129#comment-15432129
 ] 

Jason Moore commented on SPARK-17195:
-

> I think it might be sensible to support this by implementing 
> `SchemaRelationProvider`.

That was one of the thoughts I had too.

> it should be fixed in Teradata

I more or less agree with this, but I'm not certain their JDBC driver is being 
provided with everything they need from the server to decide that it should be 
"columnNullableUnknown".  Maybe I'll shoot some more questions their way on 
this.

> Dealing with JDBC column nullability when it is not reliable
> 
>
> Key: SPARK-17195
> URL: https://issues.apache.org/jira/browse/SPARK-17195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jason Moore
>
> Starting with Spark 2.0.0, the column "nullable" property is important to 
> have correct for the code generation to work properly.  Marking the column as 
> nullable = false used to (<2.0.0) allow null values to be operated on, but 
> now this will result in:
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> {noformat}
> I'm all for the change towards a more rigid behavior (enforcing correct 
> input).  But the problem I'm facing now is that when I used JDBC to read from 
> a Teradata server, the column nullability is often not correct (particularly 
> when sub-queries are involved).
> This is the line in question:
> https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L140
> I'm trying to work out what would be the way forward for me on this.  I know 
> that it's really the fault of the Teradata database server not returning the 
> correct schema, but I'll need to make Spark itself or my application 
> resilient to this behavior.
> One of the Teradata JDBC Driver tech leads has told me that "when the 
> rsmd.getSchemaName and rsmd.getTableName methods return an empty zero-length 
> string, then the other metadata values may not be completely accurate" - so 
> one option could be to treat the nullability (at least) the same way as the 
> "unknown" case (as nullable = true).  For reference, see the rest of our 
> discussion here: 
> http://forums.teradata.com/forum/connectivity/teradata-jdbc-driver-returns-the-wrong-schema-column-nullability
> Any other thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17195) Dealing with JDBC column nullability when it is not reliable

2016-08-22 Thread Jason Moore (JIRA)
Jason Moore created SPARK-17195:
---

 Summary: Dealing with JDBC column nullability when it is not 
reliable
 Key: SPARK-17195
 URL: https://issues.apache.org/jira/browse/SPARK-17195
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Jason Moore


Starting with Spark 2.0.0, the column "nullable" property is important to have 
correct for the code generation to work properly.  Marking the column as 
nullable = false used to (<2.0.0) allow null values to be operated on, but now 
this will result in:

{noformat}
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
{noformat}

I'm all for the change towards a more rigid behavior (enforcing correct 
input).  But the problem I'm facing now is that when I used JDBC to read from a 
Teradata server, the column nullability is often not correct (particularly when 
sub-queries are involved).

This is the line in question:
https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L140

I'm trying to work out what would be the way forward for me on this.  I know 
that it's really the fault of the Teradata database server not returning the 
correct schema, but I'll need to make Spark itself or my application resilient 
to this behavior.

One of the Teradata JDBC Driver tech leads has told me that "when the 
rsmd.getSchemaName and rsmd.getTableName methods return an empty zero-length 
string, then the other metadata values may not be completely accurate" - so one 
option could be to treat the nullability (at least) the same way as the 
"unknown" case (as nullable = true).  For reference, see the rest of our 
discussion here: 
http://forums.teradata.com/forum/connectivity/teradata-jdbc-driver-returns-the-wrong-schema-column-nullability

Any other thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17022) Potential deadlock in driver handling message

2016-08-12 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418470#comment-15418470
 ] 

Jason Moore commented on SPARK-17022:
-

This one is maybe related to SPARK-16533 and/or SPARK-16702, right?  My team 
works in an environment where preemption (and killing of executors) is a common 
occurrence, so we have been burnt a bit by this one.  We had been putting 
together a patch, but I'll see how this one holds up.

> Potential deadlock in driver handling message
> -
>
> Key: SPARK-17022
> URL: https://issues.apache.org/jira/browse/SPARK-17022
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 2.0.0
>Reporter: Tao Wang
>Assignee: Tao Wang
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Suppose t1 < t2 < t3.
> At t1, someone calls YarnSchedulerBackend.doRequestTotalExecutors from one 
> of three functions: CoarseGrainedSchedulerBackend.killExecutors, 
> CoarseGrainedSchedulerBackend.requestTotalExecutors or 
> CoarseGrainedSchedulerBackend.requestExecutors, all of which hold the 
> lock `CoarseGrainedSchedulerBackend`.
> Then YarnSchedulerBackend.doRequestTotalExecutors will send a 
> RequestExecutors message to `yarnSchedulerEndpoint` and wait for reply.
> At t2, someone sends a RemoveExecutor message to `yarnSchedulerEndpoint` and 
> the message is received by the endpoint.
> At t3, the RequestExecutors message sent at t1 is received by the endpoint.
> The endpoint would then handle the RemoveExecutor message first and the 
> RequestExecutors message after it.
> When handling RemoveExecutor, it would send the same message to 
> `driverEndpoint` and wait for a reply.
> In `driverEndpoint` it will request the lock `CoarseGrainedSchedulerBackend` 
> to handle that message, but the lock has been held since t1.
> So it would cause a deadlock.
> We have found the issue in our deployment; it would block the driver from 
> handling any messages until both messages timed out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete

2016-05-05 Thread Jason Moore (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Moore updated SPARK-14915:

Affects Version/s: 2.0.0
   1.5.3

> Tasks that fail due to CommitDeniedException (a side-effect of speculation) 
> can cause job to never complete
> ---
>
> Key: SPARK-14915
> URL: https://issues.apache.org/jira/browse/SPARK-14915
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.3, 1.6.2, 2.0.0
>Reporter: Jason Moore
>Assignee: Jason Moore
>Priority: Critical
> Fix For: 2.0.0
>
>
> In SPARK-14357, code was corrected towards the originally intended behavior 
> that a CommitDeniedException should not count towards the failure count for a 
> job.  After having run with this fix for a few weeks, it's become apparent 
> that this behavior has some unintended consequences - that a speculative task 
> will continuously receive a CDE from the driver, now causing it to fail and 
> retry over and over without limit.
> I'm thinking we could put a task that receives a CDE from the driver into a 
> TaskState.FINISHED or some other state to indicate that the task shouldn't 
> be resubmitted by the TaskScheduler. I'd probably need some opinions on 
> whether there are other consequences of doing something like this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete

2016-04-27 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261252#comment-15261252
 ] 

Jason Moore commented on SPARK-14915:
-

That's exactly my current thinking too.  But even if we keep allowing some 
tasks to be retried without limit in certain contexts (the current two I'm 
aware of are: commit denied on speculative tasks, or an executor lost because 
of a YARN de-allocation), it does seem that the commit denial often happens 
when another copy has already succeeded.  I'm about to do some testing on this 
now, not re-queuing in this scenario.

> Tasks that fail due to CommitDeniedException (a side-effect of speculation) 
> can cause job to never complete
> ---
>
> Key: SPARK-14915
> URL: https://issues.apache.org/jira/browse/SPARK-14915
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2
>Reporter: Jason Moore
>Priority: Critical
>
> In SPARK-14357, code was corrected towards the originally intended behavior 
> that a CommitDeniedException should not count towards the failure count for a 
> job.  After having run with this fix for a few weeks, it's become apparent 
> that this behavior has some unintended consequences - that a speculative task 
> will continuously receive a CDE from the driver, now causing it to fail and 
> retry over and over without limit.
> I'm thinking we could put a task that receives a CDE from the driver, into a 
> TaskState.FINISHED or some other state to indicate that the task shouldn't 
> be resubmitted by the TaskScheduler. I'd probably need some opinions on 
> whether there are other consequences for doing something like this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete

2016-04-27 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259920#comment-15259920
 ] 

Jason Moore commented on SPARK-14915:
-

Could I get thoughts on this: at 
[TaskSetManager.scala#L723|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L723]
 a call is made to addPendingTask after a task has failed.  I can think of a 
scenario in which it might be a good idea not to add the task back into the 
pending queue: when success(index) == true (which implies that another copy of 
the task has already succeeded).

I'm soon going to test adding that condition, as I think it's quite possibly 
what causes tasks to continually re-queue after a CDE until the stage has 
completed (further lengthening the stage, since those re-queued copies take up 
execution resources).
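
A minimal sketch of that guard, with a toy task set standing in for the real 
TaskSetManager (the names here, like successful and pendingTasks, are 
stand-ins, not the actual fields):

{code:scala}
// Toy task set (stand-in names, not the real TaskSetManager) showing the guard
// proposed around the addPendingTask call: skip re-queuing when another copy
// of the task has already succeeded.
import scala.collection.mutable

class ToyTaskSet(numTasks: Int) {
  private val successful = Array.fill(numTasks)(false)   // set once any copy succeeds
  val pendingTasks = mutable.Queue[Int]()

  def handleSuccessfulTask(index: Int): Unit = successful(index) = true

  def handleFailedTask(index: Int): Unit = {
    // Proposed condition: only put the task back on the queue if no copy has
    // succeeded yet; a denied speculative copy of a committed task is dropped.
    if (!successful(index)) pendingTasks.enqueue(index)
  }
}

object ToyTaskSetDemo {
  def main(args: Array[String]): Unit = {
    val ts = new ToyTaskSet(numTasks = 2)
    ts.handleSuccessfulTask(0)      // the original copy of task 0 commits
    ts.handleFailedTask(0)          // its speculative copy gets a CDE -> not re-queued
    ts.handleFailedTask(1)          // a genuine failure of task 1 -> re-queued as usual
    println(ts.pendingTasks)        // Queue(1)
  }
}
{code}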

> Tasks that fail due to CommitDeniedException (a side-effect of speculation) 
> can cause job to never complete
> ---
>
> Key: SPARK-14915
> URL: https://issues.apache.org/jira/browse/SPARK-14915
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2
>Reporter: Jason Moore
>Priority: Critical
>
> In SPARK-14357, code was corrected towards the originally intended behavior 
> that a CommitDeniedException should not count towards the failure count for a 
> job.  After having run with this fix for a few weeks, it's become apparent 
> that this behavior has some unintended consequences - that a speculative task 
> will continuously receive a CDE from the driver, now causing it to fail and 
> retry over and over without limit.
> I'm thinking we could put a task that receives a CDE from the driver, into a 
> TaskState.FINISHED or some other state to indicate that the task shouldn't 
> be resubmitted by the TaskScheduler. I'd probably need some opinions on 
> whether there are other consequences for doing something like this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete

2016-04-26 Thread Jason Moore (JIRA)
Jason Moore created SPARK-14915:
---

 Summary: Tasks that fail due to CommitDeniedException (a 
side-effect of speculation) can cause job to never complete
 Key: SPARK-14915
 URL: https://issues.apache.org/jira/browse/SPARK-14915
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.2
Reporter: Jason Moore
Priority: Critical


In SPARK-14357, code was corrected towards the originally intended behavior 
that a CommitDeniedException should not count towards the failure count for a 
job.  After having run with this fix for a few weeks, it's become apparent that 
this behavior has some unintended consequences - that a speculative task will 
continuously receive a CDE from the driver, now causing it to fail and retry 
over and over without limit.

I'm thinking we could put a task that receives a CDE from the driver, into a 
TaskState.FINISHED or some other state to indicate that the task shouldn't be 
resubmitted by the TaskScheduler. I'd probably need some opinions on whether 
there are other consequences for doing something like this.
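
To make the idea concrete, a rough sketch with toy types (not Spark's actual 
TaskState or scheduler API):

{code:scala}
// Toy types sketching the idea: report a commit-denied attempt in a terminal
// state so the scheduler treats it as done instead of as a failure to resubmit.
object TerminalStateSketch {
  sealed trait ToyTaskState
  case object Running  extends ToyTaskState
  case object Finished extends ToyTaskState   // terminal: the scheduler won't resubmit
  case object Failed   extends ToyTaskState   // ordinarily triggers a retry

  final case class StatusUpdate(taskId: Long, state: ToyTaskState)

  // What the executor might report when the driver denies the commit.
  def onCommitDenied(taskId: Long): StatusUpdate = StatusUpdate(taskId, Finished)

  def main(args: Array[String]): Unit =
    println(onCommitDenied(42L))              // StatusUpdate(42,Finished)
}
{code}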



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14357) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job failure

2016-04-10 Thread Jason Moore (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Moore resolved SPARK-14357.
-
Resolution: Fixed

Issue resolved by pull request 12228
https://github.com/apache/spark/pull/12228

> Tasks that fail due to CommitDeniedException (a side-effect of speculation) 
> can cause job failure
> -
>
> Key: SPARK-14357
> URL: https://issues.apache.org/jira/browse/SPARK-14357
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.0, 1.6.1
>Reporter: Jason Moore
>Assignee: Jason Moore
>Priority: Critical
> Fix For: 1.6.2, 2.0.0, 1.5.2
>
>
> Speculation can often result in a CommitDeniedException, but ideally this 
> shouldn't result in the job failing.  So changes were made along with 
> SPARK-8167 to ensure that the CommitDeniedException is caught and given a 
> failure reason that doesn't increment the failure count.
> However, I'm still noticing that this exception is causing jobs to fail using 
> the 1.6.1 release version.
> {noformat}
> 16/04/04 11:36:02 ERROR InsertIntoHadoopFsRelation: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in 
> stage 315.0 failed 8 times, most recent failure: Lost task 18.8 in stage 
> 315.0 (TID 100793, qaphdd099.quantium.com.au.local): 
> org.apache.spark.SparkException: Task failed while writing rows.
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:272)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Failed to commit task
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.commitTask$1(WriterContainer.scala:287)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:267)
> ... 8 more
> Caused by: org.apache.spark.executor.CommitDeniedException: 
> attempt_201604041136_0315_m_18_8: Not committed because the driver did 
> not authorize commit
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:135)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitTask(WriterContainer.scala:219)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.commitTask$1(WriterContainer.scala:282)
> ... 9 more
> {noformat}
> It seems to me that the CommitDeniedException gets wrapped into a 
> RuntimeException at 
> [WriterContainer.scala#L286|https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L286]
>  and then into a SparkException at 
> [InsertIntoHadoopFsRelation.scala#L154|https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L154]
>  which prevents it from being handled properly at 
> [Executor.scala#L290|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/executor/Executor.scala#L290]
> Should the fix be for this catch block to type-match on the innermost cause 
> of the error?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14357) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job failure

2016-04-03 Thread Jason Moore (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Moore updated SPARK-14357:

Component/s: Spark Core

> Tasks that fail due to CommitDeniedException (a side-effect of speculation) 
> can cause job failure
> -
>
> Key: SPARK-14357
> URL: https://issues.apache.org/jira/browse/SPARK-14357
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.0, 1.6.1
>Reporter: Jason Moore
>Priority: Critical
>
> Speculation can often result in a CommitDeniedException, but ideally this 
> shouldn't result in the job failing.  So changes were made along with 
> SPARK-8167 to ensure that the CommitDeniedException is caught and given a 
> failure reason that doesn't increment the failure count.
> However, I'm still noticing that this exception is causing jobs to fail using 
> the 1.6.1 release version.
> {noformat}
> 16/04/04 11:36:02 ERROR InsertIntoHadoopFsRelation: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in 
> stage 315.0 failed 8 times, most recent failure: Lost task 18.8 in stage 
> 315.0 (TID 100793, qaphdd099.quantium.com.au.local): 
> org.apache.spark.SparkException: Task failed while writing rows.
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:272)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Failed to commit task
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.commitTask$1(WriterContainer.scala:287)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:267)
> ... 8 more
> Caused by: org.apache.spark.executor.CommitDeniedException: 
> attempt_201604041136_0315_m_18_8: Not committed because the driver did 
> not authorize commit
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:135)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitTask(WriterContainer.scala:219)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.commitTask$1(WriterContainer.scala:282)
> ... 9 more
> {noformat}
> It seems to me that the CommitDeniedException gets wrapped into a 
> RuntimeException at 
> [WriterContainer.scala#L286|https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L286]
>  and then into a SparkException at 
> [InsertIntoHadoopFsRelation.scala#L154|https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L154]
>  which prevents it from being handled properly at 
> [Executor.scala#L290|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/executor/Executor.scala#L290]
> Should the fix be for this catch block to type-match on the innermost cause 
> of the error?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14357) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job failure

2016-04-03 Thread Jason Moore (JIRA)
Jason Moore created SPARK-14357:
---

 Summary: Tasks that fail due to CommitDeniedException (a 
side-effect of speculation) can cause job failure
 Key: SPARK-14357
 URL: https://issues.apache.org/jira/browse/SPARK-14357
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1, 1.6.0, 1.5.2
Reporter: Jason Moore
Priority: Critical


Speculation can often result in a CommitDeniedException, but ideally this 
shouldn't result in the job failing.  So changes were made along with 
SPARK-8167 to ensure that the CommitDeniedException is caught and given a 
failure reason that doesn't increment the failure count.

However, I'm still noticing that this exception is causing jobs to fail using 
the 1.6.1 release version.

{noformat}
16/04/04 11:36:02 ERROR InsertIntoHadoopFsRelation: Aborting job.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in 
stage 315.0 failed 8 times, most recent failure: Lost task 18.8 in stage 315.0 
(TID 100793, qaphdd099.quantium.com.au.local): org.apache.spark.SparkException: 
Task failed while writing rows.
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:272)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Failed to commit task
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.commitTask$1(WriterContainer.scala:287)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:267)
... 8 more
Caused by: org.apache.spark.executor.CommitDeniedException: 
attempt_201604041136_0315_m_18_8: Not committed because the driver did not 
authorize commit
at 
org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:135)
at 
org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitTask(WriterContainer.scala:219)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.commitTask$1(WriterContainer.scala:282)
... 9 more
{noformat}

It seems to me that the CommitDeniedException gets wrapped into a 
RuntimeException at 
[WriterContainer.scala#L286|https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L286]
 and then into a SparkException at 
[InsertIntoHadoopFsRelation.scala#L154|https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L154]
 which prevents it from being handled properly at 
[Executor.scala#L290|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/executor/Executor.scala#L290]

Should the fix be for this catch block to type-match on the innermost cause of 
the error?
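
To illustrate, a sketch of that approach with a hypothetical helper (and a 
stand-in exception class, since this isn't existing Spark code):

{code:scala}
// Hypothetical helper: walk the getCause chain and classify by the inner-most
// exception, so a CommitDeniedException is still recognised after being
// wrapped in a RuntimeException and then a SparkException.
import scala.annotation.tailrec

object RootCauseSketch {
  @tailrec
  def rootCause(t: Throwable): Throwable =
    if (t.getCause == null || (t.getCause eq t)) t else rootCause(t.getCause)

  // Stand-in for org.apache.spark.executor.CommitDeniedException.
  final class FakeCommitDeniedException(msg: String) extends Exception(msg)

  def main(args: Array[String]): Unit = {
    val wrapped = new RuntimeException("Task failed while writing rows",
      new RuntimeException("Failed to commit task",
        new FakeCommitDeniedException("driver did not authorize commit")))

    rootCause(wrapped) match {
      case _: FakeCommitDeniedException =>
        println("commit denied: don't count this attempt toward the failure limit")
      case other =>
        println(s"ordinary failure: ${other.getMessage}")
    }
  }
}
{code}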



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8892) Column.cast(LongType) does not work for large values

2015-07-08 Thread Jason Moore (JIRA)
Jason Moore created SPARK-8892:
--

 Summary: Column.cast(LongType) does not work for large values
 Key: SPARK-8892
 URL: https://issues.apache.org/jira/browse/SPARK-8892
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Jason Moore


It seems that casting a column from String to Long goes through an 
intermediate cast to Double (hitting Cast.scala line 328 in castToDecimal).  
The result is that, for large values, the wrong value is returned.

The following test reveals the bug:

{code:java}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FlatSpec

import scala.util.Random

class DataFrameCastBug extends FlatSpec {

  "DataFrame" should "cast StringType to LongType correctly" in {
    val sc = new SparkContext(
      new SparkConf().setMaster("local").setAppName("app"))
    val qc = new SQLContext(sc)

    val values = Seq.fill(10)(Random.nextLong)
    val source = qc.createDataFrame(
      sc.parallelize(values.map(v => Row(v))),
      StructType(Seq(StructField("value", LongType))))

    val result = source.select(source("value"),
      source("value").cast(StringType).cast(LongType).as("castValue"))

    assert(result.where(result("value") !== result("castValue")).count() === 0)
  }
}
{code}
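
The underlying precision loss can be reproduced in plain Scala, without Spark 
at all (illustrative value):

{code:scala}
// A Double has a 53-bit mantissa, so longs above 2^53 get rounded when a
// String -> Long cast detours through Double.
object DoubleRoundTripSketch {
  def main(args: Array[String]): Unit = {
    val big: Long = 1234567890123456789L          // illustrative value > 2^53
    val viaDouble = big.toString.toDouble.toLong  // what the intermediate Double cast does
    println(viaDouble)         // 1234567890123456768 -- off by 21
    println(viaDouble == big)  // false
  }
}
{code}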



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org