[jira] [Resolved] (SPARK-26807) Confusing documentation regarding installation from PyPi

2019-03-01 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26807.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23933
[https://github.com/apache/spark/pull/23933]

> Confusing documentation regarding installation from PyPi
> 
>
> Key: SPARK-26807
> URL: https://issues.apache.org/jira/browse/SPARK-26807
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Emmanuel Arias
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Hello!
> I am new to Spark. Reading the documentation, I found the Downloading section 
> a little confusing.
> [https://spark.apache.org/docs/latest/#downloading|https://spark.apache.org/docs/latest/#downloading]
>  says: "Scala and Java users can include Spark in their projects using its 
> Maven coordinates and in the future Python users can also install Spark from 
> PyPI.", which I read as saying that Spark is not on PyPI yet. But 
> [https://spark.apache.org/downloads.html] says: 
> "[PySpark|https://pypi.python.org/pypi/pyspark] is now available in pypi. To 
> install just run {{pip install pyspark}}."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26807) Confusing documentation regarding installation from PyPi

2019-03-01 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26807:


Assignee: Sean Owen

> Confusing documentation regarding installation from PyPi
> 
>
> Key: SPARK-26807
> URL: https://issues.apache.org/jira/browse/SPARK-26807
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Emmanuel Arias
>Assignee: Sean Owen
>Priority: Trivial
>
> Hello!
> I am new to Spark. Reading the documentation, I found the Downloading section 
> a little confusing.
> [https://spark.apache.org/docs/latest/#downloading|https://spark.apache.org/docs/latest/#downloading]
>  says: "Scala and Java users can include Spark in their projects using its 
> Maven coordinates and in the future Python users can also install Spark from 
> PyPI.", which I read as saying that Spark is not on PyPI yet. But 
> [https://spark.apache.org/downloads.html] says: 
> "[PySpark|https://pypi.python.org/pypi/pyspark] is now available in pypi. To 
> install just run {{pip install pyspark}}."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26867) Spark Support of YARN Placement Constraint

2019-03-01 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782270#comment-16782270
 ] 

Prabhu Joseph edited comment on SPARK-26867 at 3/2/19 3:52 AM:
---

[~srowen] Spark can allow users to configure the Placement Constraint so that 
users have more control over where the executors get placed. For 
example:

1. A Spark job wants to run on machines where the Python version is x or the Java 
version is y (Node Attributes)
2. A Spark job needs / does not need executors to be placed on machines where an 
HBase RegionServer, ZooKeeper, or any other service is running (Affinity / 
Anti-Affinity)
3. A Spark job wants no more than 2 of its executors on the same node (Cardinality)
4. Spark Job A's executors want / do not want to run where Spark Job B's (or any 
other job's) containers run (Application_Tag Namespace)




was (Author: prabhu joseph):
Spark can allow users to configure the Placement Constraint so that users 
have more control over where the executors get placed. For example:

1. A Spark job wants to run on machines where the Python version is x or the Java 
version is y (Node Attributes)
2. A Spark job needs / does not need executors to be placed on machines where an 
HBase RegionServer, ZooKeeper, or any other service is running (Affinity / 
Anti-Affinity)
3. A Spark job wants no more than 2 of its executors on the same node (Cardinality)
4. Spark Job A's executors want / do not want to run where Spark Job B's (or any 
other job's) containers run (Application_Tag Namespace)



> Spark Support of YARN Placement Constraint
> --
>
> Key: SPARK-26867
> URL: https://issues.apache.org/jira/browse/SPARK-26867
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, YARN
>Affects Versions: 3.0.0
>Reporter: Prabhu Joseph
>Priority: Major
>
> YARN provides Placement Constraint Features - where application can request 
> containers based on affinity / anti-affinity / cardinality to services or 
> other application containers / node attributes. This is a useful feature for 
> Spark Jobs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26867) Spark Support of YARN Placement Constraint

2019-03-01 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782270#comment-16782270
 ] 

Prabhu Joseph commented on SPARK-26867:
---

Spark can allow users to configure the Placement Constraint so that users 
have more control over where the executors get placed. For example:

1. A Spark job wants to run on machines where the Python version is x or the Java 
version is y (Node Attributes)
2. A Spark job needs / does not need executors to be placed on machines where an 
HBase RegionServer, ZooKeeper, or any other service is running (Affinity / 
Anti-Affinity)
3. A Spark job wants no more than 2 of its executors on the same node (Cardinality)
4. Spark Job A's executors want / do not want to run where Spark Job B's (or any 
other job's) containers run (Application_Tag Namespace)
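
A purely hypothetical sketch of what such user-facing knobs could look like; none of the configuration keys below exist in Spark today, they only illustrate the four kinds of constraints listed above.

{code:scala}
// Hypothetical sketch only: these keys are NOT real Spark configuration
// options; they illustrate node attributes, affinity/anti-affinity,
// cardinality, and application-tag constraints from the comment above.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Node attributes: only place executors on nodes advertising python=3
  .set("spark.yarn.executor.placementConstraint.nodeAttributes", "python=3")
  // Anti-affinity: avoid nodes running an HBase RegionServer
  .set("spark.yarn.executor.placementConstraint.antiAffinity", "hbase-regionserver")
  // Cardinality: at most 2 executors of this job per node
  .set("spark.yarn.executor.placementConstraint.maxPerNode", "2")
  // Application-tag namespace: avoid containers of applications tagged "etl"
  .set("spark.yarn.executor.placementConstraint.notIn", "app-tag/etl")
{code}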



> Spark Support of YARN Placement Constraint
> --
>
> Key: SPARK-26867
> URL: https://issues.apache.org/jira/browse/SPARK-26867
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, YARN
>Affects Versions: 3.0.0
>Reporter: Prabhu Joseph
>Priority: Major
>
> YARN provides Placement Constraint Features - where application can request 
> containers based on affinity / anti-affinity / cardinality to services or 
> other application containers / node attributes. This is a useful feature for 
> Spark Jobs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26982) Enhance describe framework to describe the output of a query

2019-03-01 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26982:
---

Assignee: Dilip Biswal

> Enhance describe framework to describe the output of a query
> 
>
> Key: SPARK-26982
> URL: https://issues.apache.org/jira/browse/SPARK-26982
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
>
> Currently we can use `df.printSchema` to discover the schema information for 
> a query. We should also have a way to describe the output schema of a query 
> using the SQL interface. 
>  
> Example:
> DESCRIBE SELECT * FROM desc_table
> DESCRIBE QUERY SELECT * FROM desc_table



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26982) Enhance describe framework to describe the output of a query

2019-03-01 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26982.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23883
[https://github.com/apache/spark/pull/23883]

> Enhance describe framework to describe the output of a query
> 
>
> Key: SPARK-26982
> URL: https://issues.apache.org/jira/browse/SPARK-26982
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently we can use `df.printSchema` to discover the schema information for 
> a query. We should also have a way to describe the output schema of a query 
> using the SQL interface. 
>  
> Example:
> DESCRIBE SELECT * FROM desc_table
> DESCRIBE QUERY SELECT * FROM desc_table
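
A minimal sketch (assuming Spark 3.0.0+, where this change lands) contrasting the programmatic and SQL ways of inspecting a query's output schema; the temp view and data are illustrative.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("describe-query-demo").getOrCreate()
spark.range(3).selectExpr("id", "id * 2 AS doubled").createOrReplaceTempView("desc_table")

// Programmatic: print the schema of the underlying DataFrame
spark.table("desc_table").printSchema()

// SQL interface: describe the output of an arbitrary query
spark.sql("DESCRIBE QUERY SELECT * FROM desc_table").show()
{code}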



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26492) support streaming DecisionTreeRegressor

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26492.
---
Resolution: Won't Fix

That's no extra info, and not a valid reason to reopen this.

> support streaming DecisionTreeRegressor
> ---
>
> Key: SPARK-26492
> URL: https://issues.apache.org/jira/browse/SPARK-26492
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: sky54521
>Priority: Major
>  Labels: DecisionTreeRegressor
>
> hope to support streaming DecisionTreeRegressor as soon as possible



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26555) Thread safety issue causes createDataset to fail with misleading errors

2019-03-01 Thread Martin Loncaric (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782258#comment-16782258
 ] 

Martin Loncaric commented on SPARK-26555:
-

I can also replicate with different schemas containing Option.

When I remove all Option columns from the schema, the sporadic failure goes 
away. This also never happens when I remove the concurrency.

> Thread safety issue causes createDataset to fail with misleading errors
> ---
>
> Key: SPARK-26555
> URL: https://issues.apache.org/jira/browse/SPARK-26555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
>
> This can be replicated (~2% of the time) with
> {code:scala}
> import java.sql.Timestamp
> import java.util.concurrent.{Executors, Future}
> import org.apache.spark.sql.SparkSession
> import scala.collection.mutable.ListBuffer
> import scala.concurrent.ExecutionContext
> import scala.util.Random
> object Main {
>   def main(args: Array[String]): Unit = {
> val sparkSession = SparkSession.builder
>   .getOrCreate()
> import sparkSession.implicits._
> val executor = Executors.newFixedThreadPool(1)
> try {
>   implicit val xc: ExecutionContext = 
> ExecutionContext.fromExecutorService(executor)
>   val futures = new ListBuffer[Future[_]]()
>   for (i <- 1 to 3) {
> futures += executor.submit(new Runnable {
>   override def run(): Unit = {
> val d = if (Random.nextInt(2) == 0) Some("d value") else None
> val e = if (Random.nextInt(2) == 0) Some(5.0) else None
> val f = if (Random.nextInt(2) == 0) Some(6.0) else None
> println("DEBUG", d, e, f)
> sparkSession.createDataset(Seq(
>   MyClass(new Timestamp(1L), "b", "c", d, e, f)
> ))
>   }
> })
>   }
>   futures.foreach(_.get())
> } finally {
>   println("SHUTDOWN")
>   executor.shutdown()
>   sparkSession.stop()
> }
>   }
>   case class MyClass(
> a: Timestamp,
> b: String,
> c: String,
> d: Option[String],
> e: Option[Double],
> f: Option[Double]
>   )
> }
> {code}
> So it will usually come up during
> {code:bash}
> for i in $(seq 1 200); do
>   echo $i
>   spark-submit --master local[4] target/scala-2.11/spark-test_2.11-0.1.jar
> done
> {code}
> causing a variety of possible errors, such as
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:210){code}
> or
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789){code}
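
A possible mitigation sketch, not a fix confirmed in this thread: derive the encoder once on the driver thread so the Scala-reflection work is not performed concurrently inside the submitted tasks. {{MyClass}} refers to the case class in the reproducer above.

{code:scala}
// Sketch of a workaround under the assumption that the failure comes from
// concurrent Scala reflection while deriving the implicit encoder.
import org.apache.spark.sql.{Encoder, Encoders}

val myClassEncoder: Encoder[MyClass] = Encoders.product[MyClass]  // derived once, up front

// later, inside each Runnable, pass the pre-built encoder explicitly:
// sparkSession.createDataset(Seq(MyClass(new Timestamp(1L), "b", "c", d, e, f)))(myClassEncoder)
{code}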



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-26492) support streaming DecisionTreeRegressor

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-26492.
-

> support streaming DecisionTreeRegressor
> ---
>
> Key: SPARK-26492
> URL: https://issues.apache.org/jira/browse/SPARK-26492
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: sky54521
>Priority: Major
>  Labels: DecisionTreeRegressor
>
> hope to support streaming DecisionTreeRegressor as soon as possible



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26492) support streaming DecisionTreeRegressor

2019-03-01 Thread sky54521 (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782251#comment-16782251
 ] 

sky54521 commented on SPARK-26492:
--

If this cannot be implemented, then Spark streaming is weak.

> support streaming DecisionTreeRegressor
> ---
>
> Key: SPARK-26492
> URL: https://issues.apache.org/jira/browse/SPARK-26492
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: sky54521
>Priority: Major
>  Labels: DecisionTreeRegressor
>
> hope to support streaming DecisionTreeRegressor as soon as possible



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26492) support streaming DecisionTreeRegressor

2019-03-01 Thread sky54521 (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782250#comment-16782250
 ] 

sky54521 edited comment on SPARK-26492 at 3/2/19 1:54 AM:
--

I don't know the concrete way to implement it, but I think it can be implemented [~srowen]


was (Author: sky54521):
I don't know the concrete way to implement it, but I think it can be implemented

> support streaming DecisionTreeRegressor
> ---
>
> Key: SPARK-26492
> URL: https://issues.apache.org/jira/browse/SPARK-26492
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: sky54521
>Priority: Major
>  Labels: DecisionTreeRegressor
>
> hope to support streaming DecisionTreeRegressor as soon as possible



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-26492) support streaming DecisionTreeRegressor

2019-03-01 Thread sky54521 (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sky54521 reopened SPARK-26492:
--

I don't know the concrete way to implement it, but I think it can be implemented

> support streaming DecisionTreeRegressor
> ---
>
> Key: SPARK-26492
> URL: https://issues.apache.org/jira/browse/SPARK-26492
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: sky54521
>Priority: Major
>  Labels: DecisionTreeRegressor
>
> hope to support streaming DecisionTreeRegressor as soon as possible



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26555) Thread safety issue causes createDataset to fail with misleading errors

2019-03-01 Thread Martin Loncaric (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782239#comment-16782239
 ] 

Martin Loncaric edited comment on SPARK-26555 at 3/2/19 1:25 AM:
-

I was able to replicate with both all rows in all optional columns as `Some()` 
and all rows in all optional columns as `None`.


was (Author: mwlon):
I was able to replicate with both all rows as `Some()` and all rows as `None`.

> Thread safety issue causes createDataset to fail with misleading errors
> ---
>
> Key: SPARK-26555
> URL: https://issues.apache.org/jira/browse/SPARK-26555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
>
> This can be replicated (~2% of the time) with
> {code:scala}
> import java.sql.Timestamp
> import java.util.concurrent.{Executors, Future}
> import org.apache.spark.sql.SparkSession
> import scala.collection.mutable.ListBuffer
> import scala.concurrent.ExecutionContext
> import scala.util.Random
> object Main {
>   def main(args: Array[String]): Unit = {
> val sparkSession = SparkSession.builder
>   .getOrCreate()
> import sparkSession.implicits._
> val executor = Executors.newFixedThreadPool(1)
> try {
>   implicit val xc: ExecutionContext = 
> ExecutionContext.fromExecutorService(executor)
>   val futures = new ListBuffer[Future[_]]()
>   for (i <- 1 to 3) {
> futures += executor.submit(new Runnable {
>   override def run(): Unit = {
> val d = if (Random.nextInt(2) == 0) Some("d value") else None
> val e = if (Random.nextInt(2) == 0) Some(5.0) else None
> val f = if (Random.nextInt(2) == 0) Some(6.0) else None
> println("DEBUG", d, e, f)
> sparkSession.createDataset(Seq(
>   MyClass(new Timestamp(1L), "b", "c", d, e, f)
> ))
>   }
> })
>   }
>   futures.foreach(_.get())
> } finally {
>   println("SHUTDOWN")
>   executor.shutdown()
>   sparkSession.stop()
> }
>   }
>   case class MyClass(
> a: Timestamp,
> b: String,
> c: String,
> d: Option[String],
> e: Option[Double],
> f: Option[Double]
>   )
> }
> {code}
> So it will usually come up during
> {code:bash}
> for i in $(seq 1 200); do
>   echo $i
>   spark-submit --master local[4] target/scala-2.11/spark-test_2.11-0.1.jar
> done
> {code}
> causing a variety of possible errors, such as
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:210){code}
> or
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26555) Thread safety issue causes createDataset to fail with misleading errors

2019-03-01 Thread Martin Loncaric (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782239#comment-16782239
 ] 

Martin Loncaric commented on SPARK-26555:
-

I was able to replicate with both all rows as `Some()` and all rows as `None`.

> Thread safety issue causes createDataset to fail with misleading errors
> ---
>
> Key: SPARK-26555
> URL: https://issues.apache.org/jira/browse/SPARK-26555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
>
> This can be replicated (~2% of the time) with
> {code:scala}
> import java.sql.Timestamp
> import java.util.concurrent.{Executors, Future}
> import org.apache.spark.sql.SparkSession
> import scala.collection.mutable.ListBuffer
> import scala.concurrent.ExecutionContext
> import scala.util.Random
> object Main {
>   def main(args: Array[String]): Unit = {
> val sparkSession = SparkSession.builder
>   .getOrCreate()
> import sparkSession.implicits._
> val executor = Executors.newFixedThreadPool(1)
> try {
>   implicit val xc: ExecutionContext = 
> ExecutionContext.fromExecutorService(executor)
>   val futures = new ListBuffer[Future[_]]()
>   for (i <- 1 to 3) {
> futures += executor.submit(new Runnable {
>   override def run(): Unit = {
> val d = if (Random.nextInt(2) == 0) Some("d value") else None
> val e = if (Random.nextInt(2) == 0) Some(5.0) else None
> val f = if (Random.nextInt(2) == 0) Some(6.0) else None
> println("DEBUG", d, e, f)
> sparkSession.createDataset(Seq(
>   MyClass(new Timestamp(1L), "b", "c", d, e, f)
> ))
>   }
> })
>   }
>   futures.foreach(_.get())
> } finally {
>   println("SHUTDOWN")
>   executor.shutdown()
>   sparkSession.stop()
> }
>   }
>   case class MyClass(
> a: Timestamp,
> b: String,
> c: String,
> d: Option[String],
> e: Option[Double],
> f: Option[Double]
>   )
> }
> {code}
> So it will usually come up during
> {code:bash}
> for i in $(seq 1 200); do
>   echo $i
>   spark-submit --master local[4] target/scala-2.11/spark-test_2.11-0.1.jar
> done
> {code}
> causing a variety of possible errors, such as
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:210){code}
> or
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24130) Data Source V2: Join Push Down

2019-03-01 Thread William Wong (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782237#comment-16782237
 ] 

William Wong commented on SPARK-24130:
--

https://github.com/apache/spark/pull/22547 
It seems that the PR was closed already. 

> Data Source V2: Join Push Down
> --
>
> Key: SPARK-24130
> URL: https://issues.apache.org/jira/browse/SPARK-24130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jia Li
>Priority: Major
> Attachments: Data Source V2 Join Push Down.pdf
>
>
> Spark applications often directly query external data sources such as 
> relational databases, or files. Spark provides Data Sources APIs for 
> accessing structured data through Spark SQL. Data Sources APIs in both V1 and 
> V2 support optimizations such as Filter push down and Column pruning, which 
> are a subset of the functionality that can be pushed down to some data sources. 
> We’re proposing to extend Data Sources APIs with join push down (JPD). Join 
> push down significantly improves query performance by reducing the amount of 
> data transfer and exploiting the capabilities of the data sources such as 
> index access.
> Join push down design document is available 
> [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27029) Update Thrift to 0.12.0

2019-03-01 Thread Sean Owen (JIRA)
Sean Owen created SPARK-27029:
-

 Summary: Update Thrift to 0.12.0
 Key: SPARK-27029
 URL: https://issues.apache.org/jira/browse/SPARK-27029
 Project: Spark
  Issue Type: Task
  Components: Spark Core, SQL
Affects Versions: 3.0.0
Reporter: Sean Owen
Assignee: Sean Owen


We should update to Thrift 0.12.0 to pick up security and bug fixes. It appears 
to be compatible with the current build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27028) PySpark read .dat file. Multiline issue

2019-03-01 Thread alokchowdary (JIRA)
alokchowdary created SPARK-27028:


 Summary: PySpark read .dat file. Multiline issue
 Key: SPARK-27028
 URL: https://issues.apache.org/jira/browse/SPARK-27028
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 2.4.0
 Environment: Pyspark(2.4) in AWS EMR
Reporter: alokchowdary


* I am trying to read a .dat file using the PySpark CSV reader, and the data 
contains newline characters ("\n") as part of some field values. Spark is unable 
to read such a value as a single column and instead treats the continuation as a 
new row. I tried using the "multiLine" option while reading, but it still does 
not work.

 * {{spark.read.csv(file_path, schema=schema, sep=delimiter, multiLine=True)}}

 * The data looks like the lines below; every line is currently read as a 
separate row in the dataframe.

 * Here '\x01' is the actual delimiter (a comma is used instead for ease of 
reading).

{{1. name,test,12345,}}
{{2. x, }}
{{3. desc }}
{{4. name2,test2,12345 }}
{{5. ,y}}
{{6. ,desc2}}

 * So PySpark treats "x" and "desc" as new rows in the dataframe, with nulls for 
the other columns.

How can such data be read in PySpark?
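
A hedged sketch in the Scala API (the report uses PySpark, but the reader options have the same names): multiLine only applies when the embedded newlines sit inside quoted fields, so whether this helps depends on the file actually quoting such values. {{schema}} and {{filePath}} are assumed from the report.

{code:scala}
// Sketch only: "\u0001" stands for the '\x01' delimiter mentioned above.
val df = spark.read
  .schema(schema)
  .option("sep", "\u0001")
  .option("multiLine", "true")
  .option("quote", "\"")   // embedded "\n" must be inside quoted fields for multiLine to apply
  .csv(filePath)
{code}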



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly

2019-03-01 Thread Hien Luu (JIRA)
Hien Luu created SPARK-27027:


 Summary: from_avro function does not deserialize the Avro record 
of a struct column type correctly
 Key: SPARK-27027
 URL: https://issues.apache.org/jira/browse/SPARK-27027
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Hien Luu


The from_avro function produces wrong output for a struct column. See the output 
at the bottom of the description.

=

import org.apache.spark.sql.types._
import org.apache.spark.sql.avro._
import org.apache.spark.sql.functions._


spark.version

val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", 
50)).toDF("id", "name", "age")

val dfStruct = df.withColumn("value", struct("name","age"))

dfStruct.show
dfStruct.printSchema

val dfKV = dfStruct.select(to_avro('id).as("key"), to_avro('value).as("value"))

val expectedSchema = StructType(Seq(StructField("name", StringType, 
true),StructField("age", IntegerType, false)))

val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString

val avroTypeStr = s"""
 |{
 | "type": "int",
 | "name": "key"
 |}
 """.stripMargin


dfKV.select(from_avro('key, avroTypeStr)).show
dfKV.select(from_avro('value, avroTypeStruct)).show

// output of the last statement, which is not correct (every row shows the last record)
+------------------------+
|from_avro(value, struct)|
+------------------------+
|         [Josh Duke, 50]|
|         [Josh Duke, 50]|
|         [Josh Duke, 50]|
+------------------------+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26977) Warn against subclassing scala.App doesn't work

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26977:
-

Assignee: Manu Zhang

> Warn against subclassing scala.App doesn't work
> ---
>
> Key: SPARK-26977
> URL: https://issues.apache.org/jira/browse/SPARK-26977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
>
> As per discussion in 
> [PR#3497|https://github.com/apache/spark/pull/3497#discussion_r258412735], 
> the warning against subclassing scala.App doesn't work. For example,
> {code:scala}
> object Test extends scala.App {
>// spark code
> }
> {code}
> Scala will compile {{object Test}} into two Java classes: {{Test}}, which is 
> what the user passes in, and {{Test$}}, which is the one subclassing 
> {{scala.App}}. The current code checks against {{Test}}, and thus no warning 
> is emitted when the user's application subclasses {{scala.App}}.
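
A hedged sketch of the kind of check the fix needs; the helper below is illustrative, not the actual SparkSubmit code.

{code:scala}
// Illustrative helper, not Spark's implementation: the scala.App mixin ends up
// on the companion module class (Test$), so both classes have to be checked.
def extendsScalaApp(mainClass: String, loader: ClassLoader): Boolean = {
  def isScalaApp(name: String): Boolean =
    try classOf[scala.App].isAssignableFrom(Class.forName(name, false, loader))
    catch { case _: ClassNotFoundException => false }
  isScalaApp(mainClass) || isScalaApp(mainClass + "$")
}
{code}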



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26977) Warn against subclassing scala.App doesn't work

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26977.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23903
[https://github.com/apache/spark/pull/23903]

> Warn against subclassing scala.App doesn't work
> ---
>
> Key: SPARK-26977
> URL: https://issues.apache.org/jira/browse/SPARK-26977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>
> As per discussion in 
> [PR#3497|https://github.com/apache/spark/pull/3497#discussion_r258412735], 
> the warning against subclassing scala.App doesn't work. For example,
> {code:scala}
> object Test extends scala.App {
>// spark code
> }
> {code}
> Scala will compile {{object Test}} into two Java classes: {{Test}}, which is 
> what the user passes in, and {{Test$}}, which is the one subclassing 
> {{scala.App}}. The current code checks against {{Test}}, and thus no warning 
> is emitted when the user's application subclasses {{scala.App}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26387) Parallelism seems to cause difference in CrossValidation model metrics

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26387.
---
Resolution: Not A Problem

It shouldn't have any effect. But, you might get different results on different 
runs if you don't fix a seed for k-fold cross validation. Reopen if that's not 
it, and you can maybe show a reproducer vs 2.4 or master.
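
A minimal sketch of pinning the seed, assuming an existing estimator, evaluator, and param grid; with a fixed seed the k-fold splits are identical across runs regardless of the parallelism setting.

{code:scala}
import org.apache.spark.ml.tuning.CrossValidator

// `estimator`, `evaluator`, and `paramGrid` are assumed to already exist.
val cv = new CrossValidator()
  .setEstimator(estimator)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)
  .setParallelism(4) // evaluates models concurrently; should not change metrics
  .setSeed(42L)      // fixes the k-fold split across runs
{code}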

> Parallelism seems to cause difference in CrossValidation model metrics
> --
>
> Key: SPARK-26387
> URL: https://issues.apache.org/jira/browse/SPARK-26387
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Evan Zamir
>Priority: Major
>
> I can only reproduce this issue when running Spark on different Amazon EMR 
> versions, but it seems that between Spark 2.3.1 and 2.3.2 (corresponding to 
> EMR versions 5.17/5.18) the presence of the parallelism parameter was causing 
> AUC metric to increase. Literally, I run the same exact code with and without 
> parallelism and the AUC of my models (logistic regression) are changing 
> significantly. I can't find a previous bug report relating to this, so I'm 
> posting this as new.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26458) OneHotEncoderModel verifies the number of category values incorrectly when tries to transform a dataframe.

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26458.
---
Resolution: Not A Problem

I don't quite get this; it already accounts for handleInvalid and dropLast, and 
the fact that it can exist for multiple columns. Reopen if you can show a 
specific example.

> OneHotEncoderModel verifies the number of category values incorrectly when 
> tries to transform a dataframe.
> --
>
> Key: SPARK-26458
> URL: https://issues.apache.org/jira/browse/SPARK-26458
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: duruihuan
>Priority: Major
>
> When handleInvalid is set to "keep", one should not compare the 
> categorySizes of the transformSchema with the values in the metadata of the 
> dataframe to be transformed, because there may be more than one invalid 
> value in some columns of the dataframe, which causes the exception described 
> in lines 302-306 of OneHotEncoderEstimator.scala. In conclusion, I think 
> the verifyNumOfValues check in the transformSchema method should be removed; 
> it can be found at line 299 in the code.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26395) Spark Thrift server memory leak

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26395.
---
Resolution: Duplicate

If this isn't resolved for you 2.3.3 we can reopen, but the duplicate indicated 
above looks like a likely explanation.

> Spark Thrift server memory leak
> ---
>
> Key: SPARK-26395
> URL: https://issues.apache.org/jira/browse/SPARK-26395
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Konstantinos Andrikopoulos
>Priority: Major
>
> We are running Thrift Server in standalone mode and we have observed that the 
> heap of the driver is constantly increasing. After analysing the heap dump 
> the issue seems to be that the ElementTrackingStore is constantly increasing 
> due to the addition of RDDOperationGraphWrapper objects that are not cleaned 
> up.
> The ElementTrackingStore defines the addTrigger method, where you are able to 
> set thresholds in order to perform cleanup, but in practice it is only used for 
> the ExecutorSummaryWrapper, JobDataWrapper and StageDataWrapper classes, via 
> the following spark properties: 
>  * spark.ui.retainedDeadExecutors
>  * spark.ui.retainedJobs
>  * spark.ui.retainedStages
> So the RDDOperationGraphWrapper, which is added in the onJobStart 
> method of the AppStatusListener class [kvstore.write(uigraph), line 291], 
> is not cleaned up; it constantly grows, causing a memory leak.
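
A hedged sketch of the kind of cleanup trigger the reporter describes, modeled on how the other wrapper classes are handled; the cleanup helper and the reuse of spark.ui.retainedJobs are assumptions, not existing Spark code.

{code:scala}
// Hypothetical: register a trigger for RDDOperationGraphWrapper the way
// AppStatusListener does for jobs/stages/executors. `cleanupOperationGraphs`
// is an assumed helper, not a real Spark method.
kvstore.addTrigger(classOf[RDDOperationGraphWrapper],
    conf.getInt("spark.ui.retainedJobs", 1000)) { count =>
  cleanupOperationGraphs(count)
}
{code}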



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26404) set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster mode.

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26404.
---
Resolution: Not A Problem

> set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster 
> mode.
> ---
>
> Key: SPARK-26404
> URL: https://issues.apache.org/jira/browse/SPARK-26404
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Dongqing  Liu
>Priority: Major
>
> Neither
>    conf.set("spark.executorEnv.PYSPARK_PYTHON", "/opt/pythonenvs/bin/python")
> nor 
>   conf.set("spark.pyspark.python", "/opt/pythonenvs/bin/python") 
> works. 
> Looks like the executor always picks python from PATH.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26407) For an external non-partitioned table, if add a directory named with k=v to the table path, select result will be wrong

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26407.
---
Resolution: Not A Problem

I don't think it's reasonable to add arbitrary other dirs under this directory.

> For an external non-partitioned table, if add a directory named with k=v to 
> the table path, select result will be wrong
> ---
>
> Key: SPARK-26407
> URL: https://issues.apache.org/jira/browse/SPARK-26407
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bao Yunz
>Priority: Major
>  Labels: usability
>
> Scenario 1
> Create an external non-partitioned table whose location directory contains a 
> subdirectory named "part=1", with a schema of (id, name), for example, and 
> with some data inside the "part=1" directory. If we then describe the table, 
> we will find that "part" has been added to the table schema as a column. When 
> we insert two columns of data into the table, an exception is thrown saying 
> the target table has 3 columns but the inserted data has 2 columns. 
> Scenario 2
> Create an external non-partitioned table whose location path is empty and 
> whose schema is (id, name), for example. After several insert operations, 
> we add a directory named "part=1" under the table location directory, with 
> some data in it. If we then do insert and select operations, we will find 
> that the scan path has changed to "tablePath/part=1", so we get a wrong 
> result.
>  The right logic should be that if a table is a non-partitioned table, adding 
> a partition-like folder under tablePath should not change its schema or 
> select results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26425) Add more constraint checks in file streaming source to avoid checkpoint corruption

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782196#comment-16782196
 ] 

Sean Owen commented on SPARK-26425:
---

[~kabhwan] I think you're welcome to work on this.

> Add more constraint checks in file streaming source to avoid checkpoint 
> corruption
> --
>
> Key: SPARK-26425
> URL: https://issues.apache.org/jira/browse/SPARK-26425
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Two issues observed in production. 
> - HDFSMetadataLog.getLatest() tries to read older versions when it is not 
> able to read the latest listed version file. Not sure why this was done but 
> this should not be done. If the latest listed file is not readable, then 
> something is horribly wrong and we should fail rather than report an older 
> version as that can completely corrupt the checkpoint directory. 
> - FileStreamSource should check whether adding a new batch to the 
> FileStreamSourceLog succeeded or not (similar to how StreamExecution checks 
> for the OffsetSeqLog)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26408) java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:347)

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26408.
---
Resolution: Not A Problem

> java.util.NoSuchElementException: None.get at 
> scala.None$.get(Option.scala:347)
> ---
>
> Key: SPARK-26408
> URL: https://issues.apache.org/jira/browse/SPARK-26408
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Spark version 2.3.2
> Scala version 2.11.8
> Hbase version 1.4.7
>Reporter: Amit Siddhu
>Priority: Major
>
> {code:java}
> sudo spark-shell --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 
> --repositories http://repo.hortonworks.com/content/groups/public/
> {code}
> {code:java}
> import org.apache.spark.sql.{SQLContext, _}
> import org.apache.spark.sql.execution.datasources.hbase._
> import org.apache.spark.{SparkConf, SparkContext}
> import spark.sqlContext.implicits._
> {code}
> {code:java}
> def withCatalog(cat: String): DataFrame = {
>   spark.sqlContext
>   .read
>   .options(Map(HBaseTableCatalog.tableCatalog->cat))
>   .format("org.apache.spark.sql.execution.datasources.hbase")
>   .load()
> }
> {code}
> {code:java}
> def motorQuoteCatatog = s"""{ |"table":{"namespace":"default", 
> "name":"public.motor_product_quote", "tableCoder":"PrimitiveType"}, 
> |"rowkey":"id", |"columns":{ |"id":{"cf":"rowkey", "col":"id", 
> "type":"string"}, |"quote_id":{"cf":"motor_product_quote", "col":"quote_id", 
> "type":"string"}, |"vehicle_id":{"cf":"motor_product_quote", 
> "col":"vehicle_id", "type":"bigint"}, |"is_new":{"cf":"motor_product_quote", 
> "col":"is_new", "type":"boolean"}, 
> |"date_of_manufacture":{"cf":"motor_product_quote", 
> "col":"date_of_manufacture", "type":"string"}, 
> |"raw_data":{"cf":"motor_product_quote", "col":"raw_data", "type":"string"}, 
> |"is_processed":{"cf":"motor_product_quote", "col":"is_processed", 
> "type":"boolean"}, |"created_on":{"cf":"motor_product_quote", 
> "col":"created_on", "type":"string"}, |"type":{"cf":"motor_product_quote", 
> "col":"type", "type":"string"}, 
> |"requirement_id":{"cf":"motor_product_quote", "col":"requirement_id", 
> "type":"int"}, |"previous_policy_id":{"cf":"motor_product_quote", 
> "col":"type", "previous_policy_id":"int"}, 
> |"parent_quote_id":{"cf":"motor_product_quote", "col":"type", 
> "parent_quote_id":"int"}, |"ticket_id":{"cf":"motor_product_quote", 
> "col":"type", "ticket_id":"int"}, |"tracker_id":{"cf":"motor_product_quote", 
> "col":"tracker_id", "type":"int"}, |"category":{"cf":"motor_product_quote", 
> "col":"category", "type":"string"}, 
> |"sales_channel_id":{"cf":"motor_product_quote", "col":"sales_channel_id", 
> "type":"int"}, |"policy_type":{"cf":"motor_product_quote", 
> "col":"policy_type", "type":"string"}, 
> |"original_quote_created_by_id":{"cf":"motor_product_quote", "col":"type", 
> "original_quote_created_by_id":"int"}, 
> |"created_by_id":{"cf":"motor_product_quote", "col":"created_by_id", 
> "type":"int"}, |"mobile":{"cf":"motor_product_quote", "col":"mobile", 
> "type":"string"}, |"registration_number":{"cf":"motor_product_quote", 
> "col":"registration_number", "type":"string"} |} |}""".stripMargin
> {code}
>  
> {code:java}
> val df = withCatalog(motorQuoteCatatog){code}
> {code:java}
> java.util.NoSuchElementException: None.get
>  at scala.None$.get(Option.scala:347)
>  at scala.None$.get(Option.scala:345)
>  at org.apache.spark.sql.execution.datasources.hbase.Field.<init>
> (HBaseTableCatalog.scala:102)
>  at  
> org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$$anonfun$ap
>  ply$3.apply(HBaseTableCatalog.scala:286)
>  at 
> org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$$anonfun$apply$3.apply(HBaseTableCatalog.scala:281)
> at scala.collection.immutable.List.foreach(List.scala:381)
>  at 
> org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:281)
>  at 
> org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.<init>(HBaseRelation.scala:80)
>  at 
> org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:51)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
>  at withCatalog(<console>:38)
>  ... 55 elided
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Resolved] (SPARK-26492) support streaming DecisionTreeRegressor

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26492.
---
Resolution: Won't Fix

I'm not even sure that can be implemented in a one-pass algorithm? You'd have 
to give more detail.

> support streaming DecisionTreeRegressor
> ---
>
> Key: SPARK-26492
> URL: https://issues.apache.org/jira/browse/SPARK-26492
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: sky54521
>Priority: Major
>  Labels: DecisionTreeRegressor
>
> hope to support streaming DecisionTreeRegressor as soon as possible



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26494) 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type can't be found,

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782193#comment-16782193
 ] 

Sean Owen commented on SPARK-26494:
---

Correct me if I'm wrong, but does a timestamp without local time zone make 
sense in Spark?

> 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type 
> can't be found,
> --
>
> Key: SPARK-26494
> URL: https://issues.apache.org/jira/browse/SPARK-26494
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: kun'qin 
>Priority: Minor
>
> Using Spark to read the Oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type, the 
> type can't be found.
> When the data type is TIMESTAMP(6) WITH LOCAL TIME ZONE,
> the sqlType value passed to the getCatalystType function in the 
> JdbcUtils class is -102.
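
A hedged sketch of a possible user-side workaround via a custom JdbcDialect; mapping the -102 type code to TimestampType is an assumption based on the report above, not an endorsed fix.

{code:scala}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, TimestampType}

// Sketch: intercept the unrecognized Oracle type code before JdbcUtils gives up.
object OracleLocalTzDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    // -102 is the vendor code reported above for TIMESTAMP(6) WITH LOCAL TIME ZONE
    if (sqlType == -102) Some(TimestampType) else None
  }
}

JdbcDialects.registerDialect(OracleLocalTzDialect)
{code}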



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26505) Catalog class Function is missing "database" field

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782190#comment-16782190
 ] 

Sean Owen commented on SPARK-26505:
---

Go ahead with a PR

> Catalog class Function is missing "database" field
> --
>
> Key: SPARK-26505
> URL: https://issues.apache.org/jira/browse/SPARK-26505
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Devin Boyer
>Priority: Minor
>
> This change fell out of the review of 
> [https://github.com/apache/spark/pull/20658], which is the implementation of 
> https://issues.apache.org/jira/browse/SPARK-23488. The Scala Catalog class 
> [Function|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Function]
>  contains a `database` attribute, while the [Python 
> version|https://github.com/apache/spark/blob/v2.4.0/python/pyspark/sql/catalog.py#L32]
>  does not.
>  
> To be consistent, it would likely be best to add the `database` attribute to 
> the Python class. This would be a breaking API change, though (as discussed 
> in [this PR 
> comment|https://github.com/apache/spark/pull/20658#issuecomment-368561007]),
>  so it would have to be made for Spark 3.0.0, the next major version where 
> breaking API changes can occur.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26506) RegressionMetrics fails in Spark 2.4

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26506.
---
Resolution: Cannot Reproduce

There's not enough info here. What's the error? These work in general per 
integration tests. Reopen if it can be narrowed down, ideally with some kind of 
reproduction.

> RegressionMetrics fails in Spark 2.4
> 
>
> Key: SPARK-26506
> URL: https://issues.apache.org/jira/browse/SPARK-26506
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.0
> Environment: Windows using the Anaconda stack running Spark 2.4, 
> using Java jdk 1.8.0_181.  It may also affect unix when running Spark 2.4, 
> not sure because my workplace where I use Spark in unix is still on Spark 
> 2.2. 
> The bug does not appear cause an error in either 2.3 or 2.2 on either windows 
> or unix.
>Reporter: Casey Bennett
>Priority: Major
>
> RegressionMetrics fails in Spark 2.4 when running via Anaconda on a Windows 
> machine.  A Java error comes back saying that the "python worker failed to 
> connect back".  This makes all the evaluation metrics 
> ([https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html]) 
> unusable for scoring model performance. 
> I reverted to Spark 2.3 and did not have this issue, and also tested 2.2 and did 
> not have this issue.  So it appears to be a bug specific to Spark 2.4.  
> It likely also affects other evaluation metric types, 
> e.g. BinaryClassificationMetrics.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26512) Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26512.
---
Resolution: Cannot Reproduce

> Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10
> --
>
> Key: SPARK-26512
> URL: https://issues.apache.org/jira/browse/SPARK-26512
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, YARN
>Affects Versions: 2.4.0
> Environment: operating system : Windows 10
> Spark Version : 2.4.0
> Hadoop Version : 2.8.3
>Reporter: Anubhav Jain
>Priority: Minor
>  Labels: windows
> Attachments: log.png
>
>
> I have installed Hadoop version 2.8.3 in my Windows 10 environment and it is 
> working fine. Now when I try to install Apache Spark (version 2.4.0) with YARN 
> as the cluster manager, it is not working. When I try to submit a Spark job 
> using spark-submit for testing, it shows up under the ACCEPTED tab in the YARN 
> UI and after that it fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26518) UI Application Info Race Condition Can Throw NoSuchElement

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782186#comment-16782186
 ] 

Sean Owen commented on SPARK-26518:
---

It looks easy to handle the case where applicationInfo() isn't available; 
[~planga82], are you saying you've tried that and other stuff breaks? Yeah, this 
isn't great, but it might be hard to fix properly. 
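
A hedged sketch of one way the empty-store window could be handled; the names mirror AppStatusStore and ApplicationInfoWrapper from the stack trace, but this is illustrative, not the committed fix.

{code:scala}
// Illustrative only: return None instead of calling next() on a possibly
// empty iterator while the KVStore is still being populated.
import scala.collection.JavaConverters._

def applicationInfoOpt(): Option[v1.ApplicationInfo] = {
  store.view(classOf[ApplicationInfoWrapper]).max(1).asScala.headOption.map(_.info)
}
{code}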

> UI Application Info Race Condition Can Throw NoSuchElement
> --
>
> Key: SPARK-26518
> URL: https://issues.apache.org/jira/browse/SPARK-26518
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Russell Spitzer
>Priority: Trivial
> Attachments: 15476091405552344590691778159589.jpg
>
>
> There is a slight race condition in the 
> [AppStatusStore|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala#L39],
> which calls `next` on the returned iterator even if it is empty, which it can 
> be for a short period of time after the UI is up but before the store is 
> populated.
> {code}
> 
> 
> Error 500 Server Error
> 
> HTTP ERROR 500
> Problem accessing /jobs/. Reason:
> Server Error
> Caused by: java.util.NoSuchElementException
> at java.util.Collections$EmptyIterator.next(Collections.java:4189)
> at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:281)
> at 
> org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:38)
> at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:275)
> at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:86)
> at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:86)
> at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
> at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:724)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at org.spark_project.jetty.server.Server.handle(Server.java:531)
> at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:352)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
> at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
> at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:102)
> at 
> org.spark_project.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26555) Thread safety issue causes createDataset to fail with misleading errors

2019-03-01 Thread Martin Loncaric (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782185#comment-16782185
 ] 

Martin Loncaric commented on SPARK-26555:
-

Will try it out and report back

> Thread safety issue causes createDataset to fail with misleading errors
> ---
>
> Key: SPARK-26555
> URL: https://issues.apache.org/jira/browse/SPARK-26555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
>
> This can be replicated (~2% of the time) with
> {code:scala}
> import java.sql.Timestamp
> import java.util.concurrent.{Executors, Future}
> import org.apache.spark.sql.SparkSession
> import scala.collection.mutable.ListBuffer
> import scala.concurrent.ExecutionContext
> import scala.util.Random
> object Main {
>   def main(args: Array[String]): Unit = {
> val sparkSession = SparkSession.builder
>   .getOrCreate()
> import sparkSession.implicits._
> val executor = Executors.newFixedThreadPool(1)
> try {
>   implicit val xc: ExecutionContext = 
> ExecutionContext.fromExecutorService(executor)
>   val futures = new ListBuffer[Future[_]]()
>   for (i <- 1 to 3) {
> futures += executor.submit(new Runnable {
>   override def run(): Unit = {
> val d = if (Random.nextInt(2) == 0) Some("d value") else None
> val e = if (Random.nextInt(2) == 0) Some(5.0) else None
> val f = if (Random.nextInt(2) == 0) Some(6.0) else None
> println("DEBUG", d, e, f)
> sparkSession.createDataset(Seq(
>   MyClass(new Timestamp(1L), "b", "c", d, e, f)
> ))
>   }
> })
>   }
>   futures.foreach(_.get())
> } finally {
>   println("SHUTDOWN")
>   executor.shutdown()
>   sparkSession.stop()
> }
>   }
>   case class MyClass(
> a: Timestamp,
> b: String,
> c: String,
> d: Option[String],
> e: Option[Double],
> f: Option[Double]
>   )
> }
> {code}
> So it will usually come up during
> {code:bash}
> for i in $(seq 1 200); do
>   echo $i
>   spark-submit --master local[4] target/scala-2.11/spark-test_2.11-0.1.jar
> done
> {code}
> causing a variety of possible errors, such as
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:210){code}
> or
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26523) Getting this error while reading from kinesis :- Could not read until the end sequence number of the range: SequenceNumberRange

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26523.
---
Resolution: Not A Problem

This would need more info or to be narrowed down further to make it actionable. 
It's not clear it's a Spark issue. 

> Getting this error while reading from kinesis :- Could not read until the end 
> sequence number of the range: SequenceNumberRange
> ---
>
> Key: SPARK-26523
> URL: https://issues.apache.org/jira/browse/SPARK-26523
> Project: Spark
>  Issue Type: Brainstorming
>  Components: DStreams, Spark Submit, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: CHIRAG YADAV
>Priority: Major
>
> I am using Spark to read data from a Kinesis stream, and after reading data for 
> some time I get this error: ERROR Executor: Exception in task 74.0 in stage 
> 52.0 (TID 339) org.apache.spark.SparkException: Could not read until the end 
> sequence number of the range: 
> SequenceNumberRange(godel-logs,shardId-0007,49591040259365283625183097566179815847537156031957172338,49591040259365283625183097600068424422974441881954418802,4517)
>  
> Can someone please tell me why I am getting this error and how to resolve it?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26555) Thread safety issue causes createDataset to fail with misleading errors

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782179#comment-16782179
 ] 

Sean Owen commented on SPARK-26555:
---

Hm, I might have that backwards; it might be when all are not None? At least, I 
have a strong suspicion it's to do with the data that gets generated in some 
runs. Maybe fix that at one data set and see if you can reproduce?
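
To illustrate, a variant of the repro with the randomly generated Options pinned 
to fixed values (same single-thread executor, purely to check whether the 
generated data matters; names here mirror the repro above):

{code:scala}
import java.sql.Timestamp
import java.util.concurrent.Executors

import org.apache.spark.sql.SparkSession

object PinnedRepro {
  case class MyClass(a: Timestamp, b: String, c: String,
                     d: Option[String], e: Option[Double], f: Option[Double])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._
    val executor = Executors.newFixedThreadPool(1)
    try {
      // Same structure as the original repro, but every run builds identical rows.
      val futures = (1 to 3).map { _ =>
        executor.submit(new Runnable {
          override def run(): Unit = {
            spark.createDataset(Seq(
              MyClass(new Timestamp(1L), "b", "c", Some("d value"), Some(5.0), None)
            )).count()
          }
        })
      }
      futures.foreach(_.get())
    } finally {
      executor.shutdown()
      spark.stop()
    }
  }
}
{code}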

> Thread safety issue causes createDataset to fail with misleading errors
> ---
>
> Key: SPARK-26555
> URL: https://issues.apache.org/jira/browse/SPARK-26555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
>
> This can be replicated (~2% of the time) with
> {code:scala}
> import java.sql.Timestamp
> import java.util.concurrent.{Executors, Future}
> import org.apache.spark.sql.SparkSession
> import scala.collection.mutable.ListBuffer
> import scala.concurrent.ExecutionContext
> import scala.util.Random
> object Main {
>   def main(args: Array[String]): Unit = {
> val sparkSession = SparkSession.builder
>   .getOrCreate()
> import sparkSession.implicits._
> val executor = Executors.newFixedThreadPool(1)
> try {
>   implicit val xc: ExecutionContext = 
> ExecutionContext.fromExecutorService(executor)
>   val futures = new ListBuffer[Future[_]]()
>   for (i <- 1 to 3) {
> futures += executor.submit(new Runnable {
>   override def run(): Unit = {
> val d = if (Random.nextInt(2) == 0) Some("d value") else None
> val e = if (Random.nextInt(2) == 0) Some(5.0) else None
> val f = if (Random.nextInt(2) == 0) Some(6.0) else None
> println("DEBUG", d, e, f)
> sparkSession.createDataset(Seq(
>   MyClass(new Timestamp(1L), "b", "c", d, e, f)
> ))
>   }
> })
>   }
>   futures.foreach(_.get())
> } finally {
>   println("SHUTDOWN")
>   executor.shutdown()
>   sparkSession.stop()
> }
>   }
>   case class MyClass(
> a: Timestamp,
> b: String,
> c: String,
> d: Option[String],
> e: Option[Double],
> f: Option[Double]
>   )
> }
> {code}
> So it will usually come up during
> {code:bash}
> for i in $(seq 1 200); do
>   echo $i
>   spark-submit --master local[4] target/scala-2.11/spark-test_2.11-0.1.jar
> done
> {code}
> causing a variety of possible errors, such as
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:210){code}
> or
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27024) Design executor interface to support GPU resources

2019-03-01 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782167#comment-16782167
 ] 

Thomas Graves commented on SPARK-27024:
---

This and SPARK-27005 basically split the design of the entire feature in two. 
The intention is for SPARK-27005 to be the core scheduler pieces, and this one 
is for the cluster manager and parts of the executor side.

 

I will try to clarify the description when I get a chance to look at it a bit 
more, probably early next week.

> Design executor interface to support GPU resources
> --
>
> Key: SPARK-27024
> URL: https://issues.apache.org/jira/browse/SPARK-27024
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> The executor interface shall deal with the resources allocated to the 
> executor by the cluster managers (Standalone, YARN, Kubernetes), so the Spark 
> executor doesn't need to be involved in GPU discovery and allocation, which 
> shall be handled by the cluster managers. However, an executor needs to sync 
> with the driver to expose the available resources to support task scheduling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException

2019-03-01 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782162#comment-16782162
 ] 

Gabor Somogyi edited comment on SPARK-26727 at 3/1/19 10:21 PM:


I've seen this issue lately when I was dealing with a unit test but that test 
created a table and not a view (latest master).


was (Author: gsomogyi):
I've seen this issue lately when I was dealing with a unit test but that test 
created a table and not a view.

> CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
> ---
>
> Key: SPARK-26727
> URL: https://issues.apache.org/jira/browse/SPARK-26727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Srinivas Yarra
>Priority: Major
>
> We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW <view name> AS SELECT <columns> FROM <table>" fails with the following exception:
> {code:java}
> // code placeholder
> org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or 
> view '' already exists in database 'default'; at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) 
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> org.apache.spark.sql.Dataset.(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at 
> org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided
> {code}
> {code}
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res7: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") 
> org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or 
> view 'testsparkreplace' already exists in database 'default'; at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:246)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:236)
>  at 
> 

[jira] [Commented] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException

2019-03-01 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782162#comment-16782162
 ] 

Gabor Somogyi commented on SPARK-26727:
---

I've seen this issue lately when I was dealing with a unit test but that test 
created a table and not a view.

> CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
> ---
>
> Key: SPARK-26727
> URL: https://issues.apache.org/jira/browse/SPARK-26727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Srinivas Yarra
>Priority: Major
>
> We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW <view name> AS SELECT <columns> FROM <table>" fails with the following exception:
> {code:java}
> // code placeholder
> org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or 
> view '' already exists in database 'default'; at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) 
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> org.apache.spark.sql.Dataset.(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at 
> org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided
> {code}
> {code}
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res7: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") 
> org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or 
> view 'testsparkreplace' already exists in database 'default'; at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:246)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:236)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:319)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) 
> at 
> 

[jira] [Commented] (SPARK-26555) Thread safety issue causes createDataset to fail with misleading errors

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782146#comment-16782146
 ] 

Sean Owen commented on SPARK-26555:
---

This doesn't look like a Spark bug. It comes up, I think, when your random data 
set has all "None" for a column. That's what the error indicates at least. That 
part of the code shouldn't have any shared state. Can you verify from your 
debug output?
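
If it helps to separate the data hypothesis from a concurrency hypothesis, one 
purely diagnostic option (a sketch, not a suggested fix) is to funnel the 
createDataset calls in the repro through a single lock and see whether the 
errors still appear:

{code:scala}
// Diagnostic sketch only. If the MatchError / UnsupportedOperationException goes
// away once the calls are serialized, the failure correlates with concurrent
// encoder derivation; if it still appears, it correlates with the generated data.
object CreateDatasetLock {
  private val lock = new Object
  def serially[T](body: => T): T = lock.synchronized(body)
}

// In the repro's run() method, wrap the existing call:
//   CreateDatasetLock.serially {
//     sparkSession.createDataset(Seq(MyClass(new Timestamp(1L), "b", "c", d, e, f)))
//   }
{code}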

> Thread safety issue causes createDataset to fail with misleading errors
> ---
>
> Key: SPARK-26555
> URL: https://issues.apache.org/jira/browse/SPARK-26555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
>
> This can be replicated (~2% of the time) with
> {code:scala}
> import java.sql.Timestamp
> import java.util.concurrent.{Executors, Future}
> import org.apache.spark.sql.SparkSession
> import scala.collection.mutable.ListBuffer
> import scala.concurrent.ExecutionContext
> import scala.util.Random
> object Main {
>   def main(args: Array[String]): Unit = {
> val sparkSession = SparkSession.builder
>   .getOrCreate()
> import sparkSession.implicits._
> val executor = Executors.newFixedThreadPool(1)
> try {
>   implicit val xc: ExecutionContext = 
> ExecutionContext.fromExecutorService(executor)
>   val futures = new ListBuffer[Future[_]]()
>   for (i <- 1 to 3) {
> futures += executor.submit(new Runnable {
>   override def run(): Unit = {
> val d = if (Random.nextInt(2) == 0) Some("d value") else None
> val e = if (Random.nextInt(2) == 0) Some(5.0) else None
> val f = if (Random.nextInt(2) == 0) Some(6.0) else None
> println("DEBUG", d, e, f)
> sparkSession.createDataset(Seq(
>   MyClass(new Timestamp(1L), "b", "c", d, e, f)
> ))
>   }
> })
>   }
>   futures.foreach(_.get())
> } finally {
>   println("SHUTDOWN")
>   executor.shutdown()
>   sparkSession.stop()
> }
>   }
>   case class MyClass(
> a: Timestamp,
> b: String,
> c: String,
> d: Option[String],
> e: Option[Double],
> f: Option[Double]
>   )
> }
> {code}
> So it will usually come up during
> {code:bash}
> for i in $(seq 1 200); do
>   echo $i
>   spark-submit --master local[4] target/scala-2.11/spark-test_2.11-0.1.jar
> done
> {code}
> causing a variety of possible errors, such as
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:210){code}
> or
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26589) proper `median` method for spark dataframe

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26589:
--
Priority: Minor  (was: Major)

Would you like to implement it? It's kind of DIY here. It's not crazy to add, 
but indeed, how would you do it efficiently at scale?
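
For reference, here is one way an exact median can be hand-rolled today, via a 
global sort plus row indexing; a rough sketch assuming a non-empty Double column 
named by `col`, not an optimized implementation:

{code:scala}
import org.apache.spark.sql.DataFrame

object ExactMedian {
  // Exact median: sort the column, index every row, and pick the middle value
  // (or the mean of the two middle values). The full shuffle required by the
  // global sort is exactly why an efficient built-in is the hard part.
  def median(df: DataFrame, col: String): Double = {
    val sorted = df.select(col).na.drop()
      .orderBy(col)
      .rdd.map(_.getDouble(0))
      .zipWithIndex()
      .map { case (v, i) => (i, v) }
    val n = sorted.count()
    require(n > 0, s"no non-null values in column $col")
    if (n % 2 == 1) {
      sorted.lookup(n / 2).head
    } else {
      (sorted.lookup(n / 2 - 1).head + sorted.lookup(n / 2).head) / 2.0
    }
  }
}
{code}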

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as a 
> duplicate of it. The thing is that the approximate quantile is a workaround for 
> the lack of a median function. Thus I am filing this feature request for a 
> proper, exact median function, not an approximation of it. I am aware of the 
> difficulties caused by a distributed environment when trying to compute a 
> median; nevertheless, I don't think those difficulties are a good enough reason 
> to leave a `median` function out of the scope of Spark. I am not asking for an 
> efficient median but an exact median.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26534) Closure Cleaner Bug

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26534:
--
Priority: Minor  (was: Major)

Yes, the closure cleaner has never been able to be 100% sure it gets all the 
references. It also can't null some references, as that would modify other 
objects' state (think of references to other objects that are shared by other 
objects). This also partly depends on how Scala chooses to represent it. 

Try Scala 2.12; its implementation of closures uses the lambda metafactory and 
lots of this goes away.

I agree it's weird, but what do you propose?
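
For anyone who hits this before moving to 2.12, a commonly suggested workaround 
(a sketch of the general pattern, not a closure-cleaner fix) is to move the Spark 
action into a helper that only sees the values it needs, so `thingy` is never in 
scope of the serialized closure:

{code:scala}
import org.apache.spark.sql.{Dataset, Encoders}

object CountWithSuffix {
  // The lambda below captures only the `suffix` parameter, so nothing from the
  // caller's enclosing scope (in particular `thingy`) can leak into the task.
  def apply(ds: Dataset[String], suffix: String): Long =
    ds.map(r => r + suffix)(Encoders.STRING).count()
}

// In the original code:
//   val someFunc = () => {
//     val foo: String = "foo"
//     thingy.run(block = () => CountWithSuffix(landedData, foo))
//   }
{code}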

> Closure Cleaner Bug
> ---
>
> Key: SPARK-26534
> URL: https://issues.apache.org/jira/browse/SPARK-26534
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: sam
>Priority: Minor
>
> I've found a strange combination of closures where the closure cleaner 
> doesn't seem to be smart enough to figure out how to remove a reference that 
> is not used. I.e. we get an `org.apache.spark.SparkException: Task not 
> serializable` for a Task that is perfectly serializable.  
>  
> In the example below, the only `val` that is actually needed for the closure 
> of the `map` is `foo`, but it tries to serialise `thingy`.  What is odd is 
> that changing this code in a number of subtle ways eliminates the error, which 
> I've tried to highlight using comments inline.
>  
> {code:java}
> import org.apache.spark.sql._
> object Test {
>   val sparkSession: SparkSession =
> SparkSession.builder.master("local").appName("app").getOrCreate()
>   def apply(): Unit = {
> import sparkSession.implicits._
> val landedData: Dataset[String] = 
> sparkSession.sparkContext.makeRDD(Seq("foo", "bar")).toDS()
> // thingy has to be in this outer scope to reproduce, if in someFunc, 
> cannot reproduce
> val thingy: Thingy = new Thingy
> // If not wrapped in someFunc cannot reproduce
> val someFunc = () => {
>   // If don't reference this foo inside the closer (e.g. just use 
> identity function) cannot reproduce
>   val foo: String = "foo"
>   thingy.run(block = () => {
> landedData.map(r => {
>   r + foo
> })
> .count()
>   })
> }
> someFunc()
>   }
> }
> class Thingy {
>   def run[R](block: () => R): R = {
> block()
>   }
> }
> {code}
> The full trace if ran in `sbt console`
> {code}
> scala> class Thingy {
>  |   def run[R](block: () => R): R = {
>  | block()
>  |   }
>  | }
> defined class Thingy
> scala> 
> scala> object Test {
>  |   val sparkSession: SparkSession =
>  | SparkSession.builder.master("local").appName("app").getOrCreate()
>  | 
>  |   def apply(): Unit = {
>  | import sparkSession.implicits._
>  | 
>  | val landedData: Dataset[String] = 
> sparkSession.sparkContext.makeRDD(Seq("foo", "bar")).toDS()
>  | 
>  | // thingy has to be in this outer scope to reproduce, if in 
> someFunc, cannot reproduce
>  | val thingy: Thingy = new Thingy
>  | 
>  | // If not wrapped in someFunc cannot reproduce
>  | val someFunc = () => {
>  |   // If don't reference this foo inside the closer (e.g. just use 
> identity function) cannot reproduce
>  |   val foo: String = "foo"
>  | 
>  |   thingy.run(block = () => {
>  | landedData.map(r => {
>  |   r + foo
>  | })
>  | .count()
>  |   })
>  | }
>  | 
>  | someFunc()
>  | 
>  |   }
>  | }
> defined object Test
> scala> 
> scala> 
> scala> Test.apply()
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 19/01/07 11:27:19 INFO SparkContext: Running Spark version 2.3.1
> 19/01/07 11:27:20 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 19/01/07 11:27:20 INFO SparkContext: Submitted application: app
> 19/01/07 11:27:20 INFO SecurityManager: Changing view acls to: sams
> 19/01/07 11:27:20 INFO SecurityManager: Changing modify acls to: sams
> 19/01/07 11:27:20 INFO SecurityManager: Changing view acls groups to: 
> 19/01/07 11:27:20 INFO SecurityManager: Changing modify acls groups to: 
> 19/01/07 11:27:20 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(sams); groups 
> with view permissions: Set(); users  with modify permissions: Set(sams); 
> groups with modify permissions: Set()
> 19/01/07 11:27:20 INFO Utils: Successfully started service 'sparkDriver' on 
> port 54066.
> 19/01/07 11:27:20 INFO SparkEnv: Registering MapOutputTracker
> 19/01/07 11:27:20 INFO SparkEnv: Registering BlockManagerMaster
> 

[jira] [Resolved] (SPARK-26568) Too many partitions may cause thriftServer frequently Full GC

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26568.
---
Resolution: Not A Problem

I don't think this is actionable; you're generally saying that the Hive 
thriftserver GCs when you make it do more work. Unless it's more specific than 
that, I don't think it's useful.

> Too many partitions may cause thriftServer frequently Full GC
> -
>
> Key: SPARK-26568
> URL: https://issues.apache.org/jira/browse/SPARK-26568
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: zhoukang
>Priority: Major
>
> The reason is that:
> first, we have a table with many partitions (maybe several hundred); second, we 
> have some concurrent queries. Then the long-running thriftServer may encounter 
> an OOM issue.
> Here is a case:
> call stack of the OOM thread:
> {code:java}
> pool-34-thread-10
>   at 
> org.apache.hadoop.hive.metastore.api.StorageDescriptor.(Lorg/apache/hadoop/hive/metastore/api/StorageDescriptor;)V
>  (StorageDescriptor.java:240)
>   at 
> org.apache.hadoop.hive.metastore.api.Partition.(Lorg/apache/hadoop/hive/metastore/api/Partition;)V
>  (Partition.java:216)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.deepCopy(Lorg/apache/hadoop/hive/metastore/api/Partition;)Lorg/apache/hadoop/hive/metastore/api/Partition;
>  (HiveMetaStoreClient.java:1343)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.deepCopyPartitions(Ljava/util/Collection;Ljava/util/List;)Ljava/util/List;
>  (HiveMetaStoreClient.java:1409)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.deepCopyPartitions(Ljava/util/List;)Ljava/util/List;
>  (HiveMetaStoreClient.java:1397)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;S)Ljava/util/List;
>  (HiveMetaStoreClient.java:914)
>   at 
> sun.reflect.GeneratedMethodAccessor98.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;
>  (Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;
>  (DelegatingMethodAccessorImpl.java:43)
>   at 
> java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;
>  (Method.java:606)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object;
>  (RetryingMetaStoreClient.java:90)
>   at 
> com.sun.proxy.$Proxy30.listPartitionsByFilter(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;S)Ljava/util/List;
>  (Unknown Source)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Lorg/apache/hadoop/hive/ql/metadata/Table;Ljava/lang/String;)Ljava/util/List;
>  (Hive.java:1967)
>   at 
> sun.reflect.GeneratedMethodAccessor97.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;
>  (Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;
>  (DelegatingMethodAccessorImpl.java:43)
>   at 
> java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;
>  (Method.java:606)
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(Lorg/apache/hadoop/hive/ql/metadata/Hive;Lorg/apache/hadoop/hive/ql/metadata/Table;Lscala/collection/Seq;)Lscala/collection/Seq;
>  (HiveShim.scala:602)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply()Lscala/collection/Seq;
>  (HiveClientImpl.scala:608)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply()Ljava/lang/Object;
>  (HiveClientImpl.scala:606)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply()Ljava/lang/Object;
>  (HiveClientImpl.scala:321)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(Lscala/Function0;Lscala/runtime/IntRef;Lscala/runtime/ObjectRef;Ljava/lang/Object;)V
>  (HiveClientImpl.scala:264)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(Lscala/Function0;)Ljava/lang/Object;
>  (HiveClientImpl.scala:263)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(Lscala/Function0;)Ljava/lang/Object;
>  (HiveClientImpl.scala:307)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(Lorg/apache/spark/sql/catalyst/catalog/CatalogTable;Lscala/collection/Seq;)Lscala/collection/Seq;
>  (HiveClientImpl.scala:606)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply()Lscala/collection/Seq;
>  (HiveExternalCatalog.scala:1017)
>   at 
> 

[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782137#comment-16782137
 ] 

Sean Owen commented on SPARK-26570:
---

How big can these be? Are you saying they're large, or that they leak?


> Out of memory when InMemoryFileIndex bulkListLeafFiles
> --
>
> Key: SPARK-26570
> URL: https://issues.apache.org/jira/browse/SPARK-26570
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: deshanxiao
>Priority: Major
> Attachments: screenshot-1.png
>
>
> *bulkListLeafFiles* will collect all FileStatus objects in memory for every 
> query, which may cause an OOM in the driver. I hit the problem using Spark 
> 2.3.2. The latest version may have the problem as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode

2019-03-01 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782135#comment-16782135
 ] 

Gabor Somogyi commented on SPARK-26998:
---

How is this different from https://github.com/apache/spark/pull/23820?

> spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor 
> processes in Standalone mode
> ---
>
> Key: SPARK-26998
> URL: https://issues.apache.org/jira/browse/SPARK-26998
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Security, Spark Core
>Affects Versions: 2.3.3, 2.4.0
>Reporter: t oo
>Priority: Major
>  Labels: SECURITY, Security, secur, security, security-issue
>
> Run Spark in standalone mode, then start a spark-submit requiring at least 1 
> executor. Do a 'ps -ef' on Linux (i.e. a PuTTY terminal) and you will be able to 
> see the spark.ssl.keyStorePassword value in plaintext!
>  
> spark.ssl.keyStorePassword and  spark.ssl.keyPassword don't need to be passed 
> to  CoarseGrainedExecutorBackend. Only  spark.ssl.trustStorePassword is used.
>  
> This can be resolved if the PR below is merged:
> [[Github] Pull Request #21514 
> (tooptoop4)|https://github.com/apache/spark/pull/21514]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26587) Deadlock between SparkUI thread and Driver thread

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26587.
---
Resolution: Duplicate

> Deadlock between SparkUI thread and Driver thread  
> ---
>
> Key: SPARK-26587
> URL: https://issues.apache.org/jira/browse/SPARK-26587
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: EMR 5.9.0
>Reporter: Vitaliy Savkin
>Priority: Major
> Attachments: 
> _Spark_node_hanging__Thread_dump_from_application_master.txt
>
>
> Once a month (~1000 runs), one of our Spark applications freezes at 
> startup. jstack says that there is a deadlock. Please see locks 
> 0x802c00c0 and 0x8271bb98 in the stack traces below.
> {noformat}
> "Driver":
> at java.lang.Package.getSystemPackage(Package.java:540)
> - waiting to lock <0x802c00c0> (a java.util.HashMap)
> at java.lang.ClassLoader.getPackage(ClassLoader.java:1625)
> at java.net.URLClassLoader.getAndVerifyPackage(URLClassLoader.java:394)
> at java.net.URLClassLoader.definePackageInternal(URLClassLoader.java:420)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:452)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> - locked <0x82789598> (a 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:221)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:210)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
> - locked <0x82789540> (a 
> org.apache.spark.sql.internal.NonClosableMutableURLClassLoader)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)
> at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
> at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
> at javax.xml.parsers.FactoryFinder$1.run(FactoryFinder.java:294)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.xml.parsers.FactoryFinder.findServiceProvider(FactoryFinder.java:289)
> at javax.xml.parsers.FactoryFinder.find(FactoryFinder.java:267)
> at 
> javax.xml.parsers.DocumentBuilderFactory.newInstance(DocumentBuilderFactory.java:120)
> at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2516)
> at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2492)
> at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2405)
> - locked <0x8271bb98> (a org.apache.hadoop.conf.Configuration)
> at org.apache.hadoop.conf.Configuration.get(Configuration.java:981)
> at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1031)
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2189)
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2702)
> at 
> org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
> at java.net.URL.getURLStreamHandler(URL.java:1142)
> at java.net.URL.(URL.java:599)
> at java.net.URL.(URL.java:490)
> at java.net.URL.(URL.java:439)
> at java.net.JarURLConnection.parseSpecs(JarURLConnection.java:175)
> at java.net.JarURLConnection.(JarURLConnection.java:158)
> at sun.net.www.protocol.jar.JarURLConnection.(JarURLConnection.java:81)
> at sun.net.www.protocol.jar.Handler.openConnection(Handler.java:41)
> at java.net.URL.openConnection(URL.java:979)
> at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:238)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:216)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:210)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
> - locked <0x82789540> (a 
> org.apache.spark.sql.internal.NonClosableMutableURLClassLoader)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:262)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:362)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:266)
> at 
> 

[jira] [Resolved] (SPARK-26602) Once creating and quering udf with incorrect path,followed by querying tables or functions registered with correct path gives the runtime exception within the same sess

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26602.
---
Resolution: Duplicate

> Once creating and quering udf with incorrect path,followed by querying tables 
> or functions registered with correct path gives the runtime exception within 
> the same session
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
>
> In SQL:
> 1. Query the existing UDF (say myFunc1).
> 2. Create and select the UDF registered with an incorrect path (say myFunc2).
> 3. Now query the existing UDF again in the same session - it will throw an 
> exception stating that it couldn't read the resource at myFunc2's path.
> 4. Even basic operations like insert and select will fail giving the same 
> error.
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26991) Investigate difference of `returnNullable` between ScalaReflection.deserializerFor and JavaTypeInference.deserializerFor

2019-03-01 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782128#comment-16782128
 ] 

Jungtaek Lim edited comment on SPARK-26991 at 3/1/19 9:53 PM:
--

The outcome would be
 * "invalid" if there's a legitimate reason for the Java side to use `returnNullable 
= true`
 * a new patch if there's no reason for the Java side to use `returnNullable = true` 
separately and it can follow the Scala side

Btw, is it discouraged to file an issue for a TODO task? I agree this might not 
be an issue after investigation: this is just so the TODO isn't missed - given this 
issue is open to the public, someone other than me can investigate it.


was (Author: kabhwan):
The outcome would be
 * "invalid" if there's a legitimate reason for the Java side to use `returnNullable 
= true`
 * a new patch if there's no reason for the Java side to use `returnNullable = true` 
separately and it can follow the Scala side

Btw, is it discouraged to file an issue for a TODO task? I agree this wouldn't be 
an issue after investigation: this is just so the TODO isn't missed - given this 
issue is open to the public, someone other than me can investigate it.

> Investigate difference of `returnNullable` between 
> ScalaReflection.deserializerFor and JavaTypeInference.deserializerFor
> 
>
> Key: SPARK-26991
> URL: https://issues.apache.org/jira/browse/SPARK-26991
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue tracks the effort to investigate the difference between 
> ScalaReflection.deserializerFor and JavaTypeInference.deserializerFor, 
> especially the reason why the Java side uses `returnNullable = true` whereas 
> the Scala side uses `returnNullable = false`.
> The origin discussion is linked here:
> https://github.com/apache/spark/pull/23854#discussion_r260117702



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26770) Misleading/unhelpful error message when wrapping a null in an Option

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26770:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

I'm not sure; it's a user code error and it does get an informative exception 
about the cause. The place it gets checked is about the right place.
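
As a side note for anyone who runs into this, the usual guard on the user side is 
`Option(...)`, which maps a null to `None` instead of producing `Some(null)`; a 
tiny sketch:

{code:scala}
// e.g. in spark-shell:
// Option(x) is None when x is null and Some(x) otherwise, so a possibly-null
// value from an external source never ends up wrapped as Some(null).
val maybeName: String = null
val productName: Option[String] = Option(maybeName)   // None, not Some(null)
val productID: Option[Int] = Option(6050286)          // Some(6050286)
{code}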

> Misleading/unhelpful error message when wrapping a null in an Option
> 
>
> Key: SPARK-26770
> URL: https://issues.apache.org/jira/browse/SPARK-26770
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: sam
>Priority: Minor
>
> This
> {code}
> // Using options to indicate nullable fields
> case class Product(productID: Option[Int],
>productName: Option[String])
> val productExtract: Dataset[Product] =
> spark.createDataset(Seq(
>   Product(
> productID = Some(6050286),
> // user mistake here, should be `None` not `Some(null)`
> productName = Some(null)
>   )))
> productExtract.count()
> {code}
> will give an error like the one below.  This error is thrown from quite deep 
> down, but there should be some handling logic further up to check for nulls 
> and to give a more informative error message.  E.g. it could tell the user 
> which field is null, or it could detect the `Some(null)` error and suggest 
> using `None`.
> Whatever the exception, it shouldn't be an NPE; since this is clearly a user 
> error, it should be some kind of user-error exception.
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in 
> stage 1.0 (TID 276, 10.139.64.8, executor 1): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:194)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:620)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:112)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:384)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> I've seen quite a few other people with this error, but I don't think it's 
> for the same reason:
> https://docs.databricks.com/spark/latest/data-sources/tips/redshift-npe.html
> https://groups.google.com/a/lists.datastax.com/forum/#!topic/spark-connector-user/Dt6ilC9Dn54
> https://issues.apache.org/jira/browse/SPARK-17195
> https://issues.apache.org/jira/browse/SPARK-18859
> https://github.com/datastax/spark-cassandra-connector/issues/1062
> https://stackoverflow.com/questions/39875711/spark-sql-2-0-nullpointerexception-with-a-valid-postgresql-query



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26623) Need a transpose function

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26623.
---
Resolution: Won't Fix

Yeah, I've never heard of this type of function. The use cases like this that I 
can think of are what pivot is for in DBs.
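
To illustrate the pivot route, a rough sketch in Scala that turns long-format 
rows of (id, metric, value) into one column per metric; the column names are made 
up, and a true full transpose of an arbitrary DataFrame would need the column 
axis collected to the driver, which is part of why it isn't built in:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

object PivotSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    // Long format: one row per (id, metric) pair.
    val longDf = Seq(
      ("a", "height", 1.0),
      ("a", "weight", 2.0),
      ("b", "height", 3.0)
    ).toDF("id", "metric", "value")

    // Wide format: one row per id, one column per metric value.
    val wideDf = longDf.groupBy("id").pivot("metric").agg(first("value"))
    wideDf.show()

    spark.stop()
  }
}
{code}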

> Need a transpose function 
> --
>
> Key: SPARK-26623
> URL: https://issues.apache.org/jira/browse/SPARK-26623
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Anubhav Jain
>Priority: Minor
>
> Can we expect a transpose function which can be used to transpose a DataFrame 
> or Dataset in PySpark?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26991) Investigate difference of `returnNullable` between ScalaReflection.deserializerFor and JavaTypeInference.deserializerFor

2019-03-01 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782128#comment-16782128
 ] 

Jungtaek Lim commented on SPARK-26991:
--

The outcome would be
 * "invalid" if there's a legitimate reason for the Java side to use `returnNullable 
= true`
 * a new patch if there's no reason for the Java side to use `returnNullable = true` 
separately and it can follow the Scala side

Btw, is it discouraged to file an issue for a TODO task? I agree this wouldn't be 
an issue after investigation: this is just so the TODO isn't missed - given this 
issue is open to the public, someone other than me can investigate it.

> Investigate difference of `returnNullable` between 
> ScalaReflection.deserializerFor and JavaTypeInference.deserializerFor
> 
>
> Key: SPARK-26991
> URL: https://issues.apache.org/jira/browse/SPARK-26991
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue tracks the effort to investigate the difference between 
> ScalaReflection.deserializerFor and JavaTypeInference.deserializerFor, 
> especially the reason why the Java side uses `returnNullable = true` whereas 
> the Scala side uses `returnNullable = false`.
> The origin discussion is linked here:
> https://github.com/apache/spark/pull/23854#discussion_r260117702



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26624) Different classloader use on subsequent call to same query, causing different behavior

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26624.
---
Resolution: Duplicate

> Different classloader use on subsequent call to same query, causing different 
> behavior
> --
>
> Key: SPARK-26624
> URL: https://issues.apache.org/jira/browse/SPARK-26624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bimalendu Choudhary
>Priority: Major
>
> When calling a Hive UDF function from the spark shell, we get the output when we 
> call the query the first time, but when we call the query again it gives the 
> following error:
> #spark2-shell
> scala> spark.sql("select test(name) from customers limit 2").show (50, false)
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.vnb.fgp.generic.udf.encrypt.EncryptGenericUDF': 
> We have not provided the UDF jar files on the command line, but still we get 
> the output. The function test is created in Hive service as a permanent 
> function using the jar file.
> Debugging it further we see that on first invocation of the select command 
> the  following classLoader is being used and it has a path pointing to the 
> hdfs directory as set in Hive service:
> loader:  
> org.apache.spark.sql.internal.NonClosableMutableURLClassLoader@42cef0af
> hdfs:/tmp/bimal/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar
> file:/usr/java/jdk1.8.0_162/jre/lib/resources.jar
> file:/usr/java/jdk1.8.0_162/jre/lib/rt.jar
> On subsequent calls, a different class loader is being used:
> loader scala.tools.nsc.interpreter.IMain$TranslatingClassLoader@7bc3ec95
> file:/usr/java/jdk1.8.0_162/jre/lib/resources.jar
> file:/usr/java/jdk1.8.0_162/jre/lib/rt.jar
> file:/usr/java/jdk1.8.0_162/jre/lib/jsse.jar
> file:/usr/java/jdk1.8.0_162/jre/lib/jce.jar
> This does not have the hdfs path for the jar file and hence the exception is 
> generated.
> Most probably the classloader is picking things from Hive metastore.
> If we pass the UDF jar files on command line using --jars option, everything 
> works fine. 
> But this indicates that the classLoader and classpath are different between 
> the first and second calls, causing inconsistent behavior and problems.
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26693) Large Numbers Truncated

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26693.
---
Resolution: Cannot Reproduce

Assuming the zeppelin interpretation is correct for now
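
One way to confirm that the values survive inside Spark itself (leaving only the notebook rendering in question) is to collect them back into Python as strings. A minimal PySpark sketch, assuming the global_temp.testTable view from the reproduction code quoted below:

{code:java}
# Collect the raw values, bypassing any display-layer formatting.
# If these print with all 19 digits intact, the truncation happens in the
# notebook's rendering, not in Spark.
rows = spark.sql("select * from global_temp.testTable").collect()
for r in rows:
    print([str(v) for v in r])
{code}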

> Large Numbers Truncated 
> 
>
> Key: SPARK-26693
> URL: https://issues.apache.org/jira/browse/SPARK-26693
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Code was run in Zeppelin using Spark 2.4.
>Reporter: Jason Blahovec
>Priority: Major
>
> We have a process that takes a file dumped from an external API and formats 
> it for use in other processes.  These API dumps are brought into Spark with 
> all fields read in as strings.  One of the fields is a 19 digit visitor ID.  
> Since implementing Spark 2.4 a few weeks ago, we have noticed that dataframes 
> read the 19 digits correctly but any function in SQL appears to truncate the 
> last two digits and replace them with "00".  
> Our process is set up to convert these numbers to bigint, which worked before 
> Spark 2.4.  We looked into data types, and the possibility of changing to a 
> "long" type with no luck.  At that point we tried bringing in the string 
> value as is, with the same result.  I've added code that should replicate the 
> issue with a few 19 digit test cases and demonstrating the type conversions I 
> tried.
> Results for the code below are shown here:
> dfTestExpanded.show:
> +---+---+---+ | idAsString| 
> idAsBigint| idAsLong| 
> +---+---+---+ 
> |4065453307562594031|4065453307562594031|4065453307562594031| 
> |765995720523059|765995720523059|765995720523059| 
> |1614560078712787995|1614560078712787995|1614560078712787995| 
> +---+---+---+
> Run this query in a paragraph:
> %sql
> select * from global_temp.testTable
> and see these results (all 3 columns):
> 4065453307562594000
> 765995720523000
> 1614560078712788000
>  
> Another notable observation was that this issue does not appear to affect 
> joins on the affected fields - we are seeing issues when the fields are used 
> in where clauses or as part of a select list.
>  
>  
> {code:java}
> // code placeholder
> %pyspark
> from pyspark.sql.functions import *
> sfTestValue = StructField("testValue",StringType(), True)
> schemaTest = StructType([sfTestValue])
> listTestValues = []
> listTestValues.append(("4065453307562594031",))
> listTestValues.append(("765995720523059",))
> listTestValues.append(("1614560078712787995",))
> dfTest = spark.createDataFrame(listTestValues, schemaTest)
> dfTestExpanded = dfTest.selectExpr(\
> "testValue as idAsString",\
> "cast(testValue as bigint) as idAsBigint",\
> "cast(testValue as long) as idAsLong")
> dfTestExpanded.show() ##This will show three columns of data correctly.
> dfTestExpanded.createOrReplaceGlobalTempView('testTable') ##When this table 
> is viewed in a %sql paragraph, the truncated values are shown.{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26701) spark thrift server driver memory leak

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26701.
---
Resolution: Duplicate

> spark thrift server driver memory leak
> --
>
> Key: SPARK-26701
> URL: https://issues.apache.org/jira/browse/SPARK-26701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: wangdabin
>Priority: Major
>
> When using the spark thrift server, the driver memory is getting bigger and 
> bigger, and finally the memory overflows. The memory analysis results show 
> that the SparkSQLOperationManager handleToOperation object is not released, 
> resulting in memory leaks, and the final result is service downtime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26736) if filter condition has rand() function it does not do partition prunning

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26736:
--
Priority: Minor  (was: Major)

> if filter condition has rand() function it does not do partition prunning
> -
>
> Key: SPARK-26736
> URL: https://issues.apache.org/jira/browse/SPARK-26736
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: roncenzhao
>Priority: Minor
>
> Example:
> A  partitioned table definition:
> _create table test(id int) partitioned by (dt string);_
> The following sql does not do partition prunning:
> _select * from test where dt='20190101' and rand() < 0.5;_
>  
> I think it should do partition prunning in this case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26791) Some scala codes doesn't show friendly and some description about foreachBatch is misleading

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26791.
---
Resolution: Not A Problem

I read the docs and am not clear what the issue is. The scala code is fine and 
matches the other languages, and there is no call to an "uncache()" method
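
For context, the pattern that section of the guide describes, sketched here in PySpark with the method names as they actually exist in the 2.4 API (persist/unpersist, not "uncache"); streaming_df and the parquet sink paths are placeholders:

{code:java}
def write_batch(batch_df, batch_id):
    # Cache the micro-batch so it can be written to multiple sinks
    # without being recomputed, then release it.
    batch_df.persist()
    batch_df.write.format("parquet").mode("append").save("/tmp/sink1")  # placeholder sink
    batch_df.write.format("parquet").mode("append").save("/tmp/sink2")  # placeholder sink
    batch_df.unpersist()

query = streaming_df.writeStream.foreachBatch(write_batch).start()
{code}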

> Some scala codes doesn't show friendly and some description about 
> foreachBatch is misleading
> 
>
> Key: SPARK-26791
> URL: https://issues.apache.org/jira/browse/SPARK-26791
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.0
> Environment: NA
>Reporter: chaiyongqiang
>Priority: Minor
> Attachments: foreachBatch.jpg, multi-watermark.jpg
>
>
> [Introduction about 
> foreachbatch|http://spark.apache.org/docs/2.4.0/structured-streaming-programming-guide.html#foreachbatch]
> [Introduction about 
> policy-for-handling-multiple-watermarks|http://spark.apache.org/docs/2.4.0/structured-streaming-programming-guide.html#policy-for-handling-multiple-watermarks]
> The introduction about foreachBatch and 
> policy-for-handling-multiple-watermarks doesn't look good with the scala code.
> Besides, when taking about foreachBatch using the uncache api which doesn't 
> exists, it may be misleading.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26883) Spark MLIB Logistic Regression with heavy class imbalance estimates 0 coefficients

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782099#comment-16782099
 ] 

Sean Owen commented on SPARK-26883:
---

My guess: something goes wrong when a partition has 0 of the 'rare' outcome. 
While it's not going to give exactly the same output as scikit, this seems too 
different of course. You could probably verify by seeing what happens if all 
the data is in one partition?
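
A quick way to test that hypothesis, assuming the sdf_rare dataframe and the lr estimator from the reproduction code quoted below: refit with all data in a single partition and compare the coefficient.

{code:java}
# Refit the rare case with everything in one partition; if the coefficient
# is no longer 0, partition-level behaviour is the likely culprit.
model_single = lr.fit(sdf_rare.coalesce(1))
print(model_single.coefficients, model_single.intercept)
{code}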

> Spark MLIB Logistic Regression with heavy class imbalance estimates 0 
> coefficients
> --
>
> Key: SPARK-26883
> URL: https://issues.apache.org/jira/browse/SPARK-26883
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.3.2
>Reporter: GAURAV BHATIA
>Priority: Major
>
> Minimal example is below.
> Basically, when the frequency of positives becomes low, the coefficients out 
> of spark.ml.classification.LogisticRegression become 0, deviating from the 
> corresponding sklearn results.
> I have not been able to find any parameter setting or documentation that 
> describes why this happens or how I can alter the behavior. 
> I'd appreciate any help in debugging. Thanks in advance!
>  
> Here, we set up the code to create the two sample scenarios. In both cases a 
> binary outcome is fit to a single binary predictor using logistic regression. 
> The effect of the binary predictor is to approximately 10x the probability of 
> a positive (1) outcome. The only difference between the "common" and "rare" 
> cases is the base frequency of the positive outcome. In the "common" case it 
> is 0.01, in the "rare" case it is 1e-4.
>  
> {code:java}
>  
> import pandas as pd
> import numpy as np
> import math
> def sampleLogistic(p0, p1, p1prev,size):
>  intercept = -1*math.log(1/p0 - 1)
>  coefficient = -1*math.log(1/p1 - 1) - intercept
>  x = np.random.choice([0, 1], size=(size,), p=[1 - p1prev, p1prev])
>  freq= 1/(1 + np.exp(-1*(intercept + coefficient*x)))
>  y = (np.random.uniform(size=size) < freq).astype(int)
>  df = pd.DataFrame({'x':x, 'y':y})
>  return(df)
> df_common = sampleLogistic(0.01,0.1,0.1,10)
> df_rare = sampleLogistic(0.0001,0.001,0.1,10){code}
>  
> Using sklearn:
>  
> {code:java}
> from sklearn.linear_model import LogisticRegression
> l = 0.3
> skmodel = LogisticRegression(
> fit_intercept=True,
> penalty='l2',
> C=1/l,
> max_iter=100,
> tol=1e-11,
> solver='lbfgs',verbose=1)
> skmodel.fit(df_common[['x']], df_common.y)
> print(skmodel.coef_, skmodel.intercept_)
> skmodel.fit(df_rare[['x']], df_rare.y)
> print(skmodel.coef_, skmodel.intercept_)
> {code}
> In one run of the simulation, this prints:
>  
>  
> {noformat}
> [[ 2.39497867]] [-4.58143701] # the common case 
> [[ 1.84918485]] [-9.05090438] # the rare case{noformat}
> Now, using PySpark for the common case:
>  
>  
> {code:java}
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.feature import VectorAssembler
> n = len(df_common.index)
> sdf_common = spark.createDataFrame(df_common)
> assembler = VectorAssembler(inputCols=['x'], outputCol="features")
> sdf_common = assembler.transform(sdf_common)
> # Make regularization 0.3/10=0.03
> lr = 
> LogisticRegression(regParam=l/n,labelCol='y',featuresCol='features',tol=1e-11/n,maxIter=100,standardization=False)
> model = lr.fit(sdf_common)
> print(model.coefficients, model.intercept)
> {code}
>  
> This prints:
>  
> {code:java}
> [2.39497214622] -4.5814342575166505 # nearly identical to the common case 
> above
> {code}
> Pyspark for the rare case:
>  
>  
> {code:java}
> n = len(df_rare.index)
> sdf_rare = spark.createDataFrame(df_rare)
> assembler = VectorAssembler(inputCols=['x'], outputCol="features")
> sdf_rare = assembler.transform(sdf_rare)
> # Make regularization 0.3/10=0.03
> lr = 
> LogisticRegression(regParam=l/n,labelCol='y',featuresCol='features',tol=1e-11/n,maxIter=100,standardization=False)
> model = lr.fit(sdf_rare)
> print(model.coefficients,model.intercept)
> {code}
> This prints:
>  
>  
> {noformat}
> [0.0] -8.62237369087212 # where does the 0 come from??
> {noformat}
>  
>  
> To verify that the data frames have the properties that we discussed:
> {code:java}
> sdf_common.describe().show()
> +---+--+--+
> |summary| x| y|
> +---+--+--+
> |  count|10|10|
> |   mean|   0.10055|   0.01927|
> | stddev|0.3007334399530905|0.1374731104200414|
> |min| 0| 0|
> |max| 1| 1|
> +---+--+--+
> sdf_rare.describe().show()
> +---+--++
> |summary| x|   

[jira] [Assigned] (SPARK-26807) Confusing documentation regarding installation from PyPi

2019-03-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26807:


Assignee: Apache Spark

> Confusing documentation regarding installation from PyPi
> 
>
> Key: SPARK-26807
> URL: https://issues.apache.org/jira/browse/SPARK-26807
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Emmanuel Arias
>Assignee: Apache Spark
>Priority: Trivial
>
> Hello!
> I am new using Spark. Reading the documentation I think that is a little 
> confusing on Downloading section.
> [ttps://spark.apache.org/docs/latest/#downloading|https://spark.apache.org/docs/latest/#downloading]
>  write: "Scala and Java users can include Spark in their projects using its 
> Maven coordinates and in the future Python users can also install Spark from 
> PyPI.", I interpret that currently Spark is not on PyPi yet. But  
> [https://spark.apache.org/downloads.html] write: 
> "[PySpark|https://pypi.python.org/pypi/pyspark] is now available in pypi. To 
> install just run {{pip install pyspark}}."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26807) Confusing documentation regarding installation from PyPi

2019-03-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26807:


Assignee: (was: Apache Spark)

> Confusing documentation regarding installation from PyPi
> 
>
> Key: SPARK-26807
> URL: https://issues.apache.org/jira/browse/SPARK-26807
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Emmanuel Arias
>Priority: Trivial
>
> Hello!
> I am new using Spark. Reading the documentation I think that is a little 
> confusing on Downloading section.
> [ttps://spark.apache.org/docs/latest/#downloading|https://spark.apache.org/docs/latest/#downloading]
>  write: "Scala and Java users can include Spark in their projects using its 
> Maven coordinates and in the future Python users can also install Spark from 
> PyPI.", I interpret that currently Spark is not on PyPi yet. But  
> [https://spark.apache.org/downloads.html] write: 
> "[PySpark|https://pypi.python.org/pypi/pyspark] is now available in pypi. To 
> install just run {{pip install pyspark}}."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26807) Confusing documentation regarding installation from PyPi

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26807:
--
Priority: Trivial  (was: Minor)

I'll just fix it, to expedite.

> Confusing documentation regarding installation from PyPi
> 
>
> Key: SPARK-26807
> URL: https://issues.apache.org/jira/browse/SPARK-26807
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Emmanuel Arias
>Priority: Trivial
>
> Hello!
> I am new using Spark. Reading the documentation I think that is a little 
> confusing on Downloading section.
> [ttps://spark.apache.org/docs/latest/#downloading|https://spark.apache.org/docs/latest/#downloading]
>  write: "Scala and Java users can include Spark in their projects using its 
> Maven coordinates and in the future Python users can also install Spark from 
> PyPI.", I interpret that currently Spark is not on PyPi yet. But  
> [https://spark.apache.org/downloads.html] write: 
> "[PySpark|https://pypi.python.org/pypi/pyspark] is now available in pypi. To 
> install just run {{pip install pyspark}}."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26815) run command "Spark-shell --proxy-user " failed in kerberos environment

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26815.
---
Resolution: Cannot Reproduce

> run command "Spark-shell --proxy-user " failed in kerberos environment
> --
>
> Key: SPARK-26815
> URL: https://issues.apache.org/jira/browse/SPARK-26815
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: KaiXinXIaoLei
>Priority: Major
> Attachments: image-2019-03-01-18-38-18-173.png, spark-shell-error.png
>
>
> I run "spark-shell --proxy-user hdfs" in a kerberos environment, and when I run 
> spark.sql("show tables"), there is an error:
> `Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Could not 
> connect to meta store using any of the URIs provided. Most recent failure: 
> org.apache.thrift.transport.TTransportException: GSS initiate failed
>  at 
> org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
>  at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:316)
>  at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>  at org.apache.hadoop.hive.thrift.cli`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26828) Coalesce to reduce partitions before writing to hive is not working

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26828.
---
Resolution: Cannot Reproduce

Not enough info here.

> Coalesce to reduce partitions before writing to hive is not working
> ---
>
> Key: SPARK-26828
> URL: https://issues.apache.org/jira/browse/SPARK-26828
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Anusha Buchireddygari
>Priority: Minor
>
> final_store.coalesce(5).write.mode("overwrite").insertInto("database.tablename",overwrite
>  = True), this statement is not merging partitions. I've set 
> .config("spark.default.parallelism", "2000") \
> .config("spark.sql.shuffle.partitions", "2000") \
> however repartition is working but takes 20-25 minutes to insert.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26943) Weird behaviour with `.cache()`

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26943.
---
Resolution: Cannot Reproduce

I don't think this is a bug, or at least, I can think of other reasons this 
happens.

Your transformation and/or data have some problem (see the error). It doesn't 
come up in .count() because, for example, Spark can avoid actually parsing the 
data if you just want to know how many things there are. To cache it requires 
persisting its representation in memory and actually parsing it, and so that's 
why it comes up.
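
A minimal sketch of that difference, with hypothetical file and column names (the report's data is not available here), assuming an existing SparkSession named spark: counting rows never needs the derived column, but caching materializes every column and surfaces the parse error.

{code:java}
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

parse_number = udf(lambda s: float(s), DoubleType())   # raises on values like "(N/A)"
sdf = (spark.read.option("header", True).csv("data.csv")
            .withColumn("price_num", parse_number("price")))

sdf.count()          # the derived column is never computed, so this can succeed
sdf.cache().count()  # caching evaluates all columns and hits the bad value
{code}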

> Weird behaviour with `.cache()`
> ---
>
> Key: SPARK-26943
> URL: https://issues.apache.org/jira/browse/SPARK-26943
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Will Uto
>Priority: Major
>
>  
> {code:java}
> sdf.count(){code}
>  
> works fine. However:
>  
> {code:java}
> sdf = sdf.cache()
> sdf.count()
> {code}
>  does not, and produces error
> {code:java}
> Py4JJavaError: An error occurred while calling o314.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 
> in stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 
> (TID 438, uat-datanode-02, executor 1): java.text.ParseException: Unparseable 
> number: "(N/A)"
>   at java.text.NumberFormat.parse(NumberFormat.java:350)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26829) In place standard scaler so the column remains same after transformation

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26829.
---
Resolution: Won't Fix

> In place standard scaler so the column remains same after transformation
> 
>
> Key: SPARK-26829
> URL: https://issues.apache.org/jira/browse/SPARK-26829
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.2
>Reporter: Santokh Singh
>Priority: Major
>
> StandardScaler and some similar transformations take an input column name and 
> produce a new column, either accepting an output column name or generating a 
> new one with a random name after performing the transformation.
> An "inplace" flag set to true would not generate a new column in the output 
> dataframe after the transformation, preserving the schema of the df.
> An "inplace" flag set to false would work the way it currently works.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26863) Add minimal values for spark.driver.memory and spark.executor.memory

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782106#comment-16782106
 ] 

Sean Owen commented on SPARK-26863:
---

Sure but is it really a comment about the default, or limits on the non-default 
values you can specify?
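
For what it's worth, the arithmetic described in the report below checks out:

{code:java}
reserved = 300 * 1024 * 1024            # RESERVED_SYSTEM_MEMORY_BYTES
min_system_memory = int(reserved * 1.5)
print(min_system_memory)                # 471859200 bytes
print(round(min_system_memory / 1e6))   # ~472 (MB), the value quoted in the report
{code}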

> Add minimal values for spark.driver.memory and spark.executor.memory
> 
>
> Key: SPARK-26863
> URL: https://issues.apache.org/jira/browse/SPARK-26863
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: oskarryn
>Priority: Trivial
>
> I propose to change `1g` to `1g, with minimum of 472m` in "Default" column 
> for spark.driver.memory and spark.executor.memory properties in [Application 
> Properties](https://spark.apache.org/docs/latest/configuration.html#application-properties).
> Reasoning:
> In UnifiedMemoryManager.scala file I see definition of 
> RESERVED_SYSTEM_MEMORY_BYTES:
> {code:scala}
> // Set aside a fixed amount of memory for non-storage, non-execution purposes.
> // This serves a function similar to `spark.memory.fraction`, but guarantees 
> that we reserve
> // sufficient memory for the system even for small heaps. E.g. if we have a 
> 1GB JVM, then
> // the memory used for execution and storage will be (1024 - 300) * 0.6 = 
> 434MB by default.
> private val RESERVED_SYSTEM_MEMORY_BYTES = 300 * 1024 * 1024
> {code}
> Then `reservedMemory` takes on this value and also `minSystemMemory` is 
> defined: 
> {code:scala}
> val minSystemMemory = (reservedMemory * 1.5).ceil.toLong
> {code}
> Consequently driver heap size and executor memory are checked if they are 
> bigger than  minSystemMemory (471859200B) or IllegalArgumentException is 
> thrown. It seems that 472MB is absolute minimum for spark.driver.memory and 
> spark.executor.memory. 
> Side question: how is this 472MB established as sufficient memory for small 
> heaps? What do I risk if I build Spark with smaller 
> RESERVED_SYSTEM_MEMORY_BYTES?
> EDIT: I actually just tried to set spark.driver.memory to 472m and it turns 
> out the systemMemory variable was 440401920 not 471859200, so the exception 
> persists (bug?). It only works when spark.driver.memory is set to at least 
> 505m to have systemMemory >= minSystemMemory. I don't know why is it the case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26867) Spark Support of YARN Placement Constraint

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782105#comment-16782105
 ] 

Sean Owen commented on SPARK-26867:
---

How would Spark use it?

> Spark Support of YARN Placement Constraint
> --
>
> Key: SPARK-26867
> URL: https://issues.apache.org/jira/browse/SPARK-26867
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, YARN
>Affects Versions: 3.0.0
>Reporter: Prabhu Joseph
>Priority: Major
>
> YARN provides Placement Constraint Features - where application can request 
> containers based on affinity / anti-affinity / cardinality to services or 
> other application containers / node attributes. This is a useful feature for 
> Spark Jobs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26876) Spark repl scala test failure on big-endian system

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26876.
---
Resolution: Duplicate

Though not the exact same issue, this is basically the same question, and this 
JIRA is more a question than a report with detail.

> Spark repl scala test failure on big-endian system
> --
>
> Key: SPARK-26876
> URL: https://issues.apache.org/jira/browse/SPARK-26876
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.2
>Reporter: salamani
>Priority: Major
> Attachments: repl_scala_issue.txt
>
>
> I have built Spark 2.3.2 from source on a big-endian system and observed the 
> following test failure in the Spark 2.3.2 repl Scala tests on that system. 
> Please find the log attached.
>  
> How should I go about resolving these issues, or are they known issues for the 
> big-endian platform? How important is this failure?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26885) Remove yyyy/yyyy-[d]d format in DataTimeUtils for stringToTimestamp and stringToDate

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26885:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Remove yyyy/yyyy-[d]d format in DataTimeUtils for stringToTimestamp and 
> stringToDate
> 
>
> Key: SPARK-26885
> URL: https://issues.apache.org/jira/browse/SPARK-26885
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Darcy Shen
>Priority: Minor
>
> stringToTimestamp, stringToDate support yyyy, as a result:
> select year("1912") => 1912
> select month("1912") => 1
> select hour("1912") => 0
>  
> In Presto or Hive, 
> select year("1912") => null
> select month("1912") => null
> select hour("1912") => null
>  
> It is not a good idea to support yyyy for a Date/DateTime.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26896) Add maven profiles for running tests with JDK 11

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782096#comment-16782096
 ] 

Sean Owen commented on SPARK-26896:
---

Reproducing my comments from elsewhere: I don't think we want to have users set 
flags to get Spark to run if at all possible. So far, haven't seen something 
that we can't work around 'correctly', but we haven't finished finding all the 
issues.

We'd always build with Java 8 BTW until Java 8 support is dropped. I agree we 
might need a new profile just for tests if it's needed to set java.version or 
something, but hopefully not flags.

> Add maven profiles for running tests with JDK 11
> 
>
> Key: SPARK-26896
> URL: https://issues.apache.org/jira/browse/SPARK-26896
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> Running unit tests w/ JDK 11 trips over some issues w/ the new module system. 
>  These can be worked around with the new {{--add-opens}} etc. commands.  I 
> think we need to add a build profile for JDK 11 to add some extra args to the 
> test runners.
> In particular:
> 1) removal of jaxb from java itself (used in pmml export in mllib)
> 2) Some reflective access which results in failures, eg. 
> {noformat}
> Unable to make field jdk.internal.ref.PhantomCleanable
> jdk.internal.ref.PhantomCleanable.prev accessible: module java.base does
> not "opens jdk.internal.ref" to unnamed module
> {noformat}
> 3) Some reflective access which results in warnings (you can add 
> {{--illegal-access=warn}} to see all of these).
> All I'm proposing we do here is put in the required handling to make these 
> problems go away, not necessarily do the "right" thing by no longer 
> referencing these unexposed internals.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26899) CountMinSketchAgg ExpressionDescription is not so correct

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26899.
---
Resolution: Not A Problem

This isn't "Major", and I don't think it's a doc problem in a comment, and I 
don't think it's wrong: it's just stating what count min sketch is for in 
general.

> CountMinSketchAgg ExpressionDescription is not so correct
> -
>
> Key: SPARK-26899
> URL: https://issues.apache.org/jira/browse/SPARK-26899
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: tomzhu
>Priority: Major
>
> Hi all, there is a not-so-correct comment in CountMinSketchAgg.scala; 
> the ExpressionDescription says:
> {code:java}
> @ExpressionDescription(
>   usage = """
>   _FUNC_(col, eps, confidence, seed) - Returns a count-min sketch of a 
> column with the given esp,
>   confidence and seed. The result is an array of bytes, which can be 
> deserialized to a
>   `CountMinSketch` before usage. Count-min sketch is a probabilistic data 
> structure used for
>   cardinality estimation using sub-linear space.
>   """,
>   since = "2.2.0")
> {code}
> , *the Count-min sketch is a probabilistic data structure used for 
> cardinality estimation*; actually, count-min sketch is mainly used for point 
> queries and self-join size queries, so how can it support cardinality 
> estimation? A fix might be better.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26906) Pyspark RDD Replication Potentially Not Working

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782093#comment-16782093
 ] 

Sean Owen commented on SPARK-26906:
---

I can't reproduce this on 2.4.0. It shows "2x replicated". What are you 
running, and is it local or on a cluster?
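
For reference, a way to check the requested replication level from the API itself, independent of what the Storage tab renders (using a smaller range than the report's, purely to keep the check quick):

{code:java}
import pyspark

rdd = sc.range(10**6).map(lambda x: x)
rdd.persist(pyspark.StorageLevel.DISK_ONLY_2)
rdd.count()
print(rdd.getStorageLevel())  # should report 2x replication
{code}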

> Pyspark RDD Replication Potentially Not Working
> ---
>
> Key: SPARK-26906
> URL: https://issues.apache.org/jira/browse/SPARK-26906
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Web UI
>Affects Versions: 2.3.2
> Environment: I am using Google Cloud's Dataproc version [1.3.19-deb9 
> 2018/12/14|https://cloud.google.com/dataproc/docs/release-notes#december_14_2018]
>  (version 2.3.2 Spark and version 2.9.0 Hadoop) with version Debian 9, with 
> python version 3.7. PySpark shell is activated using pyspark --num-executors 
> = 100
>Reporter: Han Altae-Tran
>Priority: Minor
> Attachments: spark_ui.png
>
>
> Pyspark RDD replication doesn't seem to be functioning properly. Even with a 
> simple example, the UI reports only 1x replication, despite using the flag 
> for 2x replication
> {code:java}
> rdd = sc.range(10**9)
> mapped = rdd.map(lambda x: x)
> mapped.persist(pyspark.StorageLevel.DISK_ONLY_2) \\ PythonRDD[1] at RDD at 
> PythonRDD.scala:52
> mapped.count(){code}
>  
> Interestingly, if you catch the UI page at just the right time, you see that 
> it starts off 2x replicated, but ends up 1x replicated afterward. Perhaps the 
> RDD is replicated, but it is just the UI that is unable to register this.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782085#comment-16782085
 ] 

Sean Owen commented on SPARK-26944:
---

I can see things like 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.7/315/artifact/target/unit-tests.log
  but not the python one. [~shaneknapp] just in case there's an easy way to 
make Jenkins save that?

> Python unit-tests.log not available in artifacts for a build in Jenkins
> ---
>
> Key: SPARK-26944
> URL: https://issues.apache.org/jira/browse/SPARK-26944
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Priority: Minor
>
> I had a pr where the python unit tests failed.  The tests point at the 
> `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, 
> but I can't get to that from jenkins UI it seems (are all prs writing to the 
> same file?).
> {code:java}
> 
> Running PySpark tests
> 
> Running PySpark tests. Output is in 
> /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code}
> For reference, please see this build: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console
> This Jira is to make it available under the artifacts for each build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26980) Kryo deserialization not working with KryoSerializable class

2019-03-01 Thread Alexis Sarda-Espinosa (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782082#comment-16782082
 ] 

Alexis Sarda-Espinosa commented on SPARK-26980:
---

Kryo works fine if used directly ([example 
here|https://github.com/asardaes/hello-spark-kotlin/blob/master/src/test/kotlin/hello/spark/kotlin/MainTest.kt#L21]),
 it just breaks when the data goes through Spark.

> Kryo deserialization not working with KryoSerializable class
> 
>
> Key: SPARK-26980
> URL: https://issues.apache.org/jira/browse/SPARK-26980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local Spark v2.4.0
> Kotlin v1.3.21
>Reporter: Alexis Sarda-Espinosa
>Priority: Minor
>  Labels: kryo, serialization
>
> I'm trying to create an {{Aggregator}} that uses a custom container that 
> should be serialized with {{Kryo:}} 
> {code:java}
> class StringSet(other: Collection) : HashSet(other), 
> KryoSerializable {
> companion object {
> @JvmStatic
> private val serialVersionUID = 1L
> }
> constructor() : this(Collections.emptyList())
> override fun write(kryo: Kryo, output: Output) {
> output.writeInt(this.size)
> for (string in this) {
> output.writeString(string)
> }
> }
> override fun read(kryo: Kryo, input: Input) {
> val size = input.readInt()
> repeat(size) { this.add(input.readString()) }
> }
> }
> {code}
> However, if I look at the corresponding value in the {{Row}} after 
> aggregation (for example by using {{collectAsList()}}), I see a {{byte[]}}. 
> Interestingly, the first byte in that array seems to be some sort of noise, 
> and I can deserialize by doing something like this: 
> {code:java}
> val b = row.getAs(2)
> val input = Input(b.copyOfRange(1, b.size)) // extra byte?
> val set = Kryo().readObject(input, StringSet::class.java)
> {code}
> Used configuration: 
> {code:java}
> SparkConf()
> .setAppName("Hello Spark with Kotlin")
> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> .set("spark.kryo.registrationRequired", "true")
> .registerKryoClasses(arrayOf(StringSet::class.java))
> {code}
> [Sample repo with all the 
> code|https://github.com/asardaes/hello-spark-kotlin/tree/8e8a54fd81f0412507318149841c69bb17d8572c].
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27026) Upgrade Docker image for release build to Ubuntu 18.04

2019-03-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27026:


Assignee: (was: Apache Spark)

> Upgrade Docker image for release build to Ubuntu 18.04
> --
>
> Key: SPARK-27026
> URL: https://issues.apache.org/jira/browse/SPARK-27026
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26984) Incompatibility between Spark releases - Some(null)

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26984.
---
Resolution: Not A Problem

I agree, I don't think "Some(null)" is reasonable. You mean None, right?

> Incompatibility between Spark releases - Some(null) 
> 
>
> Key: SPARK-26984
> URL: https://issues.apache.org/jira/browse/SPARK-26984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Linux CentOS, Databricks.
>Reporter: Gerard Alexander
>Priority: Minor
>  Labels: newbie
>
> Please refer to 
> [https://stackoverflow.com/questions/54851205/why-does-somenull-throw-nullpointerexception-in-spark-2-4-but-worked-in-2-2/54861152#54861152.]
> NB: Not sure of priority being correct - no doubt one will evaluate.
> It is noted that the following:
> {code}
> val df = Seq(
>   (1, Some("a"), Some(1)),
>   (2, Some(null), Some(2)),
>   (3, Some("c"), Some(3)),
>   (4, None, None)).toDF("c1", "c2", "c3")}
> {code}
> In Spark 2.2.1 (on mapr) the {{Some(null)}} works fine, in Spark 2.4.0 on 
> Databricks an error ensues.
> {code}
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException assertnotnull(assertnotnull(input[0, 
> scala.Tuple3, true]))._1 AS _1#6 staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> unwrapoption(ObjectType(class java.lang.String), 
> assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._2), true, false) 
> AS _2#7 unwrapoption(IntegerType, assertnotnull(assertnotnull(input[0, 
> scala.Tuple3, true]))._3) AS _3#8 at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:293)
>  at 
> org.apache.spark.sql.SparkSession.$anonfun$createDataset$1(SparkSession.scala:472)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233) at 
> scala.collection.immutable.List.foreach(List.scala:388) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:233) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:226) at 
> scala.collection.immutable.List.map(List.scala:294) at 
> org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:472) at 
> org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377) at 
> org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:228)
>  ... 57 elided Caused by: java.lang.NullPointerException at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:289)
>  ... 66 more
> {code}
> You can argue it is solvable otherwise, but there may well be an existing 
> code base that could be affected.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782076#comment-16782076
 ] 

Sean Owen commented on SPARK-26947:
---

How big is k? yes, you're going to run out of memory eventually if you have 
enough centroids and they're large enough. I'm not sure that's a bug, unless 
this is a really surprisingly small k.
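
The failing call in the trace below is the transfer of the fitted centers back to Python (MLSerDe.dumps via clusterCenters()), so a rough size estimate helps judge whether the OOM is surprising; the numbers here are purely illustrative:

{code:java}
# Back-of-the-envelope: k dense centers of dimension d, 8 bytes per double,
# all serialized at once for the driver-to-Python transfer.
k, d = 50000, 10000          # hypothetical values
approx_gb = k * d * 8 / 1e9
print(f"~{approx_gb:.1f} GB just for the centers")  # ~4.0 GB in this example
{code}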

> Pyspark KMeans Clustering job fails on large values of k
> 
>
> Key: SPARK-26947
> URL: https://issues.apache.org/jira/browse/SPARK-26947
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
> Attachments: clustering_app.py
>
>
> We recently had a case where a user's pyspark job running KMeans clustering 
> was failing for large values of k. I was able to reproduce the same issue 
> with dummy dataset. I have attached the code as well as the data in the JIRA. 
> The stack trace is printed below from Java:
>  
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3332)
>   at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
>   at java.lang.StringBuilder.append(StringBuilder.java:202)
>   at py4j.Protocol.getOutputCommand(Protocol.java:328)
>   at py4j.commands.CallCommand.execute(CallCommand.java:81)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> Python:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 985, in send_command
> response = connection.send_command(command)
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
>   File "clustering_app.py", line 154, in 
> main(args)
>   File "clustering_app.py", line 145, in main
> run_clustering(sc, args.input_path, args.output_path, 
> args.num_clusters_list)
>   File "clustering_app.py", line 136, in run_clustering
> clustersTable, cluster_Centers = clustering(sc, documents, output_path, 
> k, max_iter)
>   File "clustering_app.py", line 68, in clustering
> cluster_Centers = km_model.clusterCenters()
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py",
>  line 337, in clusterCenters
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py",
>  line 55, in _call_java
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py",
>  line 109, in _java2py
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling 
> z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf 
> spark.executor.memory=20g --conf spark.driver.memory=20g --conf 
> spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf 
> spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g 
> ~/clustering_app.py --input_path hdfs:///user/username/part-v001x 
> --output_path hdfs:///user/username --num_clusters_list 1
> {code}
> The input dataset is approximately 90 MB in size and the assigned heap memory 
> to both driver and executor is close to 20 GB. This only happens for large 
> values of k.



--
This message was sent 

[jira] [Resolved] (SPARK-26951) Should not throw KryoException when root cause is IOexception

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26951.
---
Resolution: Not A Problem

Agree, I don't see a reason to retry in this case

> Should not throw KryoException when root cause is IOexception
> -
>
> Key: SPARK-26951
> URL: https://issues.apache.org/jira/browse/SPARK-26951
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: zhoukang
>Priority: Major
>
> Job will fail with the below exception:
> {code:java}
> Job aborted due to stage failure: Task 1576 in stage 97.0 failed 4 times, 
> most recent failure: Lost task 1576.3 in stage 97.0 (TID 121949, xxx, 
> executor 14): com.esotericsoftware.kryo.KryoException: java.io.IOException: 
> Stream is corrupted. The lz4's magic number should be 
> LZ4Block(4c5a34426c6f636b), but received buffer's head bytes is 
> ().
> {code}
> {code:java}
> Job aborted due to stage failure: Task 1576 in stage 97.0 failed 4 times, 
> most recent failure: Lost task 1576.3 in stage 97.0 (TID 121949, xxx, 
> executor 14): com.esotericsoftware.kryo.KryoException: java.io.IOException: 
> Stream is corrupted. The lz4's magic number should be 
> LZ4Block(4c5a34426c6f636b), but received buffer's head bytes is 
> ().
>   at com.esotericsoftware.kryo.io.Input.fill(Input.java:166)
>   at com.esotericsoftware.kryo.io.Input.require(Input.java:196)
>   at com.esotericsoftware.kryo.io.Input.readVarInt(Input.java:373)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:127)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:244)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:180)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Stream is corrupted. The lz4's magic number 
> should be LZ4Block(4c5a34426c6f636b), but received buffer's head bytes is 
> ().
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:169)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:127)
>   at com.esotericsoftware.kryo.io.Input.fill(Input.java:164)
>   ... 19 more
> Driver stacktrace:
> {code}
> For IOException, it should retry



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26970:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Can't load PipelineModel that was created in Scala with Python due to missing 
> Interaction transformer
> -
>
> Key: SPARK-26970
> URL: https://issues.apache.org/jira/browse/SPARK-26970
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Andrew Crosby
>Priority: Minor
>
> The Interaction transformer 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]
>  is missing from the set of pyspark feature transformers 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]
>  
> This means that it is impossible to create a model that includes an 
> Interaction transformer with pyspark. It also means that attempting to load a 
> PipelineModel created in Scala that includes an Interaction transformer with 
> pyspark fails with the following error:
> {code:java}
> AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26972.
---
Resolution: Not A Problem

The option is "multiLine"
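
For reference, the same read sketched in PySpark with the option spelled as documented ("multiLine"), reusing the path and option values from the report below and assuming an existing SparkSession named spark:

{code:java}
df = (spark.read.format("csv")
        .option("header", True)
        .option("multiLine", True)
        .option("sep", ";")
        .option("quote", "*")
        .option("dateFormat", "M/d/y")
        .option("inferSchema", True)
        .load("data/books.csv"))
df.show(7)
df.printSchema()
{code}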

> Issue with CSV import and inferSchema set to true
> -
>
> Key: SPARK-26972
> URL: https://issues.apache.org/jira/browse/SPARK-26972
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.3, 2.3.3, 2.4.0
> Environment: Java 8/Scala 2.11/MacOs
>Reporter: Jean Georges Perrin
>Priority: Major
> Attachments: ComplexCsvToDataframeApp.java, 
> ComplexCsvToDataframeWithSchemaApp.java, books.csv, issue.txt, pom.xml
>
>
>  
> I found a few discrepancies while working with inferSchema set to true in CSV 
> ingestion.
> Given the following CSV in the attached books.csv:
> {noformat}
> id;authorId;title;releaseDate;link
> 1;1;Fantastic Beasts and Where to Find Them: The Original 
> Screenplay;11/18/16;http://amzn.to/2kup94P
> 2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
> Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP
> 3;1;*The Tales of Beedle the Bard, Standard Edition (Harry 
> Potter)*;12/4/08;http://amzn.to/2kYezqr
> 4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry 
> Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n
> 5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the 
> Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT
> 6;2;*Development Tools in 2006: any Room for a 4GL-style Language?
> An independent study by Jean Georges Perrin, IIUG Board 
> Member*;12/28/16;http://amzn.to/2vBxOe1
> 7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav
> 8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD
> 10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA
> 11;4;Diderot Encyclopedia: The Complete Illustrations 
> 1762-1777;;http://amzn.to/2i2zo3I
> 12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ
> 13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW
> 14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk
> 15;7;Soft Skills: The software developer's life 
> manual;12/29/14;http://amzn.to/2zNnSyn
> 16;8;Of Mice and Men;;http://amzn.to/2zJjXoc
> 17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
> programming*;8/28/14;http://amzn.to/2isdqoL
> 18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY
> 19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG
> 20;14;*Fables choisies; mises en vers par M. de La 
> Fontaine*;9/1/1999;http://amzn.to/2yRH10W
> 21;15;Discourse on Method and Meditations on First 
> Philosophy;6/15/1999;http://amzn.to/2hwB8zc
> 22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo
> 23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo{noformat}
> And this Java code:
> {code:java}
> Dataset df = spark.read().format("csv")
>  .option("header", "true")
>  .option("multiline", true)
>  .option("sep", ";")
>  .option("quote", "*")
>  .option("dateFormat", "M/d/y")
>  .option("inferSchema", true)
>  .load("data/books.csv");
> df.show(7);
> df.printSchema();
> {code}
> h1. In Spark v2.0.1
> Output: 
> {noformat}
> +---+++---++
> | id|authorId|   title|releaseDate|link|
> +---+++---++
> |  1|   1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
> |  2|   1|Harry Potter and ...|10/6/15|http://amzn.to/2l...|
> |  3|   1|The Tales of Beed...|12/4/08|http://amzn.to/2k...|
> |  4|   1|Harry Potter and ...|10/4/16|http://amzn.to/2k...|
> |  5|   2|Informix 12.10 on...|4/23/17|http://amzn.to/2i...|
> |  6|   2|Development Tools...|   12/28/16|http://amzn.to/2v...|
> |  7|   3|Adventures of Huc...|.   5/26/94|http://amzn.to/2w...|
> +---+++---++
> only showing top 7 rows
> Dataframe's schema:
> root
> |-- id: integer (nullable = true)
> |-- authorId: integer (nullable = true)
> |-- title: string (nullable = true)
> |-- releaseDate: string (nullable = true)
> |-- link: string (nullable = true)
> {noformat}
> *This is fine and the expected output*.
> h1. Using Apache Spark v2.1.3
> Excerpt of the dataframe content: 
> {noformat}
> +---+--------+--------------------+-----------+--------------------+
> | id|authorId|               title|releaseDate|                link|
> +---+--------+--------------------+-----------+--------------------+
> |  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
> |  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
> |  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
> |  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
> |  5|       2|Informix 

[jira] [Assigned] (SPARK-27026) Upgrade Docker image for release build to Ubuntu 18.04

2019-03-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27026:


Assignee: Apache Spark

> Upgrade Docker image for release build to Ubuntu 18.04
> --
>
> Key: SPARK-27026
> URL: https://issues.apache.org/jira/browse/SPARK-27026
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26980) Kryo deserialization not working with KryoSerializable class

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782072#comment-16782072
 ] 

Sean Owen commented on SPARK-26980:
---

This sounds like a Kryo usage question.
It could have something to do with the fact that Spark also uses Kryo and 
enables registration for certain classes; I'm not sure. Unless it were a 
problem with Spark's Kryo usage I'd close this.
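For context, a minimal Scala sketch (an assumption about the linked repo, not its actual code): when an Aggregator declares its buffer/output encoders via Encoders.kryo, Spark stores the aggregated value as a single BinaryType column, which would explain why collectAsList() surfaces a byte[] rather than the set itself.

{code:scala}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical stand-in for the Kotlin StringSet above.
class StringSet extends java.util.HashSet[String]

object CollectStrings extends Aggregator[String, StringSet, StringSet] {
  def zero: StringSet = new StringSet
  def reduce(b: StringSet, a: String): StringSet = { b.add(a); b }
  def merge(b1: StringSet, b2: StringSet): StringSet = { b1.addAll(b2); b1 }
  def finish(r: StringSet): StringSet = r
  // Kryo-backed encoders serialize the object into one binary column,
  // so the value shows up in a Row as a raw byte array.
  def bufferEncoder: Encoder[StringSet] = Encoders.kryo[StringSet]
  def outputEncoder: Encoder[StringSet] = Encoders.kryo[StringSet]
}
{code}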

> Kryo deserialization not working with KryoSerializable class
> 
>
> Key: SPARK-26980
> URL: https://issues.apache.org/jira/browse/SPARK-26980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local Spark v2.4.0
> Kotlin v1.3.21
>Reporter: Alexis Sarda-Espinosa
>Priority: Minor
>  Labels: kryo, serialization
>
> I'm trying to create an {{Aggregator}} that uses a custom container that 
> should be serialized with {{Kryo:}} 
> {code:java}
> class StringSet(other: Collection<String>) : HashSet<String>(other), 
> KryoSerializable {
> companion object {
> @JvmStatic
> private val serialVersionUID = 1L
> }
> constructor() : this(Collections.emptyList())
> override fun write(kryo: Kryo, output: Output) {
> output.writeInt(this.size)
> for (string in this) {
> output.writeString(string)
> }
> }
> override fun read(kryo: Kryo, input: Input) {
> val size = input.readInt()
> repeat(size) { this.add(input.readString()) }
> }
> }
> {code}
> However, if I look at the corresponding value in the {{Row}} after 
> aggregation (for example by using {{collectAsList()}}), I see a {{byte[]}}. 
> Interestingly, the first byte in that array seems to be some sort of noise, 
> and I can deserialize by doing something like this: 
> {code:java}
> val b = row.getAs<ByteArray>(2)
> val input = Input(b.copyOfRange(1, b.size)) // extra byte?
> val set = Kryo().readObject(input, StringSet::class.java)
> {code}
> Used configuration: 
> {code:java}
> SparkConf()
> .setAppName("Hello Spark with Kotlin")
> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> .set("spark.kryo.registrationRequired", "true")
> .registerKryoClasses(arrayOf(StringSet::class.java))
> {code}
> [Sample repo with all the 
> code|https://github.com/asardaes/hello-spark-kotlin/tree/8e8a54fd81f0412507318149841c69bb17d8572c].
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26982) Enhance describe framework to describe the output of a query

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26982:
--
Priority: Minor  (was: Major)

> Enhance describe framework to describe the output of a query
> 
>
> Key: SPARK-26982
> URL: https://issues.apache.org/jira/browse/SPARK-26982
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Priority: Minor
>
> Currently we can use `df.printSchema` to discover the schema information for 
> a query. We should have a way to describe the output schema of a query using 
> the SQL interface. 
>  
> Example:
> DESCRIBE SELECT * FROM desc_table
> DESCRIBE QUERY SELECT * FROM desc_table
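For illustration, a minimal sketch of today's workaround (assuming a spark-shell session where {{spark}} is in scope and a table named desc_table, as in the example above): the output schema of an arbitrary query is currently only reachable through the DataFrame API, which is what the proposed DESCRIBE QUERY would expose directly in SQL.

{code:scala}
// Current workaround: go through a DataFrame and print its schema.
val df = spark.sql("SELECT * FROM desc_table")
df.printSchema()
// The proposal would surface the same information purely in SQL, e.g.
//   DESCRIBE QUERY SELECT * FROM desc_table
{code}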



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26983) Spark PassThroughSuite,ColumnVectorSuite failure on bigendian

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26983:
--
Target Version/s:   (was: 2.3.2)
Priority: Minor  (was: Major)
   Fix Version/s: (was: 2.3.2)

Don't set Fix or Target version please.

I'm not sure how well big-endian is supported here at all, though some parts of the 
code try to accommodate it. I doubt this would be resolved unless you open a PR.

> Spark PassThroughSuite,ColumnVectorSuite failure on bigendian
> -
>
> Key: SPARK-26983
> URL: https://issues.apache.org/jira/browse/SPARK-26983
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: salamani
>Priority: Minor
>
> The following failures are observed in Spark Project SQL on a big-endian system:
> PassThroughSuite :
>  - PassThrough with FLOAT: empty column for decompress()
>  - PassThrough with FLOAT: long random series for decompress() *** FAILED ***
>  Expected 0.10990685, but got -6.6357654E14 Wrong 0-th decoded float value 
> (PassThroughEncodingSuite.scala:146)
>  - PassThrough with FLOAT: simple case with null for decompress() *** FAILED 
> ***
>  Expected 2.0, but got 9.0E-44 Wrong 0-th decoded float value 
> (PassThroughEncodingSuite.scala:146)
>  - PassThrough with DOUBLE: empty column
>  - PassThrough with DOUBLE: long random series
>  - PassThrough with DOUBLE: empty column for decompress()
>  - PassThrough with DOUBLE: long random series for decompress() *** FAILED ***
>  Expected 0.20634564007984624, but got 5.902392643940031E-230 Wrong 0-th 
> decoded double value (PassThroughEncodingSuite.scala:150)
>  - PassThrough with DOUBLE: simple case with null for decompress() *** FAILED 
> ***
>  Expected 2.0, but got 3.16E-322 Wrong 0-th decoded double value 
> (PassThroughEncodingSuite.scala:150)
>  Run completed in 9 seconds, 72 milliseconds.
>  Total number of tests run: 30
>  Suites: completed 2, aborted 0
>  Tests: succeeded 26, failed 4, canceled 0, ignored 0, pending 0
>  ** 
>  *** 4 TESTS FAILED ***
>  
> ColumnVectorSuite:
>  - CachedBatch long Apis
>  - CachedBatch float Apis *** FAILED ***
>  4.6006E-41 did not equal 1.0 (ColumnVectorSuite.scala:378)
>  - CachedBatch double Apis *** FAILED ***
>  3.03865E-319 did not equal 1.0 (ColumnVectorSuite.scala:402)
>  Run completed in 8 seconds, 183 milliseconds.
>  Total number of tests run: 21
>  Suites: completed 2, aborted 0
>  Tests: succeeded 19, failed 2, canceled 0, ignored 0, pending 0
>  ** 
>  *** 2 TESTS FAILED ***
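For context, a generic Scala illustration of the usual cause of failures like these (not a reproduction of the exact values reported above): decoding a float with the wrong byte order yields a nonsense value, which is the typical signature of an endianness mismatch in a columnar codec.

{code:scala}
import java.nio.{ByteBuffer, ByteOrder}

val buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN)
buf.putFloat(0.10990685f)
buf.flip()
// Reading the same bytes back with the opposite byte order:
val wrong = buf.order(ByteOrder.BIG_ENDIAN).getFloat()
println(wrong)  // prints a nonsense value instead of 0.10990685
{code}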



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26985.
---
Resolution: Not A Problem

Same as SPARK-26940; until it's reproducible on a standard JDK, I don't think 
it's at all clear it's not due to this custom JDK implementation. I don't see 
evidence it has to do with endian-ness.

> Test "access only some column of the all of columns " fails on big endian
> -
>
> Key: SPARK-26985
> URL: https://issues.apache.org/jira/browse/SPARK-26985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Linux Ubuntu 16.04 
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed 
> References 20190205_218 (JIT enabled, AOT enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>  
>Reporter: Anuja Jakhade
>Priority: Major
>  Labels: BigEndian
> Attachments: DataFrameTungstenSuite.txt, 
> InMemoryColumnarQuerySuite.txt, access only some column of the all of 
> columns.txt
>
>
> While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am 
> observing test failures for 2 suites of Project SQL:
>  1. InMemoryColumnarQuerySuite
>  2. DataFrameTungstenSuite
>  In both cases the test "access only some column of the all of columns" fails 
> due to a mismatch in the final assert.
> The data obtained after df.cache() appears to be causing the error. Please 
> find the log with the details attached. 
> cache() works perfectly fine if double and float values are not involved.
> Inside test !!- access only some column of the all of columns *** FAILED 
> ***



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26991) Investigate difference of `returnNullable` between ScalaReflection.deserializerFor and JavaTypeInference.deserializerFor

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782062#comment-16782062
 ] 

Sean Owen commented on SPARK-26991:
---

I'm not sure what the outcome would be here; let's open JIRAs for specific 
issues only.

> Investigate difference of `returnNullable` between 
> ScalaReflection.deserializerFor and JavaTypeInference.deserializerFor
> 
>
> Key: SPARK-26991
> URL: https://issues.apache.org/jira/browse/SPARK-26991
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue tracks the investigation of the difference between 
> ScalaReflection.deserializerFor and JavaTypeInference.deserializerFor, 
> especially the reason why the Java side uses `returnNullable = true` whereas 
> the Scala side uses `returnNullable = false`.
> The origin discussion is linked here:
> https://github.com/apache/spark/pull/23854#discussion_r260117702



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27026) Upgrade Docker image for release build to Ubuntu 18.04

2019-03-01 Thread DB Tsai (JIRA)
DB Tsai created SPARK-27026:
---

 Summary: Upgrade Docker image for release build to Ubuntu 18.04
 Key: SPARK-27026
 URL: https://issues.apache.org/jira/browse/SPARK-27026
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.0
Reporter: DB Tsai
 Fix For: 3.0.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27011) reset command fails after cache table

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-27011:
--
Priority: Minor  (was: Critical)

> reset command fails after cache table
> -
>
> Key: SPARK-27011
> URL: https://issues.apache.org/jira/browse/SPARK-27011
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Ajith S
>Priority: Minor
>
>  
> h3. Commands to reproduce 
> spark-sql> create table abcde ( a int);
> spark-sql> reset; // can work success
> spark-sql> cache table abcde;
> spark-sql> reset; //fails with exception
> h3. Below is the stack
> {{org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree:}}
> {{ResetCommand$}}{{at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)}}
> {{ at 
> org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:379)}}
> {{ at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:216)}}
> {{ at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:211)}}
> {{ at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.sameResult(QueryPlan.scala:259)}}
> {{ at 
> org.apache.spark.sql.execution.CacheManager.$anonfun$lookupCachedData$3(CacheManager.scala:236)}}
> {{ at 
> org.apache.spark.sql.execution.CacheManager.$anonfun$lookupCachedData$3$adapted(CacheManager.scala:236)}}
> {{ at scala.collection.Iterator.find(Iterator.scala:993)}}
> {{ at scala.collection.Iterator.find$(Iterator.scala:990)}}
> {{ at scala.collection.AbstractIterator.find(Iterator.scala:1429)}}
> {{ at scala.collection.IterableLike.find(IterableLike.scala:81)}}
> {{ at scala.collection.IterableLike.find$(IterableLike.scala:80)}}
> {{ at scala.collection.AbstractIterable.find(Iterable.scala:56)}}
> {{ at 
> org.apache.spark.sql.execution.CacheManager.$anonfun$lookupCachedData$2(CacheManager.scala:236)}}
> {{ at 
> org.apache.spark.sql.execution.CacheManager.readLock(CacheManager.scala:59)}}
> {{ at 
> org.apache.spark.sql.execution.CacheManager.lookupCachedData(CacheManager.scala:236)}}
> {{ at 
> org.apache.spark.sql.execution.CacheManager$$anonfun$1.applyOrElse(CacheManager.scala:250)}}
> {{ at 
> org.apache.spark.sql.execution.CacheManager$$anonfun$1.applyOrElse(CacheManager.scala:241)}}
> {{ at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:258)}}
> {{ at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)}}
> {{ at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)}}
> {{ at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)}}
> {{ at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)}}
> {{ at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)}}
> {{ at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)}}
> {{ at 
> org.apache.spark.sql.execution.CacheManager.useCachedData(CacheManager.scala:241)}}
> {{ at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:68)}}
> {{ at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:65)}}
> {{ at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:72)}}
> {{ at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)}}
> {{ at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:72)}}
> {{ at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:71)}}
> {{ at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$writePlans$4(QueryExecution.scala:139)}}
> {{ at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$.append(QueryPlan.scala:316)}}
> {{ at 
> org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$writePlans(QueryExecution.scala:139)}}
> {{ at 
> org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:146)}}
> {{ at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:82)}}
> {{ at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)}}
> {{ at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)}}
> {{ at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3346)}}
> {{ at org.apache.spark.sql.Dataset.(Dataset.scala:203)}}
> {{ at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)}}
> {{ 

[jira] [Commented] (SPARK-27024) Design executor interface to support GPU resources

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782060#comment-16782060
 ] 

Sean Owen commented on SPARK-27024:
---

Is this different from SPARK-27005?

> Design executor interface to support GPU resources
> --
>
> Key: SPARK-27024
> URL: https://issues.apache.org/jira/browse/SPARK-27024
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> The executor interface shall deal with the resources allocated to the 
> executor by cluster managers (Standalone, YARN, Kubernetes), so the Spark 
> executor doesn't need to be involved in GPU discovery and allocation, which 
> shall be handled by the cluster managers. However, an executor needs to sync with 
> the driver to expose available resources to support task scheduling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27015) spark-submit does not properly escape arguments sent to Mesos dispatcher

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782059#comment-16782059
 ] 

Sean Owen commented on SPARK-27015:
---

Sure, open a PR to escape the args as needed.

> spark-submit does not properly escape arguments sent to Mesos dispatcher
> 
>
> Key: SPARK-27015
> URL: https://issues.apache.org/jira/browse/SPARK-27015
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
>
> Arguments sent to the dispatcher must be escaped; for instance,
> {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a 
> b$c"{noformat}
> fails, and instead must be submitted as
> {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a\\ 
> b\\$c"{noformat}
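A minimal sketch of one common fix (an assumption, not Spark's actual implementation): single-quote each argument forwarded to the dispatcher and escape embedded single quotes, so values like {{a b$c}} survive the extra round of shell evaluation.

{code:scala}
// Hypothetical helper: POSIX-style single-quoting of a forwarded argument.
def shellEscape(arg: String): String =
  "'" + arg.replace("'", "'\\''") + "'"

// e.g. shellEscape("a b$c") returns the literal 'a b$c'
{code}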



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27014) Support removal of jars and Spark binaries from Mesos driver and executor sandboxes

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-27014:
--
Priority: Minor  (was: Major)

> Support removal of jars and Spark binaries from Mesos driver and executor 
> sandboxes
> ---
>
> Key: SPARK-27014
> URL: https://issues.apache.org/jira/browse/SPARK-27014
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.5.0
>Reporter: Martin Loncaric
>Priority: Minor
> Fix For: 2.5.0, 3.0.0
>
>
> Currently, each Spark application run on Mesos leaves behind at least 500MB 
> of data in sandbox directories, coming from Spark binaries and copied URIs. 
> These can build up as a disk leak, causing major issues on Mesos clusters 
> unless their grace period for sandbox directories is very short.
> Spark should have a feature to delete these (from both driver and executor 
> sandboxes) on teardown.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27025) Speed up toLocalIterator

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782051#comment-16782051
 ] 

Sean Owen commented on SPARK-27025:
---

If you fetched it all at once proactively, you would have another problem: what if it 
doesn't fit on the driver? The use case for toLocalIterator() is probably 
exactly to avoid this.

> Speed up toLocalIterator
> 
>
> Key: SPARK-27025
> URL: https://issues.apache.org/jira/browse/SPARK-27025
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.3.3
>Reporter: Erik van Oosten
>Priority: Major
>
> Method {{toLocalIterator}} fetches the partitions to the driver one by one. 
> However, as far as I can see, the computation required for a 
> yet-to-be-fetched partition is not kicked off until that partition is fetched. 
> Effectively only one partition is being computed at a time. 
> Desired behavior: immediately start computation of all partitions while 
> retaining the download-one-partition-at-a-time behavior.
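A minimal sketch of the usual workaround today (assuming the data fits in the cluster's memory/disk storage, not the driver, and with {{rdd}} and {{process}} standing in for the real dataset and per-record logic): materialize all partitions up front, then stream them to the driver one at a time.

{code:scala}
import org.apache.spark.storage.StorageLevel

val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()                            // forces computation of every partition now
cached.toLocalIterator.foreach(process)   // still fetches one partition at a time
cached.unpersist()
{code}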



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26326) Cannot save a NaiveBayesModel with 48685 features and 5453 labels

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26326:
--
Priority: Minor  (was: Major)

Yeah, this means you have a model with about 265M parameters, which, when 
serialized as an array of bytes, (barely) exceeds 2GB. I think 
reimplementing this under the hood is possible, but it may call into question 
whether this is a realistic use case for naive Bayes.
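A back-of-the-envelope check of that estimate (my own arithmetic from the reported dimensions, assuming one 8-byte double per parameter):

{code:scala}
val features = 48685L
val labels   = 5453L
val params   = features * labels   // ≈ 265.5 million coefficients
val bytes    = params * 8L         // ≈ 2.12e9 bytes for the theta matrix alone, which
                                   // (with the array header) trips the ~2GB
                                   // UnsafeArrayData limit seen in the stack trace
{code}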

> Cannot save a NaiveBayesModel with 48685 features and 5453 labels
> -
>
> Key: SPARK-26326
> URL: https://issues.apache.org/jira/browse/SPARK-26326
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Markus Paaso
>Priority: Minor
>
> When executing
> {code:java}
> model.write().overwrite().save("/tmp/mymodel"){code}
> The error occurs
> {code:java}
> java.lang.UnsupportedOperationException: Cannot convert this array to unsafe 
> format as it's too big.
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.fromPrimitiveArray(UnsafeArrayData.java:457)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.fromPrimitiveArray(UnsafeArrayData.java:524)
> at org.apache.spark.ml.linalg.MatrixUDT.serialize(MatrixUDT.scala:66)
> at org.apache.spark.ml.linalg.MatrixUDT.serialize(MatrixUDT.scala:28)
> at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:143)
> at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
> at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:258)
> at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238)
> at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
> at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$.$anonfun$createToCatalystConverter$2(CatalystTypeConverters.scala:396)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LocalRelation$.$anonfun$fromProduct$1(LocalRelation.scala:43)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233)
> at scala.collection.immutable.List.foreach(List.scala:388)
> at scala.collection.TraversableLike.map(TraversableLike.scala:233)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:226)
> at scala.collection.immutable.List.map(List.scala:294)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LocalRelation$.fromProduct(LocalRelation.scala:43)
> at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:315)
> at 
> org.apache.spark.ml.classification.NaiveBayesModel$NaiveBayesModelWriter.saveImpl(NaiveBayes.scala:393)
> at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:180)
> {code}
> Data file to reproduce the problem: 
> [https://github.com/make/spark-26326-files/raw/master/data.libsvm]
> Code to reproduce the problem:
> {code:java}
> import org.apache.spark.ml.classification.NaiveBayes
> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
> // Load the data stored in LIBSVM format as a DataFrame.
> val data = spark.read.format("libsvm").load("/tmp/data.libsvm")
> // Train a NaiveBayes model.
> val model = new NaiveBayes().fit(data)
> model.write().overwrite().save("/tmp/mymodel"){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26357) Expose executors' procfs metrics to Metrics system

2019-03-01 Thread Reza Safi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782046#comment-16782046
 ] 

Reza Safi commented on SPARK-26357:
---

I think the primary use case is for users who depend on codahale / 
dropwizard metrics.

> Expose executors' procfs metrics to Metrics system
> --
>
> Key: SPARK-26357
> URL: https://issues.apache.org/jira/browse/SPARK-26357
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Reza Safi
>Priority: Major
>
> It would be good to make the procfs metrics from SPARK-24958 visible to 
> the Metrics system. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26373) Spark UI 'environment' tab - column to indicate default vs overridden values

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26373:
--
Priority: Minor  (was: Major)

> Spark UI 'environment' tab - column to indicate default vs overridden values
> 
>
> Key: SPARK-26373
> URL: https://issues.apache.org/jira/browse/SPARK-26373
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: t oo
>Priority: Minor
>
> Rather than just showing the name and value for each property, a new column would 
> also indicate whether the value is the default (show 'AS PER DEFAULT') or whether it is 
> overridden (show the actual default value).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26347) MergeAggregate serialize and deserialize functions can use ByteBuffer to optimize

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26347.
---
Resolution: Won't Fix

> MergeAggregate serialize and deserialize functions can use ByteBuffer to 
> optimize
> 
>
> Key: SPARK-26347
> URL: https://issues.apache.org/jira/browse/SPARK-26347
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: lichaoqun
>Priority: Minor
>
> MergeAggregate serialize and deserialize functions can use ByteBuffer to 
> optimize



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26357) Expose executors' procfs metrics to Metrics system

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782039#comment-16782039
 ] 

Sean Owen commented on SPARK-26357:
---

What is the use case for this?

> Expose executors' procfs metrics to Metrics system
> --
>
> Key: SPARK-26357
> URL: https://issues.apache.org/jira/browse/SPARK-26357
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Reza Safi
>Priority: Major
>
> It would be good to make the procfs metrics from SPARK-24958 visible to 
> the Metrics system. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26358) Spark deployed mode question

2019-03-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26358.
---
Resolution: Invalid

There isn't enough detail here. It can be reopened if you can, ideally, provide 
an example.

> Spark deployed mode question
> 
>
> Key: SPARK-26358
> URL: https://issues.apache.org/jira/browse/SPARK-26358
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.0
> Environment: spark2.3.0
> hadoop2.7.3
>Reporter: Si Chen
>Priority: Major
> Attachments: sparkbug.jpg
>
>
> I submit my job in yarn-client mode. When I do not visit the application web 
> UI and an executor hits an exception, the application exits. But if I have visited the 
> application web UI and an executor hits an exception, the application does not exit!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26997) k8s integration tests failing after client upgraded to 4.1.2

2019-03-01 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26997.

  Resolution: Fixed
Target Version/s: 3.0.0

I reverted the client upgrade and re-opened the original bug, so let's keep the 
discussion there.

> k8s integration tests failing after client upgraded to 4.1.2
> 
>
> Key: SPARK-26997
> URL: https://issues.apache.org/jira/browse/SPARK-26997
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> SPARK-26742 upgraded the client libs to version 4.1.2, and that doesn't seem 
> to agree well with the minikube we're using in jenkins. My PRs are failing 
> (minikube 0.25):
> {noformat}
> 19/02/25 17:46:52.599 ScalaTest-main-running-KubernetesSuite INFO 
> ProcessUtils: 19/02/25 17:46:52 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-3007689c-e3ca-48f5-a673-f3bad5c4774a
> 19/02/25 17:46:52.788 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:500. Message:container not found 
> ("spark-kubernetes-driver")
> java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal 
> Server Error'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/02/25 17:46:52.999 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:404. Message:404 page not found
> java.net.ProtocolException: Expected HTTP 101 response but was '404 Not Found'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Tests pass on my local minikube (0.34). Reverting that change makes them pass 
> on jenkins (see https://github.com/apache/spark/pull/23893).
> Not sure if this is a client bug or a compatibility issue.
> [~shaneknapp] [~skonto]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26048) Flume connector for Spark 2.4 does not exist in Maven repository

2019-03-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26048:


Assignee: (was: Apache Spark)

> Flume connector for Spark 2.4 does not exist in Maven repository
> 
>
> Key: SPARK-26048
> URL: https://issues.apache.org/jira/browse/SPARK-26048
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Aki Tanaka
>Priority: Major
>
> Flume connector for Spark 2.4 does not exist in the Maven repository.
> [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume]
>  
> [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume-sink]
> These packages will be removed in Spark 3. But the Spark 2.4 branch still has 
> these packages.
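For reference, the coordinate users would expect to resolve for 2.4.0 (a hypothetical sbt line, mirroring the artifact naming published for 2.3.x):

{code:scala}
// Resolves for 2.3.x, but no 2.4.0 artifact was published at the time of this report.
libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % "2.4.0"
{code}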



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26048) Flume connector for Spark 2.4 does not exist in Maven repository

2019-03-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26048:


Assignee: Apache Spark

> Flume connector for Spark 2.4 does not exist in Maven repository
> 
>
> Key: SPARK-26048
> URL: https://issues.apache.org/jira/browse/SPARK-26048
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Aki Tanaka
>Assignee: Apache Spark
>Priority: Major
>
> Flume connector for Spark 2.4 does not exist in the Maven repository.
> [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume]
>  
> [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume-sink]
> These packages will be removed in Spark 3. But the Spark 2.4 branch still has 
> these packages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


