[jira] [Resolved] (SPARK-15685) StackOverflowError (VirtualMachineError) or NoClassDefFoundError (LinkageError) should not System.exit() in local mode

2017-11-11 Thread Brett Randall (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brett Randall resolved SPARK-15685.
---
   Resolution: Fixed
Fix Version/s: 2.0.2

> StackOverflowError (VirtualMachineError) or NoClassDefFoundError 
> (LinkageError) should not System.exit() in local mode
> --
>
> Key: SPARK-15685
> URL: https://issues.apache.org/jira/browse/SPARK-15685
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Brett Randall
> Fix For: 2.0.2
>
>
> Spark, when running in local mode, can encounter certain types of {{Error}} 
> exceptions in developer code or third-party libraries and call 
> {{System.exit()}}, potentially killing a long-running JVM/service.  The 
> caller should decide on the exception handling and whether the error should 
> be deemed fatal.
> *Consider this scenario:*
> * Spark is being used in local master mode within a long-running JVM 
> microservice, e.g. a Jetty instance.
> * A task is run.  The task errors with particular types of unchecked 
> throwables:
> ** a) there is some bad code and/or bad data that exposes a bug involving 
> unterminated recursion, leading to a {{StackOverflowError}}, or
> ** b) a particular, rarely used function is called - there's a packaging 
> error with the service or a third-party library is missing some dependencies, 
> and a {{NoClassDefFoundError}} is thrown.
> *Expected behaviour:* Since we are running in local mode, we might expect 
> an unchecked exception to be thrown, to be optionally handled by the Spark 
> caller.  In the case of Jetty, a request thread or some other background 
> worker thread might or might not handle the exception; the thread might exit 
> or note an error.  The caller should decide how the error is handled.
> *Actual behaviour:* {{System.exit()}} is called, the JVM exits and the Jetty 
> microservice is down and must be restarted.
> *Consequence:* Any local code or third-party library might cause Spark to 
> exit the long-running JVM/microservice, so Spark can be a problem in this 
> architecture.  I have seen this now on three separate occasions, leading to 
> service-down bug reports.
> *Analysis:*
> The line of code that seems to be the problem is: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L405
> {code}
> // Don't forcibly exit unless the exception was inherently fatal, to avoid
> // stopping other tasks unnecessarily.
> if (Utils.isFatalError(t)) {
>   SparkUncaughtExceptionHandler.uncaughtException(t)
> }
> {code}
> [Utils.isFatalError()|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1818]
>  first defers to Scala 
> [NonFatal|https://github.com/scala/scala/blob/2.12.x/src/library/scala/util/control/NonFatal.scala#L31],
>  which treats everything as non-fatal except {{VirtualMachineError}}, {{ThreadDeath}}, 
> {{InterruptedException}}, {{LinkageError}} and {{ControlThrowable}}.  
> {{Utils.isFatalError()}} additionally treats {{InterruptedException}}, 
> {{NotImplementedError}} and {{ControlThrowable}} as non-fatal.
> Remaining are {{Error}}s such as {{StackOverflowError extends 
> VirtualMachineError}} or {{NoClassDefFoundError extends LinkageError}}, which 
> occur in the aforementioned scenarios.  
> {{SparkUncaughtExceptionHandler.uncaughtException()}} then proceeds to call 
> {{System.exit()}}.
> [Further 
> up|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L77]
>  in {{Executor}} we see that {{SparkUncaughtExceptionHandler}} is only 
> registered when not in local mode:
> {code}
>   if (!isLocal) {
> // Setup an uncaught exception handler for non-local mode.
> // Make any thread terminations due to uncaught exceptions kill the entire
> // executor process to avoid surprising stalls.
> Thread.setDefaultUncaughtExceptionHandler(SparkUncaughtExceptionHandler)
>   }
> {code}
> This same exclusion must be applied in local mode for "fatal" errors - we 
> cannot afford to shut down the enclosing JVM (e.g. Jetty); the caller should 
> decide.
> A minimal test-case is supplied.  It installs a logging {{SecurityManager}} 
> to confirm that {{System.exit()}} was called from 
> {{SparkUncaughtExceptionHandler.uncaughtException}} via {{Executor}}.  It 
> also hints at the workaround - install your own {{SecurityManager}} and 
> inspect the current stack in {{checkExit()}} to prevent Spark from exiting 
> the JVM.
> Test-case: https://github.com/javabrett/SPARK-15685 .
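A minimal sketch of the {{SecurityManager}} workaround described above (illustrative only - {{SparkExitGuard}} is a hypothetical helper, not part of the supplied test-case). It vetoes exits that originate from {{SparkUncaughtExceptionHandler}} so the enclosing JVM (e.g. Jetty) stays up:

{code}
import java.security.Permission

object SparkExitGuard {
  def install(): Unit = {
    val previous = System.getSecurityManager
    System.setSecurityManager(new SecurityManager {
      override def checkExit(status: Int): Unit = {
        // Block the exit only when it comes from Spark's uncaught-exception
        // handler; everything else may still call System.exit() normally.
        val fromSparkHandler = Thread.currentThread().getStackTrace.exists(
          _.getClassName.contains("SparkUncaughtExceptionHandler"))
        if (fromSparkHandler) {
          throw new SecurityException(
            "Blocked System.exit() requested by SparkUncaughtExceptionHandler")
        }
      }
      // Defer all other permission checks to any previously installed manager.
      override def checkPermission(perm: Permission): Unit = {
        if (previous != null) previous.checkPermission(perm)
      }
    })
  }
}
{code}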




[jira] [Commented] (SPARK-15685) StackOverflowError (VirtualMachineError) or NoClassDefFoundError (LinkageError) should not System.exit() in local mode

2017-11-11 Thread Brett Randall (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248776#comment-16248776
 ] 

Brett Randall commented on SPARK-15685:
---

[~srowen] this seems to be fixed in 2.0.2, but I don't know which commit is 
responsible.

> StackOverflowError (VirtualMachineError) or NoClassDefFoundError 
> (LinkageError) should not System.exit() in local mode
> --
>
> Key: SPARK-15685
> URL: https://issues.apache.org/jira/browse/SPARK-15685
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Brett Randall
>





[jira] [Commented] (SPARK-22042) ReorderJoinPredicates can break when child's partitioning is not decided

2017-11-11 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248745#comment-16248745
 ] 

Tejas Patil commented on SPARK-22042:
-

I am trying out the suggestion discussed in 
https://github.com/apache/spark/pull/19257#issuecomment-331069250
Here is an unpolished but working PR: 
https://github.com/apache/spark/pull/19725 (I will polish it after observing the 
test case results).

> ReorderJoinPredicates can break when child's partitioning is not decided
> 
>
> Key: SPARK-22042
> URL: https://issues.apache.org/jira/browse/SPARK-22042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Tejas Patil
>
> When `ReorderJoinPredicates` tries to get the `outputPartitioning` of its 
> children, the children may not be properly constructed, as the child subtree 
> still has to go through other planner rules.
> In this particular case, the child is `SortMergeJoinExec`. Since the required 
> `Exchange` operators are not in place (because `EnsureRequirements` runs 
> _after_ `ReorderJoinPredicates`), the join's children would not have 
> partitioning defined. This breaks while creating the `PartitioningCollection` 
> here: 
> https://github.com/apache/spark/blob/94439997d57875838a8283c543f9b44705d3a503/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L69
> Small repro:
> {noformat}
> context.sql("SET spark.sql.autoBroadcastJoinThreshold=0")
> val df = (0 until 50).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", 
> "k")
> df.write.format("parquet").saveAsTable("table1")
> df.write.format("parquet").saveAsTable("table2")
> df.write.format("parquet").bucketBy(8, "j", "k").saveAsTable("bucketed_table")
> sql("""
>   SELECT *
>   FROM (
> SELECT a.i, a.j, a.k
> FROM bucketed_table a
> JOIN table1 b
> ON a.i = b.i
>   ) c
>   JOIN table2
>   ON c.i = table2.i
> """).explain
> {noformat}
> This fails with :
> {noformat}
> java.lang.IllegalArgumentException: requirement failed: 
> PartitioningCollection requires all of its partitionings have the same 
> numPartitions.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.physical.PartitioningCollection.(partitioning.scala:324)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputPartitioning(SortMergeJoinExec.scala:69)
>   at 
> org.apache.spark.sql.execution.ProjectExec.outputPartitioning(basicPhysicalOperators.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:91)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:76)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:76)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:34)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:147)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
>   at 
> 

[jira] [Commented] (SPARK-22042) ReorderJoinPredicates can break when child's partitioning is not decided

2017-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248743#comment-16248743
 ] 

Apache Spark commented on SPARK-22042:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/19725

> ReorderJoinPredicates can break when child's partitioning is not decided
> 
>
> Key: SPARK-22042
> URL: https://issues.apache.org/jira/browse/SPARK-22042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Tejas Patil
>

[jira] [Resolved] (SPARK-8824) Support Parquet time related logical types

2017-11-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-8824.

Resolution: Fixed

> Support Parquet time related logical types
> --
>
> Key: SPARK-8824
> URL: https://issues.apache.org/jira/browse/SPARK-8824
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>







[jira] [Resolved] (SPARK-10365) Support Parquet logical type TIMESTAMP_MICROS

2017-11-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-10365.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19702
[https://github.com/apache/spark/pull/19702]

> Support Parquet logical type TIMESTAMP_MICROS
> -
>
> Key: SPARK-10365
> URL: https://issues.apache.org/jira/browse/SPARK-10365
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>
> Didn't assign a target version for this ticket because neither the most recent 
> parquet-mr release (1.8.1) nor the master branch supports 
> {{TIMESTAMP_MICROS}} yet.
> It would be nice to map Spark SQL {{TimestampType}} to {{TIMESTAMP_MICROS}}, 
> since Parquet {{INT96}} will probably be deprecated and is only used for 
> compatibility reasons for now.
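A minimal sketch of how the new mapping could be exercised once supported (assuming the SQL config added for this in 2.3.0 is named {{spark.sql.parquet.outputTimestampType}}; {{INT96}} remains the default for compatibility):

{code}
// Assumes an existing SparkSession named `spark`; the config name and value
// are assumptions based on the 2.3.0 change.
import spark.implicits._

spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
Seq(java.sql.Timestamp.valueOf("2017-11-11 12:34:56.789012"))
  .toDF("ts")
  .write.mode("overwrite").parquet("/tmp/ts_micros")
{code}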






[jira] [Assigned] (SPARK-10365) Support Parquet logical type TIMESTAMP_MICROS

2017-11-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-10365:
---

Assignee: Wenchen Fan

> Support Parquet logical type TIMESTAMP_MICROS
> -
>
> Key: SPARK-10365
> URL: https://issues.apache.org/jira/browse/SPARK-10365
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>






[jira] [Updated] (SPARK-15685) StackOverflowError (VirtualMachineError) or NoClassDefFoundError (LinkageError) should not System.exit() in local mode

2017-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15685:
--
 Priority: Major  (was: Critical)
Fix Version/s: (was: 2.0.2)

> StackOverflowError (VirtualMachineError) or NoClassDefFoundError 
> (LinkageError) should not System.exit() in local mode
> --
>
> Key: SPARK-15685
> URL: https://issues.apache.org/jira/browse/SPARK-15685
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Brett Randall
>




[jira] [Updated] (SPARK-15685) StackOverflowError (VirtualMachineError) or NoClassDefFoundError (LinkageError) should not System.exit() in local mode

2017-11-11 Thread Brett Randall (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brett Randall updated SPARK-15685:
--
Fix Version/s: 2.0.2

> StackOverflowError (VirtualMachineError) or NoClassDefFoundError 
> (LinkageError) should not System.exit() in local mode
> --
>
> Key: SPARK-15685
> URL: https://issues.apache.org/jira/browse/SPARK-15685
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Brett Randall
>Priority: Critical
> Fix For: 2.0.2
>
>




[jira] [Commented] (SPARK-22464) <=> is not supported by Hive metastore partition predicate pushdown

2017-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248635#comment-16248635
 ] 

Apache Spark commented on SPARK-22464:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/19724

> <=> is not supported by Hive metastore partition predicate pushdown
> ---
>
> Key: SPARK-22464
> URL: https://issues.apache.org/jira/browse/SPARK-22464
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> <=> is not supported by Hive metastore partition predicate pushdown. We 
> should forbid it. 
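For reference, {{<=>}} is Spark SQL's null-safe equality operator, so a partition predicate like the (made-up) one below cannot be pushed down to the Hive metastore and should be filtered out of the pushed predicates rather than passed through:

{code}
// Illustrative only: `partitioned_table` and `part_col` are placeholder names.
spark.sql("SELECT * FROM partitioned_table WHERE part_col <=> 1").show()
{code}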






[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-11-11 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248628#comment-16248628
 ] 

Timothy Hunter commented on SPARK-21866:


[~josephkb] if I am not mistaken, the image code is implemented in the 
{{mllib}} package, which depends on {{sql}}. Meanwhile, the data source API is 
implemented in {{sql}}, and if we want it to have some image-specific source, 
like we do for csv or json, we will need to depend on {{mllib}}. This 
dependency should not happen, first because it introduces a circular dependency 
(causing compile-time issues), and second because {{sql}} (one of the core 
modules) should not depend on {{mllib}}, which is large and not related to SQL.

[~rxin] suggested that we add a runtime dependency using reflection instead, 
and I am keen on making that change in a second pull request. What are your 
thoughts?
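A rough sketch of what such a reflection-based runtime lookup could look like (the class name below is purely hypothetical; the real names would be settled in that follow-up pull request):

{code}
// Hypothetical: sql discovers an image-related class provided by mllib at
// runtime instead of referencing it at compile time, avoiding the circular
// compile-time dependency described above.
val imageSourceClass: Option[Class[_]] =
  try {
    Some(Class.forName("org.apache.spark.ml.image.ImageDataSource"))
  } catch {
    case _: ClassNotFoundException => None // mllib is not on the classpath
  }

imageSourceClass.foreach { cls =>
  val instance = cls.getDeclaredConstructor().newInstance()
  // ... register `instance` with the data source machinery ...
}
{code}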

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * 
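A rough sketch of what such an image struct could look like (field names and types here are assumptions for illustration, not the finalized specification from the SPIP):

{code}
import org.apache.spark.sql.types._

// Assumed layout of a single image row (one struct column named "image").
val imageSchema = StructType(Seq(
  StructField("origin", StringType, nullable = true),      // source URI of the image
  StructField("height", IntegerType, nullable = false),    // height in pixels
  StructField("width", IntegerType, nullable = false),     // width in pixels
  StructField("nChannels", IntegerType, nullable = false), // number of colour channels
  StructField("mode", IntegerType, nullable = false),      // OpenCV-compatible type tag
  StructField("data", BinaryType, nullable = false)        // decompressed pixel bytes
))
{code}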

[jira] [Commented] (SPARK-22488) The view resolution in the SparkSession internal table() API

2017-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248618#comment-16248618
 ] 

Apache Spark commented on SPARK-22488:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/19723

> The view resolution in the SparkSession internal table() API 
> -
>
> Key: SPARK-22488
> URL: https://issues.apache.org/jira/browse/SPARK-22488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> The current internal `table()` API of `SparkSession` bypasses the Analyzer 
> and directly calls the `sessionState.catalog.lookupRelation` API. This skips 
> the view resolution logic in our Analyzer rule `ResolveRelations`. This 
> internal API is widely used by various DDL commands and other internal APIs.
> Users might get a strange error caused by view resolution when the default 
> database is different.
> ```
> Table or view not found: t1; line 1 pos 14
> org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 
> pos 14
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> ```






[jira] [Resolved] (SPARK-22488) The view resolution in the SparkSession internal table() API

2017-11-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22488.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> The view resolution in the SparkSession internal table() API 
> -
>
> Key: SPARK-22488
> URL: https://issues.apache.org/jira/browse/SPARK-22488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>






[jira] [Updated] (SPARK-22496) beeline display operation log

2017-11-11 Thread StephenZou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StephenZou updated SPARK-22496:
---
Description: 
For now, when an end user runs queries in beeline or in Hue through STS, 
no logs are displayed; the end user has to wait until the job finishes or fails. 

Progress information is needed to inform end users how the job is running if 
they are not familiar with the YARN RM or the standalone Spark master UI. 

  was:
For now,when end user runs queries in beeline or in hue through STS, 
no logs are displayed, end user will wait until the job finishes or fails. 

some progress information is needed to inform end users how the job is running 
if they are not familiar with yarn RM or standalone spark master ui. 


> beeline display operation log
> -
>
> Key: SPARK-22496
> URL: https://issues.apache.org/jira/browse/SPARK-22496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StephenZou
>Priority: Minor
>
> For now, when an end user runs queries in beeline or in Hue through STS, 
> no logs are displayed; the end user has to wait until the job finishes or fails. 
> Progress information is needed to inform end users how the job is running if 
> they are not familiar with the YARN RM or the standalone Spark master UI. 






[jira] [Assigned] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests

2017-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21693:


Assignee: (was: Apache Spark)

> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
> -
>
> Key: SPARK-21693
> URL: https://issues.apache.org/jira/browse/SPARK-21693
> Project: Spark
>  Issue Type: Test
>  Components: Build, SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> We now sometimes reach the time limit of 1.5 hours, e.g. 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
> I requested an increase from an hour to 1.5 hours before, but it looks like we 
> should fix this in Spark. I have asked for this for my account a few times 
> before, but it looks like we can't increase this time limit again and again.
> I could identify two things that look like they take quite a bit of time:
> 1. Disabled cache feature in pull request builder, which ends up downloading 
> Maven dependencies (roughly 10ish mins)
> https://www.appveyor.com/docs/build-cache/
> {quote}
> Note: Saving cache is disabled in Pull Request builds.
> {quote}
> and also see 
> http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working
> This seems difficult to fix within Spark.
> 2. "MLlib classification algorithms" tests (30-35ish mins)
> The test below looks like it takes 30-35ish mins.
> {code}
> MLlib classification algorithms, except for tree-based algorithms: Spark 
> package found in SPARK_HOME: C:\projects\spark\bin\..
> ..
> {code}
> As a (I think) last resort, we could make a matrix for this test alone, so 
> that we run the other tests after one build and then run this test after 
> another build. For example, I run Scala tests by this workaround - 
> https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
> with 7 builds, each building and testing).
> I am also checking and testing other ways.






[jira] [Commented] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests

2017-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248524#comment-16248524
 ] 

Apache Spark commented on SPARK-21693:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/19722

> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
> -
>
> Key: SPARK-21693
> URL: https://issues.apache.org/jira/browse/SPARK-21693
> Project: Spark
>  Issue Type: Test
>  Components: Build, SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>






[jira] [Assigned] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests

2017-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21693:


Assignee: Apache Spark

> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
> -
>
> Key: SPARK-21693
> URL: https://issues.apache.org/jira/browse/SPARK-21693
> Project: Spark
>  Issue Type: Test
>  Components: Build, SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>






[jira] [Assigned] (SPARK-22462) SQL metrics missing after foreach operation on dataframe

2017-11-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22462:
---

Assignee: Liang-Chi Hsieh

> SQL metrics missing after foreach operation on dataframe
> 
>
> Key: SPARK-22462
> URL: https://issues.apache.org/jira/browse/SPARK-22462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
> Attachments: collect.png, foreach.png
>
>
> No SQL metrics are visible in the SQL tab of SparkUI when foreach is executed 
> on the DataFrame.
> e.g.
> {code}
> sql("select * from range(10)").collect()
> sql("select * from range(10)").foreach(a => Unit)
> sql("select * from range(10)").foreach(a => println(a))
> {code}
> See collect.png vs. foreach.png






[jira] [Resolved] (SPARK-22462) SQL metrics missing after foreach operation on dataframe

2017-11-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22462.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19689
[https://github.com/apache/spark/pull/19689]

> SQL metrics missing after foreach operation on dataframe
> 
>
> Key: SPARK-22462
> URL: https://issues.apache.org/jira/browse/SPARK-22462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
> Fix For: 2.3.0
>
> Attachments: collect.png, foreach.png
>
>






[jira] [Commented] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests

2017-11-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248470#comment-16248470
 ] 

Hyukjin Kwon commented on SPARK-21693:
--

For caching, yup, it looks like that's why the test failures are less frequent in 
the master branch.
Not sure about the other parts. I was first thinking of this build time, IIRC, but 
failed to come up with a good idea to speed it up.
I will investigate the single(?) test that takes 20ish(?) mins (IIRC) for now and 
share the results if I can't make a PR to fix it.


> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
> -
>
> Key: SPARK-21693
> URL: https://issues.apache.org/jira/browse/SPARK-21693
> Project: Spark
>  Issue Type: Test
>  Components: Build, SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>






[jira] [Resolved] (SPARK-22433) Linear regression R^2 train/test terminology related

2017-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22433.
---
Resolution: Won't Fix

I would say it's not essential to change anything given this discussion, but if 
someone feels strongly about changing that test, it is fine to do some day.

> Linear regression R^2 train/test terminology related 
> -
>
> Key: SPARK-22433
> URL: https://issues.apache.org/jira/browse/SPARK-22433
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Teng Peng
>Priority: Minor
>
> Traditional statistics is traditional statistics. Its goals, framework, and 
> terminology are not the same as ML's. However, in linear-regression-related 
> components, this distinction is not clear, which is reflected in:
> 1. regressionMetric + regressionEvaluator: 
> * R2 shouldn't be there. 
> * A better name would be "regressionPredictionMetric".
> 2. LinearRegressionSuite: 
> * We shouldn't test R2 and residuals on test data. 
> * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization". 
> Linear regression is linear; once a penalty term is added, it is no longer 
> linear. Just call it "LASSO" or "ElasticNet".
> There are more. I am working on correcting them.
> These issues are not breaking anything, but it does not feel good to see the 
> basic distinction blurred.






[jira] [Assigned] (SPARK-22476) Add new function dayofweek in R

2017-11-11 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22476:


Assignee: Hyukjin Kwon

> Add new function dayofweek in R
> ---
>
> Key: SPARK-22476
> URL: https://issues.apache.org/jira/browse/SPARK-22476
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> SQL was added in SPARK-20909.
> Scala and Python for {{dayofweek}} were added in SPARK-22456. 
> It looks like we should add it in R too (see the Scala sketch below for the API being mirrored).
> Please refer to both JIRAs for more details.
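For reference, a minimal sketch of the existing Scala API from SPARK-22456 that 
the proposed SparkR function would mirror (this needs Spark 2.3.0, where 
SPARK-22456 landed). The SparkSession setup, sample dates, and object name are 
illustrative assumptions, not from this issue.

{code}
// Minimal sketch (illustrative only): the Scala dayofweek function added in
// SPARK-22456. It returns 1 for Sunday through 7 for Saturday.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.dayofweek

object DayOfWeekSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dayofweek-sketch").getOrCreate()
    import spark.implicits._

    // 2017-11-11 was a Saturday, so dayofweek should yield 7 for it.
    val df = Seq("2017-11-11", "2017-11-12").toDF("date")
    df.select($"date", dayofweek($"date").alias("dow")).show()

    spark.stop()
  }
}
{code}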






[jira] [Resolved] (SPARK-22476) Add new function dayofweek in R

2017-11-11 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22476.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19706
[https://github.com/apache/spark/pull/19706]

> Add new function dayofweek in R
> ---
>
> Key: SPARK-22476
> URL: https://issues.apache.org/jira/browse/SPARK-22476
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> SQL was added in SPARK-20909.
> Scala and Python for {{dayofweek}} were added in SPARK-22456. 
> It looks like we should add it in R too.
> Please refer to both JIRAs for more details.






[jira] [Resolved] (SPARK-19759) ALSModel.predict on Dataframes : potential optimization by not using blas

2017-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19759.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19685
[https://github.com/apache/spark/pull/19685]

> ALSModel.predict on Dataframes : potential optimization by not using blas 
> --
>
> Key: SPARK-19759
> URL: https://issues.apache.org/jira/browse/SPARK-19759
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Sue Ann Hong
>Priority: Minor
> Fix For: 2.3.0
>
>
> In the DataFrame ALS prediction function, we use {{blas.sdot}}, which may be 
> slower due to the conversion to Arrays. We can try operating on Seqs or 
> another data structure to see whether avoiding the conversion makes the operation 
> faster (a sketch of a BLAS-free dot product follows this description). Ref: 
> https://github.com/apache/spark/pull/17090/files/707bc6b153a7f899fbf3fe2a5675cacba1f95711#diff-be65dd1d6adc53138156641b610fcada
>  
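For illustration, a minimal sketch of the kind of BLAS-free dot product the issue 
suggests exploring. This is not Spark's actual implementation; the object name, 
method name, and sample arrays are assumptions made up for this sketch.

{code}
// Minimal sketch (not Spark's actual code): computing the user-feature /
// item-feature dot product with a plain while loop over Float arrays,
// instead of going through blas.sdot and its array conversions.
object DotSketch {
  def dot(userFeatures: Array[Float], itemFeatures: Array[Float]): Float = {
    require(userFeatures.length == itemFeatures.length,
      "feature vectors must have the same length")
    var sum = 0.0f
    var i = 0
    while (i < userFeatures.length) {
      sum += userFeatures(i) * itemFeatures(i)
      i += 1
    }
    sum
  }

  def main(args: Array[String]): Unit = {
    val user = Array(0.1f, 0.2f, 0.3f)
    val item = Array(1.0f, 2.0f, 3.0f)
    println(dot(user, item))  // approximately 1.4 (0.1*1 + 0.2*2 + 0.3*3)
  }
}
{code}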






[jira] [Assigned] (SPARK-19759) ALSModel.predict on Dataframes : potential optimization by not using blas

2017-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19759:
-

Assignee: Marco Gaido

> ALSModel.predict on Dataframes : potential optimization by not using blas 
> --
>
> Key: SPARK-19759
> URL: https://issues.apache.org/jira/browse/SPARK-19759
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Sue Ann Hong
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.3.0
>
>
> In the DataFrame ALS prediction function, we use {{blas.sdot}}, which may be 
> slower due to the conversion to Arrays. We can try operating on Seqs or 
> another data structure to see whether avoiding the conversion makes the operation 
> faster. Ref: 
> https://github.com/apache/spark/pull/17090/files/707bc6b153a7f899fbf3fe2a5675cacba1f95711#diff-be65dd1d6adc53138156641b610fcada
>  






[jira] [Assigned] (SPARK-22496) beeline display operation log

2017-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22496:


Assignee: Apache Spark

> beeline display operation log
> -
>
> Key: SPARK-22496
> URL: https://issues.apache.org/jira/browse/SPARK-22496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StephenZou
>Assignee: Apache Spark
>Priority: Minor
>
> For now, when end users run queries in beeline or in Hue through STS, 
> no logs are displayed; end users must wait until the job finishes or fails. 
> Some progress information is needed to show end users how the job is 
> running if they are not familiar with the YARN RM or the standalone Spark master UI. 






[jira] [Assigned] (SPARK-22496) beeline display operation log

2017-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22496:


Assignee: (was: Apache Spark)

> beeline display operation log
> -
>
> Key: SPARK-22496
> URL: https://issues.apache.org/jira/browse/SPARK-22496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StephenZou
>Priority: Minor
>
> For now, when end users run queries in beeline or in Hue through STS, 
> no logs are displayed; end users must wait until the job finishes or fails. 
> Some progress information is needed to show end users how the job is 
> running if they are not familiar with the YARN RM or the standalone Spark master UI. 






[jira] [Commented] (SPARK-22496) beeline display operation log

2017-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248422#comment-16248422
 ] 

Apache Spark commented on SPARK-22496:
--

User 'ChenjunZou' has created a pull request for this issue:
https://github.com/apache/spark/pull/19721

> beeline display operation log
> -
>
> Key: SPARK-22496
> URL: https://issues.apache.org/jira/browse/SPARK-22496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StephenZou
>Priority: Minor
>
> For now, when end users run queries in beeline or in Hue through STS, 
> no logs are displayed; end users must wait until the job finishes or fails. 
> Some progress information is needed to show end users how the job is 
> running if they are not familiar with the YARN RM or the standalone Spark master UI. 






[jira] [Created] (SPARK-22496) beeline display operation log

2017-11-11 Thread StephenZou (JIRA)
StephenZou created SPARK-22496:
--

 Summary: beeline display operation log
 Key: SPARK-22496
 URL: https://issues.apache.org/jira/browse/SPARK-22496
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: StephenZou
Priority: Minor


For now, when end users run queries in beeline or in Hue through STS, 
no logs are displayed; end users must wait until the job finishes or fails. 

Some progress information is needed to show end users how the job is running 
if they are not familiar with the YARN RM or the standalone Spark master UI 
(a hedged client-side log-polling sketch follows below). 
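As an illustration only, here is a hedged sketch of the kind of client-side log 
polling that could surface progress, assuming the Thrift server exposes 
HiveServer2-style operation logs (which is what this issue asks for). The 
getQueryLog()/hasMoreLogs() calls are the HiveServer2 JDBC methods as I 
understand them; the JDBC URL, credentials, and table name are placeholders, 
and this is not the proposed patch.

{code}
// Hedged sketch (assumption: the server supports HiveServer2 operation logs):
// a JDBC client polls HiveStatement.getQueryLog() from a background thread
// while the query runs, roughly what beeline does to show progress.
import java.sql.DriverManager
import scala.collection.JavaConverters._
import org.apache.hive.jdbc.HiveStatement

object OperationLogSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder connection details; point this at your Spark Thrift Server.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
    val stmt = conn.createStatement().asInstanceOf[HiveStatement]

    // Background thread that prints operation-log lines while the query executes.
    val logThread = new Thread(new Runnable {
      override def run(): Unit = {
        while (stmt.hasMoreLogs) {
          stmt.getQueryLog().asScala.foreach(line => println(s"[operation log] $line"))
          Thread.sleep(1000)
        }
      }
    })
    logThread.setDaemon(true)
    logThread.start()

    // Placeholder query and table name.
    val rs = stmt.executeQuery("SELECT count(*) FROM some_table")
    while (rs.next()) {
      println(s"result: ${rs.getString(1)}")
    }

    conn.close()
  }
}
{code}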






[jira] [Commented] (SPARK-22406) pyspark version tag is wrong on PyPi

2017-11-11 Thread Holden Karau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248405#comment-16248405
 ] 

Holden Karau commented on SPARK-22406:
--

Yes, although this should be fixed in the documented upload process; it just 
has to be run at the end of the release before this can be verified as closed.



> pyspark version tag is wrong on PyPi
> 
>
> Key: SPARK-22406
> URL: https://issues.apache.org/jira/browse/SPARK-22406
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Kerrick Staley
>Assignee: holdenk
>Priority: Minor
>
> On pypi.python.org, the pyspark package is tagged with version 
> {{2.2.0.post0}}: https://pypi.python.org/pypi/pyspark/2.2.0
> However, when you install the package, it has version {{2.2.0}}.
> This has really annoying consequences: if you try {{pip install 
> pyspark==2.2.0}}, it won't work. Instead you have to do {{pip install 
> pyspark==2.2.0.post0}}. Then, if you later run the same command ({{pip 
> install pyspark==2.2.0.post0}}), it won't recognize the existing pyspark 
> installation (because it has version {{2.2.0}}) and instead will reinstall 
> it, which is very slow because pyspark is a large package.
> This can happen if you add a new package to a {{requirements.txt}} file; you 
> end up waiting a lot longer than necessary because every time you run {{pip 
> install -r requirements.txt}} it reinstalls pyspark.
> Can you please change the package on PyPi to have the version {{2.2.0}}?


