[jira] [Created] (SPARK-14868) Enable NewLineAtEofChecker in checkstyle and fix lint-java errors

2016-04-22 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-14868:
-

 Summary: Enable NewLineAtEofChecker in checkstyle and fix 
lint-java errors
 Key: SPARK-14868
 URL: https://issues.apache.org/jira/browse/SPARK-14868
 Project: Spark
  Issue Type: Task
  Components: Build
Reporter: Dongjoon Hyun
Priority: Minor


Spark enforces the `NewLineAtEofChecker` rule for Scala via ScalaStyle, and most Java 
code already complies with it. This issue aims to enforce the equivalent rule, 
`NewlineAtEndOfFile`, explicitly via Checkstyle (see the configuration sketch below). It 
also fixes the lint-java errors introduced since SPARK-14465.

This issue does the following:
* Adds a newline at the end of 19 files
* Fixes 25 lint-java errors (12 RedundantModifier, 6 ArrayTypeStyle, 2 
LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder)
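
For reference, enabling the check in Checkstyle looks roughly like the sketch below; the 
`NewlineAtEndOfFile` module name is the standard Checkstyle rule, while the surrounding 
configuration is illustrative rather than copied from the actual patch.

{code}
<!-- Illustrative sketch: require a trailing newline in every checked file. -->
<module name="Checker">
  <module name="NewlineAtEndOfFile"/>
</module>
{code}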



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14551) Reduce number of NameNode calls in OrcRelation with FileSourceStrategy mode

2016-04-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14551.
-
   Resolution: Fixed
 Assignee: Rajesh Balamohan
Fix Version/s: 2.0.0

> Reduce number of NameNode calls in OrcRelation with FileSourceStrategy mode
> ---
>
> Key: SPARK-14551
> URL: https://issues.apache.org/jira/browse/SPARK-14551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
> Fix For: 2.0.0
>
>
> When FileSourceStrategy is used, a record reader is created, which internally incurs a 
> NameNode (NN) call. Later, in OrcRelation.unwrapOrcStructs, it ends up reading 
> the file information again to get the ObjectInspector, which incurs an additional NN 
> call. It would be good to avoid this additional NN call (specifically for 
> partitioned datasets).






[jira] [Resolved] (SPARK-14866) Break SQLQuerySuite out into smaller test suites

2016-04-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14866.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Break SQLQuerySuite out into smaller test suites
> 
>
> Key: SPARK-14866
> URL: https://issues.apache.org/jira/browse/SPARK-14866
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>







[jira] [Commented] (SPARK-14654) New accumulator API

2016-04-22 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255159#comment-15255159
 ] 

Reynold Xin commented on SPARK-14654:
-

That only works for the built-in ones, and it can't work if there is more than one 
accumulator type that supports a given value type, e.g. double for both a double sum 
and an average.
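
To illustrate the ambiguity with a self-contained sketch (hypothetical names, simplified API): 
two accumulators can accept the same input type yet mean different things, so the value type 
alone cannot select the implementation.

{code}
// Sketch only (hypothetical names): both accumulators accept Double input,
// so dispatching on the value type alone cannot tell sum apart from avg.
trait Acc[IN, OUT] { def add(v: IN): Unit; def value: OUT }

class DoubleSum extends Acc[Double, Double] {
  private var sum = 0.0
  def add(v: Double): Unit = sum += v
  def value: Double = sum
}

class DoubleAvg extends Acc[Double, Double] {
  private var sum = 0.0
  private var count = 0L
  def add(v: Double): Unit = { sum += v; count += 1 }
  def value: Double = if (count == 0) 0.0 else sum / count
}

object Demo extends App {
  val accs = Seq(new DoubleSum, new DoubleAvg)
  accs.foreach { a => (1 to 4).foreach(i => a.add(i.toDouble)) }
  // Same inputs, same input type, different meanings: prints 10.0 then 2.5.
  accs.foreach(a => println(a.value))
}
{code}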


> New accumulator API
> ---
>
> Key: SPARK-14654
> URL: https://issues.apache.org/jira/browse/SPARK-14654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The current accumulator API has a few problems:
> 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, 
> AccumulatorParam, AccumulableParam, etc.
> 2. The intermediate buffer type must be the same as the output type, so there 
> is no way to define an accumulator that computes averages.
> 3. It is very difficult to specialize the methods, leading to excessive 
> boxing and making accumulators bad for metrics that change for each record.
> 4. There is not a single coherent API that works for both Java and Scala.
> This is a proposed new API that addresses all of the above. In this new API:
> 1. There is only a single user-facing class (Accumulator).
> 2. The intermediate value is stored in the accumulator itself and can be 
> different from the output type.
> 3. Concrete implementations can provide their own specialized methods.
> 4. It is designed to work for both Java and Scala.
> {code}
> abstract class Accumulator[IN, OUT] extends Serializable {
>   def isRegistered: Boolean = ...
>   def register(metadata: AccumulatorMetadata): Unit = ...
>   def metadata: AccumulatorMetadata = ...
>   def reset(): Unit
>   def add(v: IN): Unit
>   def merge(other: Accumulator[IN, OUT]): Unit
>   def value: OUT
>   def localValue: OUT = value
>   final def registerAccumulatorOnExecutor(): Unit = {
> // Automatically register the accumulator when it is deserialized with 
> the task closure.
> // This is for external accumulators and internal ones that do not 
> represent task level
> // metrics, e.g. internal SQL metrics, which are per-operator.
> val taskContext = TaskContext.get()
> if (taskContext != null) {
>   taskContext.registerAccumulator(this)
> }
>   }
>   // Called by Java when deserializing an object
>   private def readObject(in: ObjectInputStream): Unit = 
> Utils.tryOrIOException {
> in.defaultReadObject()
> registerAccumulator()
>   }
> }
> {code}
> Metadata, provided by Spark after registration:
> {code}
> class AccumulatorMetadata(
>   val id: Long,
>   val name: Option[String],
>   val countFailedValues: Boolean
> ) extends Serializable
> {code}
> and an implementation that also offers specialized getters and setters
> {code}
> class LongAccumulator extends Accumulator[jl.Long, jl.Long] {
>   private[this] var _sum = 0L
>   override def reset(): Unit = _sum = 0L
>   override def add(v: jl.Long): Unit = {
> _sum += v
>   }
>   override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other 
> match {
> case o: LongAccumulator => _sum += o.sum
> case _ => throw new UnsupportedOperationException(
>   s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
>   }
>   override def value: jl.Long = _sum
>   def sum: Long = _sum
> }
> {code}
> and SparkContext...
> {code}
> class SparkContext {
>   ...
>   def newLongAccumulator(): LongAccumulator
>   def newLongAccumulator(name: String): LongAccumulator
>   def newLongAccumulator(name: String, dedup: Boolean): LongAccumulator
>   def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): 
> Accumulator[IN, OUT]
>   ...
> }
> {code}
> To use it ...
> {code}
> val acc = sc.newLongAccumulator()
> sc.parallelize(1 to 1000).map { i =>
>   acc.add(1)
>   i
> }
> {code}
> A work-in-progress prototype here: 
> https://github.com/rxin/spark/tree/accumulator-refactor






[jira] [Resolved] (SPARK-14863) Cache TreeNode's hashCode

2016-04-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-14863.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12626
[https://github.com/apache/spark/pull/12626]

> Cache TreeNode's hashCode
> -
>
> Key: SPARK-14863
> URL: https://issues.apache.org/jira/browse/SPARK-14863
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> Caching TreeNode's hashCode can lead to orders-of-magnitude performance 
> improvement in certain optimizer rules when operating on data with 
> huge/complex schemas.
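
For context, a minimal sketch of the caching idea (illustrative only; this is not Spark's 
actual TreeNode implementation): compute the structural hash once and reuse it, which pays 
off when hash-based lookups are repeated over deep trees.

{code}
// Sketch only: caching a structural hashCode on an immutable tree node.
case class Node(name: String, children: Seq[Node]) {
  // Computed once on first use, then reused for every subsequent hashCode call.
  private lazy val cachedHashCode: Int = scala.util.hashing.MurmurHash3.productHash(this)
  override def hashCode(): Int = cachedHashCode
}
{code}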






[jira] [Resolved] (SPARK-14856) Returning batch unexpected from wide table

2016-04-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14856.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12619
[https://github.com/apache/spark/pull/12619]

> Returning batch unexpected from wide table
> --
>
> Key: SPARK-14856
> URL: https://issues.apache.org/jira/browse/SPARK-14856
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> When the required schema supports batched reads but the full schema does not, the 
> Parquet reader may return batches unexpectedly.






[jira] [Assigned] (SPARK-14867) Make `build/mvn` to use the downloaded maven if it exist.

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14867:


Assignee: Apache Spark

> Make `build/mvn` to use the downloaded maven if it exist.
> -
>
> Key: SPARK-14867
> URL: https://issues.apache.org/jira/browse/SPARK-14867
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Trivial
>
> Currently, `build/mvn` provides a convenient option, `--force`, to 
> use the recommended version of Maven without changing the PATH environment 
> variable.
> However, there are two problems:
> - `dev/lint-java` does not use the newly installed Maven.
> - It is inconvenient to type the `--force` option every time.
> Once we have used the `--force` option, we should prefer the Spark-recommended 
> Maven.
> This issue makes `build/mvn` first check for a Maven installation created by the 
> `--force` option.






[jira] [Assigned] (SPARK-14867) Make `build/mvn` to use the downloaded maven if it exist.

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14867:


Assignee: (was: Apache Spark)

> Make `build/mvn` to use the downloaded maven if it exist.
> -
>
> Key: SPARK-14867
> URL: https://issues.apache.org/jira/browse/SPARK-14867
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Currently, `build/mvn` provides a convenient option, `--force`, to 
> use the recommended version of Maven without changing the PATH environment 
> variable.
> However, there are two problems:
> - `dev/lint-java` does not use the newly installed Maven.
> - It is inconvenient to type the `--force` option every time.
> Once we have used the `--force` option, we should prefer the Spark-recommended 
> Maven.
> This issue makes `build/mvn` first check for a Maven installation created by the 
> `--force` option.






[jira] [Commented] (SPARK-14867) Make `build/mvn` to use the downloaded maven if it exist.

2016-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255144#comment-15255144
 ] 

Apache Spark commented on SPARK-14867:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/12631

> Make `build/mvn` to use the downloaded maven if it exist.
> -
>
> Key: SPARK-14867
> URL: https://issues.apache.org/jira/browse/SPARK-14867
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Currently, `build/mvn` provides a convenient option, `--force`, to 
> use the recommended version of Maven without changing the PATH environment 
> variable.
> However, there are two problems:
> - `dev/lint-java` does not use the newly installed Maven.
> - It is inconvenient to type the `--force` option every time.
> Once we have used the `--force` option, we should prefer the Spark-recommended 
> Maven.
> This issue makes `build/mvn` first check for a Maven installation created by the 
> `--force` option.






[jira] [Created] (SPARK-14867) Make `build/mvn` to use the downloaded maven if it exist.

2016-04-22 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-14867:
-

 Summary: Make `build/mvn` to use the downloaded maven if it exist.
 Key: SPARK-14867
 URL: https://issues.apache.org/jira/browse/SPARK-14867
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Dongjoon Hyun
Priority: Trivial


Currently, `build/mvn` provides a convenient option, `--force`, to use 
the recommended version of Maven without changing the PATH environment variable.

However, there are two problems.
- `dev/lint-java` does not use the newly installed Maven.
- It is inconvenient to type the `--force` option every time.

Once we have used the `--force` option, we should prefer the Spark-recommended 
Maven.

This issue makes `build/mvn` first check for a Maven installation created by the 
`--force` option.
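
A rough sketch of the intended check (illustrative shell only; the directory layout under 
build/ and the variable names are assumptions, not the actual patch):

{code}
#!/usr/bin/env bash
# Sketch: prefer a Maven previously downloaded by `build/mvn --force`, if present.
# The build/apache-maven-* location is an assumption for illustration.
FWDIR="$(cd "$(dirname "$0")"/..; pwd)"
DOWNLOADED_MVN="$(ls -d "${FWDIR}/build/apache-maven-"*/bin/mvn 2>/dev/null | head -n 1)"

if [ -x "${DOWNLOADED_MVN}" ]; then
  MVN_BIN="${DOWNLOADED_MVN}"   # use the Spark-recommended Maven
else
  MVN_BIN="$(command -v mvn)"   # fall back to whatever is on PATH
fi

"${MVN_BIN}" "$@"
{code}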






[jira] [Comment Edited] (SPARK-14865) When creating a view, we should verify the generated SQL string

2016-04-22 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255137#comment-15255137
 ] 

Xiao Li edited comment on SPARK-14865 at 4/23/16 4:54 AM:
--

If nobody has started on it, I can take it. Do we just verify that the SQL string can be 
parsed and analyzed?


was (Author: smilegator):
If nobody starts it, I can take it? We just verify if the SQL string can be 
analyzed? 

> When creating a view, we should verify the generated SQL string
> ---
>
> Key: SPARK-14865
> URL: https://issues.apache.org/jira/browse/SPARK-14865
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> After we generate the SQL string for a CREATE VIEW command, we should verify 
> the string before putting it into the metastore.






[jira] [Commented] (SPARK-14865) When creating a view, we should verify the generated SQL string

2016-04-22 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255137#comment-15255137
 ] 

Xiao Li commented on SPARK-14865:
-

If nobody has started on it, I can take it. Do we just verify that the SQL string can be 
analyzed?

> When creating a view, we should verify the generated SQL string
> ---
>
> Key: SPARK-14865
> URL: https://issues.apache.org/jira/browse/SPARK-14865
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> After we generate the SQL string for a CREATE VIEW command, we should verify 
> the string before putting it into the metastore.






[jira] [Commented] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource

2016-04-22 Thread Justin Pihony (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255136#comment-15255136
 ] 

Justin Pihony commented on SPARK-14525:
---

If I am to update jdbc.DefaultSource to be a CreatableRelationProvider, 
then that also means I have to make it a SchemaRelationProvider as well. This 
would require a change to the JDBCRelation class so that it can optionally 
accept a user-specified schema. This is all possible, and I see it as a change 
from:

{code}
override val schema: StructType = JDBCRDD.resolveTable(url, table, properties)
{code}

To:

{code}
override val schema: StructType = {
  val resolvedSchema = JDBCRDD.resolveTable(url, table, properties)
  providedSchemaOption match {
    case Some(providedSchema) =>
      if (providedSchema == resolvedSchema) resolvedSchema
      else sys.error("User specified schema does not match the actual schema")
    case None => resolvedSchema
  }
}
{code}

Or, do the checking on initialization, which would not be lazy.

Thoughts/Preferences? Should I just skip making it a CreatableRelationProvider 
if none of the above work?

> DataFrameWriter's save method should delegate to jdbc for jdbc datasource
> -
>
> Key: SPARK-14525
> URL: https://issues.apache.org/jira/browse/SPARK-14525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Justin Pihony
>Priority: Minor
>
> If you call {code}df.write.format("jdbc")...save(){code} then you get an 
> error  
> bq. org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not 
> allow create table as select
> save is a more intuitive guess on the appropriate method to call, so the user 
> should not be punished for not knowing about the jdbc method. 
> Obviously, this will require the caller to have set up the correct parameters 
> for jdbc to work :)






[jira] [Closed] (SPARK-13891) Issue an exception when hitting max iteration limit in testing

2016-04-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-13891.
---
Resolution: Duplicate

> Issue an exception when hitting max iteration limit in testing
> --
>
> Key: SPARK-13891
> URL: https://issues.apache.org/jira/browse/SPARK-13891
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Issue an exception in the unit tests of Spark SQL when hitting the max 
> iteration limit. Then, we can catch the infinite loop bugs in Analyzer and 
> Optimizer.






[jira] [Commented] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource

2016-04-22 Thread Justin Pihony (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255130#comment-15255130
 ] 

Justin Pihony commented on SPARK-14525:
---

To address any concerns about converting a Properties object to a 
{code}Map[String,String]{code}, please refer to this [StackOverflow 
question|https://stackoverflow.com/questions/873510/why-does-java-util-properties-implement-mapobject-object-and-not-mapstring-st]
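
For reference, a small sketch of the conversion involved (an assumed helper, not code from 
the PR): java.util.Properties stores Object keys and values, so going to an immutable 
Map[String, String] means stringifying both sides and copying into an immutable map.

{code}
// Sketch only: convert java.util.Properties to an immutable Map[String, String].
import java.util.Properties
import scala.collection.JavaConverters._

def propertiesToMap(props: Properties): Map[String, String] =
  props.asScala.map { case (k, v) => k.toString -> v.toString }.toMap
{code}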

> DataFrameWriter's save method should delegate to jdbc for jdbc datasource
> -
>
> Key: SPARK-14525
> URL: https://issues.apache.org/jira/browse/SPARK-14525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Justin Pihony
>Priority: Minor
>
> If you call {code}df.write.format("jdbc")...save(){code} then you get an 
> error  
> bq. org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not 
> allow create table as select
> save is a more intuitive guess on the appropriate method to call, so the user 
> should not be punished for not knowing about the jdbc method. 
> Obviously, this will require the caller to have set up the correct parameters 
> for jdbc to work :)






[jira] [Commented] (SPARK-14654) New accumulator API

2016-04-22 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255123#comment-15255123
 ] 

holdenk commented on SPARK-14654:
-

After giving it a bit of thought on the flight: is there a reason why we don't want 
to use reflection in the Java API (e.g. something like 
https://github.com/holdenk/spark/tree/alternative-java-scala-acc-api)? It 
seems like it would make it even simpler.

> New accumulator API
> ---
>
> Key: SPARK-14654
> URL: https://issues.apache.org/jira/browse/SPARK-14654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The current accumulator API has a few problems:
> 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, 
> AccumulatorParam, AccumulableParam, etc.
> 2. The intermediate buffer type must be the same as the output type, so there 
> is no way to define an accumulator that computes averages.
> 3. It is very difficult to specialize the methods, leading to excessive 
> boxing and making accumulators bad for metrics that change for each record.
> 4. There is not a single coherent API that works for both Java and Scala.
> This is a proposed new API that addresses all of the above. In this new API:
> 1. There is only a single user-facing class (Accumulator).
> 2. The intermediate value is stored in the accumulator itself and can be 
> different from the output type.
> 3. Concrete implementations can provide their own specialized methods.
> 4. It is designed to work for both Java and Scala.
> {code}
> abstract class Accumulator[IN, OUT] extends Serializable {
>   def isRegistered: Boolean = ...
>   def register(metadata: AccumulatorMetadata): Unit = ...
>   def metadata: AccumulatorMetadata = ...
>   def reset(): Unit
>   def add(v: IN): Unit
>   def merge(other: Accumulator[IN, OUT]): Unit
>   def value: OUT
>   def localValue: OUT = value
>   final def registerAccumulatorOnExecutor(): Unit = {
> // Automatically register the accumulator when it is deserialized with 
> the task closure.
> // This is for external accumulators and internal ones that do not 
> represent task level
> // metrics, e.g. internal SQL metrics, which are per-operator.
> val taskContext = TaskContext.get()
> if (taskContext != null) {
>   taskContext.registerAccumulator(this)
> }
>   }
>   // Called by Java when deserializing an object
>   private def readObject(in: ObjectInputStream): Unit = 
> Utils.tryOrIOException {
> in.defaultReadObject()
> registerAccumulator()
>   }
> }
> {code}
> Metadata, provided by Spark after registration:
> {code}
> class AccumulatorMetadata(
>   val id: Long,
>   val name: Option[String],
>   val countFailedValues: Boolean
> ) extends Serializable
> {code}
> and an implementation that also offers specialized getters and setters
> {code}
> class LongAccumulator extends Accumulator[jl.Long, jl.Long] {
>   private[this] var _sum = 0L
>   override def reset(): Unit = _sum = 0L
>   override def add(v: jl.Long): Unit = {
> _sum += v
>   }
>   override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other 
> match {
> case o: LongAccumulator => _sum += o.sum
> case _ => throw new UnsupportedOperationException(
>   s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
>   }
>   override def value: jl.Long = _sum
>   def sum: Long = _sum
> }
> {code}
> and SparkContext...
> {code}
> class SparkContext {
>   ...
>   def newLongAccumulator(): LongAccumulator
>   def newLongAccumulator(name: String): LongAccumulator
>   def newLongAccumulator(name: String, dedup: Boolean): LongAccumulator
>   def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): 
> Accumulator[IN, OUT]
>   ...
> }
> {code}
> To use it ...
> {code}
> val acc = sc.newLongAccumulator()
> sc.parallelize(1 to 1000).map { i =>
>   acc.add(1)
>   i
> }
> {code}
> A work-in-progress prototype here: 
> https://github.com/rxin/spark/tree/accumulator-refactor






[jira] [Commented] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2016-04-22 Thread Ben McCann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255122#comment-15255122
 ] 

Ben McCann commented on SPARK-7008:
---

I've found a number of implementations:
https://github.com/zhengruifeng/spark-libFM
https://github.com/skrusche63/spark-fm
https://github.com/blebreton/spark-FM-parallelSGD
https://github.com/witgo/zen/tree/master/ml/src/main/scala/com/github/cloudml/zen/ml/recommendation

> An implementation of Factorization Machine (LibFM)
> --
>
> Key: SPARK-7008
> URL: https://issues.apache.org/jira/browse/SPARK-7008
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: zhengruifeng
>  Labels: features
> Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, 
> QQ20150421-2.png
>
>
> An implementation of Factorization Machines based on Scala and Spark MLlib.
> FM is a machine learning algorithm for multi-linear regression and 
> is widely used for recommendation.
> FM has performed well in recent years' recommendation competitions.
> Ref:
> http://libfm.org/
> http://doi.acm.org/10.1145/2168752.2168771
> http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
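
For background, the second-order FM model from the references above predicts the following 
(added here for context; it is not part of the original issue text):

{code}
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
  + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
{code}

Here w_0 is the global bias, the w_i are linear weights, and each v_i is a k-dimensional 
latent vector; the inner products model pairwise feature interactions.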






[jira] [Assigned] (SPARK-14866) Break SQLQuerySuite out into smaller test suites

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14866:


Assignee: Apache Spark  (was: Reynold Xin)

> Break SQLQuerySuite out into smaller test suites
> 
>
> Key: SPARK-14866
> URL: https://issues.apache.org/jira/browse/SPARK-14866
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-14866) Break SQLQuerySuite out into smaller test suites

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14866:


Assignee: Reynold Xin  (was: Apache Spark)

> Break SQLQuerySuite out into smaller test suites
> 
>
> Key: SPARK-14866
> URL: https://issues.apache.org/jira/browse/SPARK-14866
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>







[jira] [Commented] (SPARK-14866) Break SQLQuerySuite out into smaller test suites

2016-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255117#comment-15255117
 ] 

Apache Spark commented on SPARK-14866:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12630

> Break SQLQuerySuite out into smaller test suites
> 
>
> Key: SPARK-14866
> URL: https://issues.apache.org/jira/browse/SPARK-14866
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>







[jira] [Created] (SPARK-14866) Break SQLQuerySuite out into smaller test suites

2016-04-22 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-14866:
---

 Summary: Break SQLQuerySuite out into smaller test suites
 Key: SPARK-14866
 URL: https://issues.apache.org/jira/browse/SPARK-14866
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin









[jira] [Resolved] (SPARK-14842) Implement view creation in sql/core

2016-04-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14842.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Implement view creation in sql/core
> ---
>
> Key: SPARK-14842
> URL: https://issues.apache.org/jira/browse/SPARK-14842
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>







[jira] [Assigned] (SPARK-14800) Dealing with null as a value in options for each internal data source

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14800:


Assignee: Apache Spark

> Dealing with null as a value in options for each internal data source
> -
>
> Key: SPARK-14800
> URL: https://issues.apache.org/jira/browse/SPARK-14800
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> This was found in https://github.com/apache/spark/pull/12494.
> Currently, most options in data sources throw {{NullPointerException}} 
> when the given value is {{null}}.
> For example, the code below:
> {code}
> sqlContext.read
>   .format("csv")
>   .option("compression", null)
> {code}
> throws an exception below:
> {code}
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.getChar(CSVOptions.scala:32)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:68)
> at 
> org.apache.spark.sql.execution.datasources.csv.DefaultSource.prepareWrite(DefaultSource.scala:86)
> ...
> {code}
> while some options, such as {{nullValue}} in the CSV data source, accept 
> {{null}}.
> So, the options that throw {{NullPointerException}} when set to {{null}} 
> might have to be handled separately for each data source.
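
One possible shape for such handling, sketched for illustration (the helper name and error 
messages below are assumptions, not the proposed fix): reject an explicit {{null}} during 
option parsing with a clear error rather than letting it surface later as a 
{{NullPointerException}}.

{code}
// Sketch only: a defensive lookup for data source options, rejecting explicit nulls
// with a descriptive error rather than failing later with a NullPointerException.
def getRequiredOption(parameters: Map[String, String], key: String): String = {
  parameters.get(key) match {
    case Some(null)  => throw new IllegalArgumentException(s"Option '$key' must not be null.")
    case Some(value) => value
    case None        => throw new IllegalArgumentException(s"Option '$key' is required.")
  }
}
{code}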






[jira] [Commented] (SPARK-14800) Dealing with null as a value in options for each internal data source

2016-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255099#comment-15255099
 ] 

Apache Spark commented on SPARK-14800:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/12629

> Dealing with null as a value in options for each internal data source
> -
>
> Key: SPARK-14800
> URL: https://issues.apache.org/jira/browse/SPARK-14800
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This was found in https://github.com/apache/spark/pull/12494.
> Currently, most options in data sources throw {{NullPointerException}} 
> when the given value is {{null}}.
> For example, the code below:
> {code}
> sqlContext.read
>   .format("csv")
>   .option("compression", null)
> {code}
> throws an exception below:
> {code}
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.getChar(CSVOptions.scala:32)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:68)
> at 
> org.apache.spark.sql.execution.datasources.csv.DefaultSource.prepareWrite(DefaultSource.scala:86)
> ...
> {code}
> while some options, such as {{nullValue}} in the CSV data source, accept 
> {{null}}.
> So, the options that throw {{NullPointerException}} when set to {{null}} 
> might have to be handled separately for each data source.






[jira] [Assigned] (SPARK-14800) Dealing with null as a value in options for each internal data source

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14800:


Assignee: (was: Apache Spark)

> Dealing with null as a value in options for each internal data source
> -
>
> Key: SPARK-14800
> URL: https://issues.apache.org/jira/browse/SPARK-14800
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This was found in https://github.com/apache/spark/pull/12494.
> Currently, most options in data sources throw {{NullPointerException}} 
> when the given value is {{null}}.
> For example, the code below:
> {code}
> sqlContext.read
>   .format("csv")
>   .option("compression", null)
> {code}
> throws an exception below:
> {code}
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.getChar(CSVOptions.scala:32)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:68)
> at 
> org.apache.spark.sql.execution.datasources.csv.DefaultSource.prepareWrite(DefaultSource.scala:86)
> ...
> {code}
> while some options, such as {{nullValue}} in the CSV data source, accept 
> {{null}}.
> So, the options that throw {{NullPointerException}} when set to {{null}} 
> might have to be handled separately for each data source.






[jira] [Created] (SPARK-14865) When creating a view, we should verify the generated SQL string

2016-04-22 Thread Yin Huai (JIRA)
Yin Huai created SPARK-14865:


 Summary: When creating a view, we should verify the generated SQL 
string
 Key: SPARK-14865
 URL: https://issues.apache.org/jira/browse/SPARK-14865
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Priority: Critical


After we generate the SQL string for a CREATE VIEW command, we should verify 
the string before putting it into the metastore.
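
A rough sketch of one way such verification could look (an illustrative assumption, not the 
eventual implementation): round-trip the generated string through the parser and analyzer 
and fail fast if either step throws.

{code}
// Sketch only: verify a generated view SQL string by parsing and analyzing it
// before persisting the view definition to the metastore.
def verifyGeneratedSql(sqlContext: org.apache.spark.sql.SQLContext, viewSql: String): Unit = {
  try {
    // Forcing queryExecution.analyzed runs both the parser and the analyzer.
    sqlContext.sql(viewSql).queryExecution.analyzed
  } catch {
    case e: Exception =>
      throw new RuntimeException(s"Generated view SQL is invalid: $viewSql", e)
  }
}
{code}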






[jira] [Updated] (SPARK-14757) Incorrect behavior of Join operation in Spark SQL JOIN: "false" in the left table is joined to "null" on the right table

2016-04-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14757:

Target Version/s: 1.6.1, 2.0.0

> Incorrect behavior of Join operation in Spark SQL JOIN: "false" in the left 
> table is joined to "null" on the right table
> -
>
> Key: SPARK-14757
> URL: https://issues.apache.org/jira/browse/SPARK-14757
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hong Huang
>
> Content of table a:
> |outgoing_0|
> | false |
> |  true |
> |  null  |
> a has only one field: outgoing_0 
> Content of table b:
> |outgoing_1|
> | false  |
> |  true  |
> |  null   |
> b has only one field: outgoing_1
> After running this query:
> select * from a FULL JOIN b ON ( outgoing_0<=>outgoing_1)
> I got the following result:
> |outgoing_0|outgoing_1|
> |  true  |  true  |
> | false  | false |
> | false  |  null  |
> |  null   |  null  |
> The row with "false" as outgoing_0 and "null" as outgoing_1 is unexpected. 
> The operator <=> should match null with null. 
> While left "false" is matched with right "null", it is also strange to find 
> that the "false" on the right table does not match with "null" on the left 
> table (no row with "null" as outgoing_0 and "false" as outgoing_1)
> You can easily reproduce this bug by pasting the following code fragment:
> case class A( outgoing_0: Option[Boolean] )
> case class B( outgoing_1: Option[Boolean] )
> val a = sc.parallelize( Seq(
>   A( Some( false ) ),
>   A( Some( true ) ),
>   A( None )
> ) ).toDF()
> a.show
> val b = sc.parallelize( Seq(
>   B( Some( false ) ),
>   B( Some( true ) ),
>   B( None )
> ) ).toDF()
> b.show
> a.registerTempTable( "a" )
> b.registerTempTable( "b" )
> sqlContext.sql( "select * from a FULL JOIN b ON ( 
> outgoing_0<=>outgoing_1)" ).show()
> 






[jira] [Updated] (SPARK-14757) Incorrect behavior of Join operation in Spark SQL JOIN: "false" in the left table is joined to "null" on the right table

2016-04-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14757:

Target Version/s: 1.6.2, 2.0.0  (was: 1.6.1, 2.0.0)

> Incorrect behavior of Join operation in Spark SQL JOIN: "false" in the left 
> table is joined to "null" on the right table
> -
>
> Key: SPARK-14757
> URL: https://issues.apache.org/jira/browse/SPARK-14757
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hong Huang
>
> Content of table a:
> |outgoing_0|
> | false |
> |  true |
> |  null  |
> a has only one field: outgoing_0 
> Content of table b:
> |outgoing_1|
> | false  |
> |  true  |
> |  null   |
> b has only one field: outgoing_1
> After running this query:
> select * from a FULL JOIN b ON ( outgoing_0<=>outgoing_1)
> I got the following result:
> |outgoing_0|outgoing_1|
> |  true  |  true  |
> | false  | false |
> | false  |  null  |
> |  null   |  null  |
> The row with "false" as outgoing_0 and "null" as outgoing_1 is unexpected. 
> The operator <=> should match null with null. 
> While left "false" is matched with right "null", it is also strange to find 
> that the "false" on the right table does not match with "null" on the left 
> table (no row with "null" as outgoing_0 and "false" as outgoing_1)
> You can easily reproduce this bug by pasting the following code fragment:
> case class A( outgoing_0: Option[Boolean] )
> case class B( outgoing_1: Option[Boolean] )
> val a = sc.parallelize( Seq(
>   A( Some( false ) ),
>   A( Some( true ) ),
>   A( None )
> ) ).toDF()
> a.show
> val b = sc.parallelize( Seq(
>   B( Some( false ) ),
>   B( Some( true ) ),
>   B( None )
> ) ).toDF()
> b.show
> a.registerTempTable( "a" )
> b.registerTempTable( "b" )
> sqlContext.sql( "select * from a FULL JOIN b ON ( 
> outgoing_0<=>outgoing_1)" ).show()
> 






[jira] [Updated] (SPARK-14862) Tree and ensemble classification: do not require label metadata

2016-04-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14862:
--
Description: 
spark.ml DecisionTreeClassifier, RandomForestClassifier, and GBTClassifier 
require that the labelCol have metadata specifying the number of classes.  
Instead, if the number of classes is not specified, we should automatically 
scan the column to identify numClasses.

This differs from [SPARK-7126] in that this requires labels to be indexed (but 
without metadata).  This issue is not for supporting String labels.

Note: This could cause problems with very small datasets + cross validation if 
there are k classes but class index k-1 does not appear in the training data.  
We should make sure the error thrown helps the user understand the solution, 
which is probably to use StringIndexer to index the whole dataset's labelCol 
before doing cross validation.


  was:
spark.ml DecisionTreeClassifier, RandomForestClassifier, and GBTClassifier 
require that the labelCol have metadata specifying the number of classes.  
Instead, if the number of classes is not specified, we should automatically 
scan the column to identify numClasses.

Note: This could cause problems with very small datasets + cross validation if 
there are k classes but class index k-1 does not appear in the training data.  
We should make sure the error thrown helps the user understand the solution, 
which is probably to use StringIndexer to index the whole dataset's labelCol 
before doing cross validation.


> Tree and ensemble classification: do not require label metadata
> ---
>
> Key: SPARK-14862
> URL: https://issues.apache.org/jira/browse/SPARK-14862
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> spark.ml DecisionTreeClassifier, RandomForestClassifier, and GBTClassifier 
> require that the labelCol have metadata specifying the number of classes.  
> Instead, if the number of classes is not specified, we should automatically 
> scan the column to identify numClasses.
> This differs from [SPARK-7126] in that this requires labels to be indexed 
> (but without metadata).  This issue is not for supporting String labels.
> Note: This could cause problems with very small datasets + cross validation 
> if there are k classes but class index k-1 does not appear in the training 
> data.  We should make sure the error thrown helps the user understand the 
> solution, which is probably to use StringIndexer to index the whole dataset's 
> labelCol before doing cross validation.
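
As a sketch of the suggested workaround using the existing spark.ml StringIndexer 
(illustrative column names; `dataset` is assumed to be an existing DataFrame with a 
"label" column): index the label column over the whole dataset once, so every fold of 
cross validation sees the full set of class indices.

{code}
// Sketch only: index labels over the full dataset before cross validation,
// so the label metadata covers all k classes in every fold.
import org.apache.spark.ml.feature.StringIndexer

// `dataset` is assumed to be an existing DataFrame with a "label" column.
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

val indexed = indexer.fit(dataset).transform(dataset)
// Use "indexedLabel" as the labelCol for the classifier inside CrossValidator.
{code}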






[jira] [Comment Edited] (SPARK-14694) Thrift Server + Hive Metastore + Kerberos doesn't work

2016-04-22 Thread zhangguancheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255036#comment-15255036
 ] 

zhangguancheng edited comment on SPARK-14694 at 4/23/16 2:08 AM:
-

Content of hive-site.xml:
{quote}




hive.server2.thrift.port
1 



hive.metastore.sasl.enabled
true 



hive.metastore.kerberos.keytab.file
/opt/hive/apache-hive-1.1.1-bin/conf/hive.keytab 



hive.metastore.kerberos.principal
hive/c1@C1 



hive.server2.authentication
KERBEROS 



hive.server2.authentication.kerberos.principal
hive/c1@C1 



hive.server2.authentication.kerberos.keytab
/opt/hive/apache-hive-1.1.1-bin/conf/hive.keytab 



  javax.jdo.option.ConnectionURL
  jdbc:mysql://localhost/test
  the URL of the MySQL database



  javax.jdo.option.ConnectionDriverName
  com.mysql.jdbc.Driver



  javax.jdo.option.ConnectionUserName
  xxx



  javax.jdo.option.ConnectionPassword
  x



  datanucleus.autoCreateSchema
  false



  datanucleus.fixedDatastore
  true



  hive.metastore.uris
  thrift://localhost:9083
  IP address (or fully-qualified domain name) and port of the 
metastore host



{quote}



was (Author: zhangguancheng):
Content of hive-site.xml:
{quote}




hive.server2.thrift.port
1 



hive.metastore.sasl.enabled
true 



hive.metastore.kerberos.keytab.file
/Users/zhangguancheng/Documents/github/bigdata/hive/apache-hive-1.1.1-bin/conf/hive.keytab
 



hive.metastore.kerberos.principal
hive/c1@C1 



hive.server2.authentication
KERBEROS 



hive.server2.authentication.kerberos.principal
hive/c1@C1 



hive.server2.authentication.kerberos.keytab
/Users/zhangguancheng/Documents/github/bigdata/hive/apache-hive-1.1.1-bin/conf/hive.keytab
 



  javax.jdo.option.ConnectionURL
  jdbc:mysql://localhost/test
  the URL of the MySQL database



  javax.jdo.option.ConnectionDriverName
  com.mysql.jdbc.Driver



  javax.jdo.option.ConnectionUserName
  test



  javax.jdo.option.ConnectionPassword
  test123



  datanucleus.autoCreateSchema
  false



  datanucleus.fixedDatastore
  true



  hive.metastore.uris
  thrift://localhost:9083
  IP address (or fully-qualified domain name) and port of the 
metastore host



{quote}


> Thrift Server + Hive Metastore + Kerberos doesn't work
> --
>
> Key: SPARK-14694
> URL: https://issues.apache.org/jira/browse/SPARK-14694
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1
> Environment: Spark 1.6.1. compiled with hadoop 2.6.0, yarn, hive
> Hadoop 2.6.4 
> Hive 1.1.1 
> Kerberos
>Reporter: zhangguancheng
>  Labels: security
>
> My Hive Metastore is MySQL based. I started a Spark Thrift Server on the same 
> node as the Hive Metastore. I can open beeline and run select statements but 
> for some commands like "show databases", I get an error:
> {quote}
> ERROR pool-24-thread-1 org.apache.thrift.transport.TSaslTransport:315 SASL 
> negotiation failure
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
> at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
> at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:236)
> at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> 

[jira] [Commented] (SPARK-14694) Thrift Server + Hive Metastore + Kerberos doesn't work

2016-04-22 Thread zhangguancheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255036#comment-15255036
 ] 

zhangguancheng commented on SPARK-14694:


Content of hive-site.xml:
{quote}




hive.server2.thrift.port
1 



hive.metastore.sasl.enabled
true 



hive.metastore.kerberos.keytab.file
/Users/zhangguancheng/Documents/github/bigdata/hive/apache-hive-1.1.1-bin/conf/hive.keytab
 



hive.metastore.kerberos.principal
hive/c1@C1 



hive.server2.authentication
KERBEROS 



hive.server2.authentication.kerberos.principal
hive/c1@C1 



hive.server2.authentication.kerberos.keytab
/Users/zhangguancheng/Documents/github/bigdata/hive/apache-hive-1.1.1-bin/conf/hive.keytab
 



  javax.jdo.option.ConnectionURL
  jdbc:mysql://localhost/test
  the URL of the MySQL database



  javax.jdo.option.ConnectionDriverName
  com.mysql.jdbc.Driver



  javax.jdo.option.ConnectionUserName
  test



  javax.jdo.option.ConnectionPassword
  test123



  datanucleus.autoCreateSchema
  false



  datanucleus.fixedDatastore
  true



  hive.metastore.uris
  thrift://localhost:9083
  IP address (or fully-qualified domain name) and port of the 
metastore host



{quote}


> Thrift Server + Hive Metastore + Kerberos doesn't work
> --
>
> Key: SPARK-14694
> URL: https://issues.apache.org/jira/browse/SPARK-14694
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1
> Environment: Spark 1.6.1. compiled with hadoop 2.6.0, yarn, hive
> Hadoop 2.6.4 
> Hive 1.1.1 
> Kerberos
>Reporter: zhangguancheng
>  Labels: security
>
> My Hive Metastore is MySQL based. I started a Spark Thrift Server on the same 
> node as the Hive Metastore. I can open beeline and run select statements but 
> for some commands like "show databases", I get an error:
> {quote}
> ERROR pool-24-thread-1 org.apache.thrift.transport.TSaslTransport:315 SASL 
> negotiation failure
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
> at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
> at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:236)
> at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
> at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
> at org.apache.hadoop.hive.ql.exec.DDLTask.showDatabases(DDLTask.java:2223)
> at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:385)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
> at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1653)
> at 

[jira] [Resolved] (SPARK-14807) Create a compatibility module

2016-04-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-14807.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12580
[https://github.com/apache/spark/pull/12580]

> Create a compatibility module
> -
>
> Key: SPARK-14807
> URL: https://issues.apache.org/jira/browse/SPARK-14807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> In 2.0, SparkSession will replace SQLContext/HiveContext. We will move 
> HiveContext to a compatibility module, and users can optionally use this 
> module to access HiveContext. 
> This JIRA is to create that compatibility module.






[jira] [Resolved] (SPARK-14855) Add "Exec" suffix to all physical operators

2016-04-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14855.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Add "Exec" suffix to all physical operators
> ---
>
> Key: SPARK-14855
> URL: https://issues.apache.org/jira/browse/SPARK-14855
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> Our physical operators have identical names to our logical operators, which 
> causes some issues in code reviews and refactoring. It would be easier if 
> they were named differently so we could quickly tell which is which.






[jira] [Commented] (SPARK-14817) ML 2.0 QA: Programming guide update and migration guide

2016-04-22 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254974#comment-15254974
 ] 

zhengruifeng commented on SPARK-14817:
--

count me in too ☺

> ML 2.0 QA: Programming guide update and migration guide
> ---
>
> Key: SPARK-14817
> URL: https://issues.apache.org/jira/browse/SPARK-14817
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>
> Before the release, we need to update the MLlib Programming Guide.  Updates 
> will include:
> * Make the DataFrame-based API (spark.ml) front-and-center, to make it clear 
> the RDD-based API is the older, maintenance-mode one.
> ** No docs for spark.mllib will be deleted; they will just be reorganized and 
> put in a subsection.
> ** If spark.ml docs are less complete, or if spark.ml docs say "refer to the 
> spark.mllib docs for details," then we should copy those details to the 
> spark.ml docs.
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs.
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work (which should be broken into pieces for 
> this larger 2.0 release).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14862) Tree and ensemble classification: do not require label metadata

2016-04-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-14862:
-

Assignee: Joseph K. Bradley

> Tree and ensemble classification: do not require label metadata
> ---
>
> Key: SPARK-14862
> URL: https://issues.apache.org/jira/browse/SPARK-14862
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> spark.ml DecisionTreeClassifier, RandomForestClassifier, and GBTClassifier 
> require that the labelCol have metadata specifying the number of classes.  
> Instead, if the number of classes is not specified, we should automatically 
> scan the column to identify numClasses.
> Note: This could cause problems with very small datasets + cross validation 
> if there are k classes but class index k-1 does not appear in the training 
> data.  We should make sure the error thrown helps the user understand the 
> solution, which is probably to use StringIndexer to index the whole dataset's 
> labelCol before doing cross validation.
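For illustration, a minimal sketch of the StringIndexer workaround mentioned above; the column names "label" and "features" and the DataFrame {{dataset}} are assumptions, not part of the issue:

{code}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.classification.DecisionTreeClassifier

// Sketch only (assumed columns "label" and "features" on a DataFrame `dataset`):
// StringIndexer scans the label column once and attaches the class-count
// metadata that the tree classifiers currently require.
val indexerModel = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(dataset)

val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")

val model = dt.fit(indexerModel.transform(dataset))
{code}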



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14832) Refactor DataSource to ensure schema is inferred only once when creating a file stream

2016-04-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-14832.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> Refactor DataSource to ensure schema is inferred only once when creating a 
> file stream
> --
>
> Key: SPARK-14832
> URL: https://issues.apache.org/jira/browse/SPARK-14832
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.0
>
>
> When creating a file stream using sqlContext.write.stream(), existing files 
> are scanned twice for finding the schema 
> - Once, when creating a DataSource + StreamingRelation in the 
> DataFrameReader.stream()
> - Again, when creating streaming Source from the DataSource, in 
> DataSource.createSource()
> Instead, the schema should be generated only once, at the time of creating 
> the dataframe, and when the streaming source is created, it should just reuse 
> that schema



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14757) Incorrect behavior of Join operation in Spark SQL JOIN : "false" in the left table is joined to "null" on the right table

2016-04-22 Thread Arash Nabili (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254949#comment-15254949
 ] 

Arash Nabili commented on SPARK-14757:
--

I have submitted a PR to the master branch which should fix this issue.

> Incorrect behavior of Join operation in Spark SQL JOIN : "false" in the left 
> table is joined to "null" on the right table
> -
>
> Key: SPARK-14757
> URL: https://issues.apache.org/jira/browse/SPARK-14757
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hong Huang
>
> Content of table a:
> |outgoing_0|
> | false |
> |  true |
> |  null  |
> a has only one field: outgoing_0 
> Content of table b:
> |outgoing_1|
> | false  |
> |  true  |
> |  null   |
> b has only one field: outgoing_1
> After running this query:
> select * from a FULL JOIN b ON ( outgoing_0<=>outgoing_1)
> I got the following result:
> |outgoing_0|outgoing_1|
> |  true  |  true  |
> | false  | false |
> | false  |  null  |
> |  null   |  null  |
> The row with "false" as outgoing_0 and "null" as outgoing_1 is unexpected. 
> The operator <=> should match null with null. 
> While the left table's "false" is matched with the right table's "null", it is 
> also strange that the right table's "false" is not matched with the left 
> table's "null" (there is no row with "null" as outgoing_0 and "false" as outgoing_1).
> You can easily reproduce this bug by pasting the following code fragment:
> case class A( outgoing_0: Option[Boolean] )
> case class B( outgoing_1: Option[Boolean] )
> val a = sc.parallelize( Seq(
>   A( Some( false ) ),
>   A( Some( true ) ),
>   A( None )
> ) ).toDF()
> a.show
> val b = sc.parallelize( Seq(
>   B( Some( false ) ),
>   B( Some( true ) ),
>   B( None )
> ) ).toDF()
> b.show
> a.registerTempTable( "a" )
> b.registerTempTable( "b" )
> sqlContext.sql( "select * from a FULL JOIN b ON ( 
> outgoing_0<=>outgoing_1)" ).show()
> 
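For reference, a minimal sketch of the same null-safe join written against the DataFrame API, using the {{a}} and {{b}} DataFrames from the snippet above; with correct <=> semantics the only matching pairs should be (true, true), (false, false) and (null, null):

{code}
// Sketch only: <=> (null-safe equality) should treat null = null as a match
// and should never match false with null.
val joined = a.join(b, a("outgoing_0") <=> b("outgoing_1"), "fullouter")
joined.show()
// expected rows: (true, true), (false, false), (null, null)
{code}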



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14582) Increase the parallelism for small tables

2016-04-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14582.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Increase the parallelism for small tables
> -
>
> Key: SPARK-14582
> URL: https://issues.apache.org/jira/browse/SPARK-14582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14757) Incorrect behavior of Join operation in Spark SQL JOIN : "false" in the left table is joined to "null" on the right table

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14757:


Assignee: Apache Spark

> Incorrect behavior of Join operation in Spark SQL JOIN : "false" in the left 
> table is joined to "null" on the right table
> -
>
> Key: SPARK-14757
> URL: https://issues.apache.org/jira/browse/SPARK-14757
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hong Huang
>Assignee: Apache Spark
>
> Content of table a:
> |outgoing_0|
> | false |
> |  true |
> |  null  |
> a has only one field: outgoing_0 
> Content of table b:
> |outgoing_1|
> | false  |
> |  true  |
> |  null   |
> b has only one field: outgoing_1
> After running this query:
> select * from a FULL JOIN b ON ( outgoing_0<=>outgoing_1)
> I got the following result:
> |outgoing_0|outgoing_1|
> |  true  |  true  |
> | false  | false |
> | false  |  null  |
> |  null   |  null  |
> The row with "false" as outgoing_0 and "null" as outgoing_1 is unexpected. 
> The operator <=> should match null with null. 
> While the left table's "false" is matched with the right table's "null", it is 
> also strange that the right table's "false" is not matched with the left 
> table's "null" (there is no row with "null" as outgoing_0 and "false" as outgoing_1).
> You can easily reproduce this bug by pasting the following code fragment:
> case class A( outgoing_0: Option[Boolean] )
> case class B( outgoing_1: Option[Boolean] )
> val a = sc.parallelize( Seq(
>   A( Some( false ) ),
>   A( Some( true ) ),
>   A( None )
> ) ).toDF()
> a.show
> val b = sc.parallelize( Seq(
>   B( Some( false ) ),
>   B( Some( true ) ),
>   B( None )
> ) ).toDF()
> b.show
> a.registerTempTable( "a" )
> b.registerTempTable( "b" )
> sqlContext.sql( "select * from a FULL JOIN b ON ( 
> outgoing_0<=>outgoing_1)" ).show()
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14757) Incorrect behavior of Join operation in Spark SQL JOIN : "false" in the left table is joined to "null" on the right table

2016-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254939#comment-15254939
 ] 

Apache Spark commented on SPARK-14757:
--

User 'arashn' has created a pull request for this issue:
https://github.com/apache/spark/pull/12628

> Incorrect behavior of Join operation in Spark SQL JOIN : "false" in the left 
> table is joined to "null" on the right table
> -
>
> Key: SPARK-14757
> URL: https://issues.apache.org/jira/browse/SPARK-14757
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hong Huang
>
> Content of table a:
> |outgoing_0|
> | false |
> |  true |
> |  null  |
> a has only one field: outgoing_0 
> Content of table b:
> |outgoing_1|
> | false  |
> |  true  |
> |  null   |
> b has only one field: outgoing_1
> After running this query:
> select * from a FULL JOIN b ON ( outgoing_0<=>outgoing_1)
> I got the following result:
> |outgoing_0|outgoing_1|
> |  true  |  true  |
> | false  | false |
> | false  |  null  |
> |  null   |  null  |
> The row with "false" as outgoing_0 and "null" as outgoing_1 is unexpected. 
> The operator <=> should match null with null. 
> While the left table's "false" is matched with the right table's "null", it is 
> also strange that the right table's "false" is not matched with the left 
> table's "null" (there is no row with "null" as outgoing_0 and "false" as outgoing_1).
> You can easily reproduce this bug by pasting the following code fragment:
> case class A( outgoing_0: Option[Boolean] )
> case class B( outgoing_1: Option[Boolean] )
> val a = sc.parallelize( Seq(
>   A( Some( false ) ),
>   A( Some( true ) ),
>   A( None )
> ) ).toDF()
> a.show
> val b = sc.parallelize( Seq(
>   B( Some( false ) ),
>   B( Some( true ) ),
>   B( None )
> ) ).toDF()
> b.show
> a.registerTempTable( "a" )
> b.registerTempTable( "b" )
> sqlContext.sql( "select * from a FULL JOIN b ON ( 
> outgoing_0<=>outgoing_1)" ).show()
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14757) Incorrect behavior of Join operation in Spark SQL JOIN : "false" in the left table is joined to "null" on the right table

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14757:


Assignee: (was: Apache Spark)

> Incorrect behavior of Join operation in Spark SQL JOIN : "false" in the left 
> table is joined to "null" on the right table
> -
>
> Key: SPARK-14757
> URL: https://issues.apache.org/jira/browse/SPARK-14757
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hong Huang
>
> Content of table a:
> |outgoing_0|
> | false |
> |  true |
> |  null  |
> a has only one field: outgoing_0 
> Content of table b:
> |outgoing_1|
> | false  |
> |  true  |
> |  null   |
> b has only one field: outgoing_1
> After running this query:
> select * from a FULL JOIN b ON ( outgoing_0<=>outgoing_1)
> I got the following result:
> |outgoing_0|outgoing_1|
> |  true  |  true  |
> | false  | false |
> | false  |  null  |
> |  null   |  null  |
> The row with "false" as outgoing_0 and "null" as outgoing_1 is unexpected. 
> The operator <=> should match null with null. 
> While the left table's "false" is matched with the right table's "null", it is 
> also strange that the right table's "false" is not matched with the left 
> table's "null" (there is no row with "null" as outgoing_0 and "false" as outgoing_1).
> You can easily reproduce this bug by pasting the following code fragment:
> case class A( outgoing_0: Option[Boolean] )
> case class B( outgoing_1: Option[Boolean] )
> val a = sc.parallelize( Seq(
>   A( Some( false ) ),
>   A( Some( true ) ),
>   A( None )
> ) ).toDF()
> a.show
> val b = sc.parallelize( Seq(
>   B( Some( false ) ),
>   B( Some( true ) ),
>   B( None )
> ) ).toDF()
> b.show
> a.registerTempTable( "a" )
> b.registerTempTable( "b" )
> sqlContext.sql( "select * from a FULL JOIN b ON ( 
> outgoing_0<=>outgoing_1)" ).show()
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14864) [MLLIB] Implement Doc2Vec

2016-04-22 Thread Peter Mountanos (JIRA)
Peter Mountanos created SPARK-14864:
---

 Summary: [MLLIB] Implement Doc2Vec
 Key: SPARK-14864
 URL: https://issues.apache.org/jira/browse/SPARK-14864
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Peter Mountanos
Priority: Minor


It would be useful to implement Doc2Vec, as described in the paper [Distributed 
Representations of Sentences and 
Documents|https://cs.stanford.edu/~quocle/paragraph_vector.pdf]. Gensim has an 
implementation [Deep learning with 
paragraph2vec|https://radimrehurek.com/gensim/models/doc2vec.html]. 

Le & Mikolov show that aggregating Word2Vec vector representations for a 
paragraph/document does not perform well on prediction tasks. Instead, 
they propose the Paragraph Vector approach, which provides 
state-of-the-art results on several text classification and sentiment analysis 
tasks.
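For context, a rough sketch of the averaging baseline the paper argues against, using the existing spark.mllib Word2Vec; the file path and tokenization here are assumptions, and a real Doc2Vec implementation would learn the document vectors jointly rather than averaging word vectors:

{code}
import scala.util.Try
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.mllib.linalg.Vectors

// Sketch only (assumed input file "docs.txt", one document per line).
val docs = sc.textFile("docs.txt").map(_.toLowerCase.split("\\s+").toSeq)
val model = new Word2Vec().fit(docs)
val vectorSize = model.getVectors.head._2.length

// Naive document vector: the average of the document's word vectors.
val docVectors = docs.map { words =>
  // model.transform throws for out-of-vocabulary words, so drop them.
  val vecs = words.flatMap(w => Try(model.transform(w)).toOption).map(_.toArray)
  if (vecs.isEmpty) Vectors.zeros(vectorSize)
  else Vectors.dense(
    vecs.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / vecs.size))
}
{code}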



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14615) Use the new ML Vector and Matrix in the ML pipeline based algorithms

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14615:


Assignee: DB Tsai  (was: Apache Spark)

> Use the new ML Vector and Matrix in the ML pipeline based algorithms 
> -
>
> Key: SPARK-14615
> URL: https://issues.apache.org/jira/browse/SPARK-14615
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, ML
>Reporter: DB Tsai
>Assignee: DB Tsai
>
> Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new 
> vector and matrix type in the new ml pipeline based apis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14615) Use the new ML Vector and Matrix in the ML pipeline based algorithms

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14615:


Assignee: Apache Spark  (was: DB Tsai)

> Use the new ML Vector and Matrix in the ML pipeline based algorithms 
> -
>
> Key: SPARK-14615
> URL: https://issues.apache.org/jira/browse/SPARK-14615
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, ML
>Reporter: DB Tsai
>Assignee: Apache Spark
>
> Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new 
> vector and matrix type in the new ml pipeline based apis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14615) Use the new ML Vector and Matrix in the ML pipeline based algorithms

2016-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254901#comment-15254901
 ] 

Apache Spark commented on SPARK-14615:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/12627

> Use the new ML Vector and Matrix in the ML pipeline based algorithms 
> -
>
> Key: SPARK-14615
> URL: https://issues.apache.org/jira/browse/SPARK-14615
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, ML
>Reporter: DB Tsai
>Assignee: DB Tsai
>
> Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new 
> vector and matrix type in the new ml pipeline based apis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14863) Cache TreeNode's hashCode

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14863:


Assignee: Josh Rosen  (was: Apache Spark)

> Cache TreeNode's hashCode
> -
>
> Key: SPARK-14863
> URL: https://issues.apache.org/jira/browse/SPARK-14863
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Caching TreeNode's hashCode can lead to orders-of-magnitude performance 
> improvement in certain optimizer rules when operating on data with 
> huge/complex schemas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14863) Cache TreeNode's hashCode

2016-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254878#comment-15254878
 ] 

Apache Spark commented on SPARK-14863:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/12626

> Cache TreeNode's hashCode
> -
>
> Key: SPARK-14863
> URL: https://issues.apache.org/jira/browse/SPARK-14863
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Caching TreeNode's hashCode can lead to orders-of-magnitude performance 
> improvement in certain optimizer rules when operating on data with 
> huge/complex schemas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14863) Cache TreeNode's hashCode

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14863:


Assignee: Apache Spark  (was: Josh Rosen)

> Cache TreeNode's hashCode
> -
>
> Key: SPARK-14863
> URL: https://issues.apache.org/jira/browse/SPARK-14863
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Caching TreeNode's hashCode can lead to orders-of-magnitude performance 
> improvement in certain optimizer rules when operating on data with 
> huge/complex schemas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14863) Cache TreeNode's hashCode

2016-04-22 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-14863:
--

 Summary: Cache TreeNode's hashCode
 Key: SPARK-14863
 URL: https://issues.apache.org/jira/browse/SPARK-14863
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


Caching TreeNode's hashCode can lead to orders-of-magnitude performance 
improvement in certain optimizer rules when operating on data with huge/complex 
schemas.
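As a rough illustration of the idea (a simplified sketch, not the actual TreeNode patch):

{code}
// Sketch only: cache the recursive hash so it is computed once per node.
abstract class TreeNodeSketch {
  def children: Seq[TreeNodeSketch]
  protected def localHash: Int   // hash of this node's own fields

  // lazy val: the expensive recursive computation runs on first access only.
  private lazy val cachedHashCode: Int =
    children.foldLeft(localHash)((h, c) => 31 * h + c.hashCode)

  override def hashCode(): Int = cachedHashCode
}
{code}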



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14862) Tree and ensemble classification: do not require label metadata

2016-04-22 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14862:
-

 Summary: Tree and ensemble classification: do not require label 
metadata
 Key: SPARK-14862
 URL: https://issues.apache.org/jira/browse/SPARK-14862
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley


spark.ml DecisionTreeClassifier, RandomForestClassifier, and GBTClassifier 
require that the labelCol have metadata specifying the number of classes.  
Instead, if the number of classes is not specified, we should automatically 
scan the column to identify numClasses.

Note: This could cause problems with very small datasets + cross validation if 
there are k classes but class index k-1 does not appear in the training data.  
We should make sure the error thrown helps the user understand the solution, 
which is probably to use StringIndexer to index the whole dataset's labelCol 
before doing cross validation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14701) checkpointWriter is stopped before eventLoop. Hence rejectedExecution exception is coming in StreamingContext.stop

2016-04-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-14701.
--
   Resolution: Fixed
 Assignee: Liwei Lin
Fix Version/s: 2.0.0

> checkpointWriter is stopped before eventLoop. Hence rejectedExecution 
> exception is coming in StreamingContext.stop
> --
>
> Key: SPARK-14701
> URL: https://issues.apache.org/jira/browse/SPARK-14701
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1, 1.6.1
> Environment: Windows, local[*] mode as well as  Redhat Linux , Yarn 
> Cluster
>Reporter: Sreelal S L
>Assignee: Liwei Lin
>Priority: Minor
> Fix For: 2.0.0
>
>
> In org.apache.spark.streaming.scheduler.JobGenerator.stop() , the 
> checkpointWriter.stop is called before eventLoop.stop.
> If I call streamingContext.stop when a batch is about to complete (I'm 
> invoking it from a StreamingListener.onBatchCompleted callback), a 
> rejectedException may get thrown from checkPointWriter.executor, since the 
> eventLoop will try to process DoCheckpoint events even after the 
> checkPointWriter.executor was stopped.
> 16/04/18 19:22:10 ERROR CheckpointWriter: Could not submit checkpoint task to 
> the thread pool executor
> java.util.concurrent.RejectedExecutionException: Task 
> org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler@76e12f8 
> rejected from java.util.concurrent.ThreadPoolExecutor@4b9f5b97[Terminated, 
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 49]
>   at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
>   at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
>   at 
> org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:253)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:294)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:184)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:87)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:86)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> I think the order of the stopping should be changed. 
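A minimal sketch of the ordering change the reporter suggests (simplified; the real JobGenerator.stop() also handles graceful shutdown and timeouts):

{code}
// Stop the event loop first so no further DoCheckpoint events are processed,
// then shut down the checkpoint writer's executor.
eventLoop.stop()
checkpointWriter.stop()
{code}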



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14861) Replace internal usages of SQLContext with SparkSession

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14861:


Assignee: Apache Spark  (was: Andrew Or)

> Replace internal usages of SQLContext with SparkSession
> ---
>
> Key: SPARK-14861
> URL: https://issues.apache.org/jira/browse/SPARK-14861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> We should try to use SparkSession (the new thing) in as many places as 
> possible. We should be careful not to break the public datasource API though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14861) Replace internal usages of SQLContext with SparkSession

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14861:


Assignee: Andrew Or  (was: Apache Spark)

> Replace internal usages of SQLContext with SparkSession
> ---
>
> Key: SPARK-14861
> URL: https://issues.apache.org/jira/browse/SPARK-14861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> We should try to use SparkSession (the new thing) in as many places as 
> possible. We should be careful not to break the public datasource API though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14861) Replace internal usages of SQLContext with SparkSession

2016-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254828#comment-15254828
 ] 

Apache Spark commented on SPARK-14861:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/12625

> Replace internal usages of SQLContext with SparkSession
> ---
>
> Key: SPARK-14861
> URL: https://issues.apache.org/jira/browse/SPARK-14861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> We should try to use SparkSession (the new thing) in as many places as 
> possible. We should be careful not to break the public datasource API though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14861) Replace internal usages of SQLContext with SparkSession

2016-04-22 Thread Andrew Or (JIRA)
Andrew Or created SPARK-14861:
-

 Summary: Replace internal usages of SQLContext with SparkSession
 Key: SPARK-14861
 URL: https://issues.apache.org/jira/browse/SPARK-14861
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or


We should try to use SparkSession (the new thing) in as many places as 
possible. We should be careful not to break the public datasource API though.
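For illustration, a minimal sketch of what the new entry point looks like (assuming the 2.0 SparkSession builder API; this is not the actual patch, which touches internal call sites):

{code}
import org.apache.spark.sql.SparkSession

// Sketch only: code that previously threaded a SQLContext around would use a
// SparkSession instead; the DataFrame/Dataset APIs stay the same.
val spark = SparkSession.builder()
  .appName("example")
  .getOrCreate()

val df = spark.read.json("people.json")
{code}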



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14852) Update GeneralizedLinearRegressionSummary API

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14852:


Assignee: Apache Spark  (was: Joseph K. Bradley)

> Update GeneralizedLinearRegressionSummary API
> -
>
> Key: SPARK-14852
> URL: https://issues.apache.org/jira/browse/SPARK-14852
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> See parent issue for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14852) Update GeneralizedLinearRegressionSummary API

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14852:


Assignee: Joseph K. Bradley  (was: Apache Spark)

> Update GeneralizedLinearRegressionSummary API
> -
>
> Key: SPARK-14852
> URL: https://issues.apache.org/jira/browse/SPARK-14852
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> See parent issue for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14852) Update GeneralizedLinearRegressionSummary API

2016-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254813#comment-15254813
 ] 

Apache Spark commented on SPARK-14852:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/12624

> Update GeneralizedLinearRegressionSummary API
> -
>
> Key: SPARK-14852
> URL: https://issues.apache.org/jira/browse/SPARK-14852
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> See parent issue for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14860) Fix flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite "event ordering"

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14860:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite "event ordering"
> 
>
> Key: SPARK-14860
> URL: https://issues.apache.org/jira/browse/SPARK-14860
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>  Labels: flaky-test
>
> See 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/593/testReport/junit/org.apache.spark.sql.util/ContinuousQueryListenerSuite/event_ordering/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14860) Fix flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite "event ordering"

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14860:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite "event ordering"
> 
>
> Key: SPARK-14860
> URL: https://issues.apache.org/jira/browse/SPARK-14860
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>  Labels: flaky-test
>
> See 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/593/testReport/junit/org.apache.spark.sql.util/ContinuousQueryListenerSuite/event_ordering/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14860) Fix flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite "event ordering"

2016-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254808#comment-15254808
 ] 

Apache Spark commented on SPARK-14860:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/12623

> Fix flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite "event ordering"
> 
>
> Key: SPARK-14860
> URL: https://issues.apache.org/jira/browse/SPARK-14860
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>  Labels: flaky-test
>
> See 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/593/testReport/junit/org.apache.spark.sql.util/ContinuousQueryListenerSuite/event_ordering/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14860) Fix flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite "event ordering"

2016-04-22 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-14860:


 Summary: Fix flaky test: 
o.a.s.sql.util.ContinuousQueryListenerSuite "event ordering"
 Key: SPARK-14860
 URL: https://issues.apache.org/jira/browse/SPARK-14860
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.0.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


See 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/593/testReport/junit/org.apache.spark.sql.util/ContinuousQueryListenerSuite/event_ordering/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11057) SQL: corr and cov for many columns

2016-04-22 Thread Sajjad Bey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254778#comment-15254778
 ] 

Sajjad Bey commented on SPARK-11057:


I have been trying to use correlation on a matrix with many columns. @NarineK 
mentioned R-like correlation. I wish we had something like what 
[pandas](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html)
 offers. It handles missing data automatically. Take a look 
[here](http://stackoverflow.com/questions/31619578/numpy-corrcoef-compute-correlation-matrix-while-ignoring-missing-data).
 Even the 
[corr()](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics)
 function from MLlib cannot handle missing data. These features are really 
missing from Spark SQL:

- Apply correlation on all columns and return a matrix

- Handle missing data automatically like how [pandas 
](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html)does

> SQL: corr and cov for many columns
> --
>
> Key: SPARK-11057
> URL: https://issues.apache.org/jira/browse/SPARK-11057
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Narine Kokhlikyan
>
> Hi there,
> As we know R has the option to calculate the correlation and covariance for 
> all columns of a dataframe or between columns of two dataframes.
> If we look at apache math package we can see that, they have that too. 
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> In case we have as input only one DataFrame:
> --
> for correlation:
> cor[i,j] = cor[j,i]
> and for the main diagonal we can have 1s.
> -
> for covariance: 
> cov[i,j] = cov[j,i]
> and for main diagonal: we can compute the variance for that specific column:
> See:
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> Let me know what you think.
> I'm working on this and will make a pull request soon.
> Thanks,
> Narine
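For comparison, a minimal sketch of what is available today via spark.mllib for an all-columns Pearson correlation matrix; the DataFrame {{df}} with purely numeric, non-null columns is an assumption, and this path does not handle missing data, which is part of what this issue asks for:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Sketch only: pack each row of an all-numeric DataFrame into a Vector and
// compute the symmetric correlation matrix (1.0 on the diagonal).
val rows = df.rdd.map(r => Vectors.dense(r.toSeq.map(_.toString.toDouble).toArray))
val corrMatrix = Statistics.corr(rows, "pearson")
{code}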



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14594) Improve error messages for RDD API

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14594:


Assignee: Apache Spark

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
>Assignee: Apache Spark
>
> When you have an error in your R code using the RDD API, you always get as 
> error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful and I think it might be better to catch the R 
> exception and show it instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14594) Improve error messages for RDD API

2016-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254775#comment-15254775
 ] 

Apache Spark commented on SPARK-14594:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/12622

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
>
> When you have an error in your R code using the RDD API, you always get as 
> error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful and I think it might be better to catch the R 
> exception and show it instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14594) Improve error messages for RDD API

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14594:


Assignee: (was: Apache Spark)

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
>
> When you have an error in your R code using the RDD API, you always get as 
> error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful and I think it might be better to catch the R 
> exception and show it instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12148:


Assignee: Apache Spark

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Michael Lawrence
>Assignee: Apache Spark
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12148:


Assignee: (was: Apache Spark)

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Michael Lawrence
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2016-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254762#comment-15254762
 ] 

Apache Spark commented on SPARK-12148:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/12621

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Michael Lawrence
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14858) Push predicates with subquery

2016-04-22 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254759#comment-15254759
 ] 

Herman van Hovell commented on SPARK-14858:
---

Sure, I'll take a stab at them; it would be great to get these in before the code freeze.

> Push predicates with subquery 
> --
>
> Key: SPARK-14858
> URL: https://issues.apache.org/jira/browse/SPARK-14858
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> Currently we rewrite the subquery as a Join at the beginning of the Optimizer; we 
> should defer that to enable predicate pushdown (because a Join can't be 
> easily pushed down).
> cc [~hvanhovell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14797) Spark SQL should not hardcode dependency on spark-sketch_2.11

2016-04-22 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-14797:
--
Comment: was deleted

(was: [~joshrosen], I can no longer build using maven after this commit. I'm 
getting an error from the new enforcer rules:

{code}
[INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-versions) @ 
spark-parent_2.11 ---
[WARNING] Rule 0: org.apache.maven.plugins.enforcer.BannedDependencies failed 
with message:
Found Banned Dependency: org.scala-lang.modules:scala-xml_2.11:jar:1.0.2
Found Banned Dependency: org.scalatest:scalatest_2.11:jar:2.2.6
Use 'mvn dependency:tree' to locate the source of the banned dependencies.
{code}

The version set in the root POM for scalatest_2.11 is 2.2.6, so I think that 
there's a problem with the rule.)

> Spark SQL should not hardcode dependency on spark-sketch_2.11
> -
>
> Key: SPARK-14797
> URL: https://issues.apache.org/jira/browse/SPARK-14797
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> Spark SQL's POM hardcodes a dependency on spark-sketch_2.11, but it should 
> substitute scala.binary.version instead of hardcoding 2.11. We should fix 
> this and add a Maven Enforcer rule to prevent this from ever happening again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14797) Spark SQL should not hardcode dependency on spark-sketch_2.11

2016-04-22 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254746#comment-15254746
 ] 

Ryan Blue commented on SPARK-14797:
---

[~joshrosen], I can no longer build using maven after this commit. I'm getting 
an error from the new enforcer rules:

{code}
[INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-versions) @ 
spark-parent_2.11 ---
[WARNING] Rule 0: org.apache.maven.plugins.enforcer.BannedDependencies failed 
with message:
Found Banned Dependency: org.scala-lang.modules:scala-xml_2.11:jar:1.0.2
Found Banned Dependency: org.scalatest:scalatest_2.11:jar:2.2.6
Use 'mvn dependency:tree' to locate the source of the banned dependencies.
{code}

The version set in the root POM for scalatest_2.11 is 2.2.6, so I think that 
there's a problem with the rule.

> Spark SQL should not hardcode dependency on spark-sketch_2.11
> -
>
> Key: SPARK-14797
> URL: https://issues.apache.org/jira/browse/SPARK-14797
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> Spark SQL's POM hardcodes a dependency on spark-sketch_2.11, but it should 
> substitute scala.binary.version instead of hardcoding 2.11. We should fix 
> this and add a Maven Enforcer rule to prevent this from ever happening again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14858) Push predicates with subquery

2016-04-22 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254744#comment-15254744
 ] 

Davies Liu commented on SPARK-14858:


I created a few JIRAs related to subqueries; it would be great if you could work on 
them. You can adjust the priority as you want.

> Push predicates with subquery 
> --
>
> Key: SPARK-14858
> URL: https://issues.apache.org/jira/browse/SPARK-14858
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> Currently we rewrite the subquery as a Join at the beginning of the Optimizer; we 
> should defer that to enable predicate pushdown (because a Join can't be 
> easily pushed down).
> cc [~hvanhovell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12543) Support subquery in select/where/having

2016-04-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12543.

   Resolution: Fixed
 Assignee: Herman van Hovell  (was: Davies Liu)
Fix Version/s: 2.0.0

> Support subquery in select/where/having
> ---
>
> Key: SPARK-12543
> URL: https://issues.apache.org/jira/browse/SPARK-12543
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14785) Support correlated scalar subquery

2016-04-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14785:
---
Description: 
For example:
{code}
SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id)
{code}
it could be rewritten as 

{code}
SELECT a FROM t JOIN (SELECT id, AVG(c) FROM t2 GROUP by id) t3 ON t3.id = t.id
{code}

TPCDS Q92, Q81, Q6 required this

  was:
For example:

SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id)

it could be rewritten as 

{code}
SELECT a FROM t JOIN (SELECT id, AVG(c) FROM t2 GROUP by id) t3 ON t3.id = t.id
{code}

TPCDS Q92, Q81, Q6 required this


> Support correlated scalar subquery
> --
>
> Key: SPARK-14785
> URL: https://issues.apache.org/jira/browse/SPARK-14785
> Project: Spark
>  Issue Type: New Feature
>Reporter: Davies Liu
>
> For example:
> {code}
> SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id)
> {code}
> it could be rewritten as 
> {code}
> SELECT a FROM t JOIN (SELECT id, AVG(c) FROM t2 GROUP by id) t3 ON t3.id = 
> t.id
> {code}
> TPCDS Q92, Q81, Q6 required this



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14785) Support correlated scalar subquery

2016-04-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14785:
---
Description: 
For example:

SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id)

it could be rewritten as 

{code}
SELECT a FROM t JOIN (SELECT id, AVG(c) FROM t2 GROUP by id) t3 ON t3.id = t.id
{code}

TPCDS Q92, Q81, Q6 required this

  was:
For example:

SELECT a from t where b > (select c from t2 where t.id = t2.id)

TPCDS Q92, Q81, Q6 required this


> Support correlated scalar subquery
> --
>
> Key: SPARK-14785
> URL: https://issues.apache.org/jira/browse/SPARK-14785
> Project: Spark
>  Issue Type: New Feature
>Reporter: Davies Liu
>
> For example:
> SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id)
> it could be rewritten as 
> {code}
> SELECT a FROM t JOIN (SELECT id, AVG(c) FROM t2 GROUP by id) t3 ON t3.id = 
> t.id
> {code}
> TPCDS Q92, Q81, Q6 required this



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14781) Support subquery in nested predicates

2016-04-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14781:
---
Assignee: (was: Davies Liu)

> Support subquery in nested predicates
> -
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, we do not support nested IN/EXISTS subqueries, for example 
> EXISTS(x1) OR EXISTS(x2).
> In order to do that, we could use an internal-only join type, SemiPlus, which 
> would output every row from the left side, plus an additional column with the 
> result of the join condition. Then we could replace the EXISTS() or IN() with that result column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14785) Support correlated scalar subquery

2016-04-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14785:
---
Assignee: (was: Davies Liu)

> Support correlated scalar subquery
> --
>
> Key: SPARK-14785
> URL: https://issues.apache.org/jira/browse/SPARK-14785
> Project: Spark
>  Issue Type: New Feature
>Reporter: Davies Liu
>
> For example:
> SELECT a from t where b > (select c from t2 where t.id = t2.id)
> TPCDS Q92, Q81, Q6 required this



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14773) Enable the tests in HiveCompatibilitySuite for subquery

2016-04-22 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254709#comment-15254709
 ] 

Herman van Hovell commented on SPARK-14773:
---

Sure

> Enable the tests in HiveCompatibilitySuite for subquery
> ---
>
> Key: SPARK-14773
> URL: https://issues.apache.org/jira/browse/SPARK-14773
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Herman van Hovell
>
> There are a few test cases in HiveCompatibilitySuite for subqueries; we should 
> enable them to get better coverage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14858) Push predicates with subquery

2016-04-22 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254708#comment-15254708
 ] 

Herman van Hovell commented on SPARK-14858:
---

You want me to take this one?

> Push predicates with subquery 
> --
>
> Key: SPARK-14858
> URL: https://issues.apache.org/jira/browse/SPARK-14858
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> Currently we rewrite the subquery as a Join at the beginning of the Optimizer; we 
> should defer that to enable predicate pushdown (because a Join can't be 
> easily pushed down).
> cc [~hvanhovell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14796) Add spark.sql.optimizer.inSetConversionThreshold config option

2016-04-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14796.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> Add spark.sql.optimizer.inSetConversionThreshold config option
> --
>
> Key: SPARK-14796
> URL: https://issues.apache.org/jira/browse/SPARK-14796
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> Currently, the `OptimizeIn` optimizer rule replaces an `In` expression with an `InSet` 
> expression if the size of the set is greater than a constant, 10. 
> This issue aims to add a configuration, 
> `spark.sql.optimizer.inSetConversionThreshold`, for that threshold.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-22 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254680#comment-15254680
 ] 

Seth Hendrickson commented on SPARK-14489:
--

This is an interesting idea. I would say that under the current framework for 
stratified sampling, there is not a performant way to guarantee each split 
contains every user at least once (even if we filter out users with < k * n 
items). In naive stratified sampling, you would simply generate a random key 
for each user, and sort the entire dataset, taking even splits amongst each 
user. I am not sure if that is an acceptable option given how expensive a sort 
over the entire dataset would be. Using ScaSRS might actually be worse in this 
case, if the waitlist is close to the size of the requested sample, since the 
waitlists are collected on the driver. I am not sure what options open up if we 
don't require even splits but only that each split contains every user; 
there might be something to that.

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
> Environment: AWS EMR
>Reporter: Boris Clémençon 
>  Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user in the validation set is missing from the training set, 
> hence generating a few NaN estimates from the transform method and NaN 
> RegressionEvaluator metrics too. 
> Suggestion to fix the bug: remove the NaN values while computing the rmse or 
> other metrics (i.e., remove users or items in the validation set that are missing 
> from the training set). Log a warning when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
>     // TODO: duplicate evaluator to take extra params from input
>     val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>     logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>     metrics(i) += metric
>     i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}
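
A minimal sketch of the workaround suggested above: drop rows whose prediction is NaN before handing the predictions to RegressionEvaluator (treating the "prediction" column name as an assumption):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, isnan}

// Remove rows that ALS could not score (e.g. users or items that appear in the
// validation fold but not in the training fold) so that rmse/mse/r2/mae are
// computed over real predictions only.
def dropNaNPredictions(predictions: DataFrame, predictionCol: String = "prediction"): DataFrame =
  predictions.filter(!isnan(col(predictionCol)))
{code}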



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14859) [PYSPARK] Make Lambda Serializer Configurable

2016-04-22 Thread Nick White (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254665#comment-15254665
 ] 

Nick White commented on SPARK-14859:


I've got a PR for this here: https://github.com/apache/spark/pull/12620

> [PYSPARK] Make Lambda Serializer Configurable
> -
>
> Key: SPARK-14859
> URL: https://issues.apache.org/jira/browse/SPARK-14859
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Nick White
>
> Currently lambdas (e.g. used in RDD.map) are serialized by a hardcoded 
> reference to the CloudPickleSerializer. The serializer should be 
> configurable, as these lambdas may contain complex objects that need custom 
> serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14859) [PYSPARK] Make Lambda Serializer Configurable

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14859:


Assignee: Apache Spark

> [PYSPARK] Make Lambda Serializer Configurable
> -
>
> Key: SPARK-14859
> URL: https://issues.apache.org/jira/browse/SPARK-14859
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Nick White
>Assignee: Apache Spark
>
> Currently lambdas (e.g. used in RDD.map) are serialized by a hardcoded 
> reference to the CloudPickleSerializer. The serializer should be 
> configurable, as these lambdas may contain complex objects that need custom 
> serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14859) [PYSPARK] Make Lambda Serializer Configurable

2016-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254666#comment-15254666
 ] 

Apache Spark commented on SPARK-14859:
--

User 'njwhite' has created a pull request for this issue:
https://github.com/apache/spark/pull/12620

> [PYSPARK] Make Lambda Serializer Configurable
> -
>
> Key: SPARK-14859
> URL: https://issues.apache.org/jira/browse/SPARK-14859
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Nick White
>
> Currently lambdas (e.g. used in RDD.map) are serialized by a hardcoded 
> reference to the CloudPickleSerializer. The serializer should be 
> configurable, as these lambdas may contain complex objects that need custom 
> serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14859) [PYSPARK] Make Lambda Serializer Configurable

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14859:


Assignee: (was: Apache Spark)

> [PYSPARK] Make Lambda Serializer Configurable
> -
>
> Key: SPARK-14859
> URL: https://issues.apache.org/jira/browse/SPARK-14859
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Nick White
>
> Currently lambdas (e.g. used in RDD.map) are serialized by a hardcoded 
> reference to the CloudPickleSerializer. The serializer should be 
> configurable, as these lambdas may contain complex objects that need custom 
> serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14859) [PYSPARK] Make Lambda Serializer Configurable

2016-04-22 Thread Nick White (JIRA)
Nick White created SPARK-14859:
--

 Summary: [PYSPARK] Make Lambda Serializer Configurable
 Key: SPARK-14859
 URL: https://issues.apache.org/jira/browse/SPARK-14859
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Nick White


Currently lambdas (e.g. used in RDD.map) are serialized by a hardcoded 
reference to the CloudPickleSerializer. The serializer should be configurable, 
as these lambdas may contain complex objects that need custom serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254641#comment-15254641
 ] 

Felix Cheung edited comment on SPARK-14831 at 4/22/16 8:45 PM:
---

2. +1 read.spark.model and write.spark.model might be more consistent with the 
existing R convention for read.* and write.*



was (Author: felixcheung):
2. +1 read.spark.model and write.spark.model might be more consistent with the 
existing R convention.


> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, the formula doesn't mean anything without 
> being associated with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes in the last minute. But I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254641#comment-15254641
 ] 

Felix Cheung commented on SPARK-14831:
--

2. +1 read.spark.model and write.spark.model might be more consistent with the 
existing R convention.


> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, the formula doesn't mean anything without 
> being associated with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes in the last minute. But I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14834) Force adding doc for new api in pyspark with @since annotation

2016-04-22 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254639#comment-15254639
 ] 

Reynold Xin commented on SPARK-14834:
-

What is this ticket about?


> Force adding doc for new api in pyspark with @since annotation
> --
>
> Key: SPARK-14834
> URL: https://issues.apache.org/jira/browse/SPARK-14834
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14347) Require Java 8 for Spark 2.x

2016-04-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-14347.
---
Resolution: Later

> Require Java 8 for Spark 2.x
> 
>
> Key: SPARK-14347
> URL: https://issues.apache.org/jira/browse/SPARK-14347
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, Spark Core, SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>
> Putting this down as a JIRA to advance the discussion -- I think this is far 
> enough along toward consensus for that.
> The change here is to require Java 8. This means:
> - Require Java 8 in the build
> - Only build and test with Java 8, removing other older Jenkins configs
> - Remove MaxPermSize
> - Remove reflection to use Java 8-only methods
> - Move external/java8-tests to core/streaming and remove profile
> And optionally:
> - Update all Java 8 code to take advantage of 8+ features, like lambdas, for 
> simplification



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14858) Push predicates with subquery

2016-04-22 Thread Davies Liu (JIRA)
Davies Liu created SPARK-14858:
--

 Summary: Push predicates with subquery 
 Key: SPARK-14858
 URL: https://issues.apache.org/jira/browse/SPARK-14858
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


Currently we rewrite the subquery as a Join at the beginning of the Optimizer; 
we should defer that to enable predicate pushdown (because a Join can't be easily 
pushed down).

cc [~hvanhovell]
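
A hedged illustration of a query where deferring the subquery-to-join rewrite matters; the tables, columns, and temp views below are made up so the example is self-contained:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("subquery-pushdown").getOrCreate()
import spark.implicits._

// Hypothetical tables used only for this sketch.
Seq(("2016-04-20", 1), ("2016-04-21", 2)).toDF("day", "value").createOrReplaceTempView("events")
Seq(("2016-04-21", true)).toDF("day", "flagged").createOrReplaceTempView("interesting_days")

// If the IN-subquery is rewritten into a join at the very start of the optimizer,
// the resulting join can no longer be pushed below the scan of "events";
// deferring the rewrite keeps the predicate-pushdown opportunity open.
spark.sql(
  """SELECT * FROM events
    |WHERE day IN (SELECT day FROM interesting_days WHERE flagged = true)""".stripMargin).explain(true)
{code}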



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14594) Improve error messages for RDD API

2016-04-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254631#comment-15254631
 ] 

Felix Cheung commented on SPARK-14594:
--

I see - it is likely then that the JVM process died, possibly from running out of memory?

{code}
  returnStatus <- readInt(conn)
  if (returnStatus != 0) {
    stop(readString(conn))
  }
  readObject(conn)
{code}

It is possible that readInt does not return a valid status in that case. I'll 
look into this.

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
>
> When you have an error in your R code using the RDD API, you always get the 
> error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful and I think it might be better to catch the R 
> exception and show it instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14846) Driver process fails to terminate when graceful shutdown is used

2016-04-22 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254628#comment-15254628
 ] 

Shixiong Zhu commented on SPARK-14846:
--

`awaitTermination` doesn't need to wait at least one hour. 
`jobExecutor.shutdown()` is called before `jobExecutor.awaitTermination`. So 
when all threads in `jobExecutor` are done, awaitTermination will return.
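
A small standalone sketch (plain java.util.concurrent, not Spark code) of the semantics described above: once shutdown() has been called, awaitTermination returns as soon as the queued work finishes, not after the full timeout:

{code}
import java.util.concurrent.{Executors, TimeUnit}

val jobExecutor = Executors.newFixedThreadPool(1)
jobExecutor.submit(new Runnable { def run(): Unit = Thread.sleep(2000) })

jobExecutor.shutdown()
// Returns roughly 2 seconds later, when the submitted task completes,
// even though the timeout passed in is one hour.
val terminated = jobExecutor.awaitTermination(1, TimeUnit.HOURS)
println(s"terminated = $terminated")
{code}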

> Driver process fails to terminate when graceful shutdown is used
> 
>
> Key: SPARK-14846
> URL: https://issues.apache.org/jira/browse/SPARK-14846
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: Mattias Aspholm
>
> During shutdown, the job scheduler in Streaming (JobScheduler.stop) spends 
> some time waiting for all queued work to complete. If graceful shutdown is 
> used, the time is 1 hour; for non-graceful shutdown it's 2 seconds.
> The wait is implemented using the ThreadPoolExecutor.awaitTermination method 
> in java.util.concurrent. The problem is that instead of looping over the 
> method for the desired period of time, the wait period is passed in as the 
> timeout parameter to awaitTermination. 
> The result is that if the termination condition is false the first time, the 
> method will sleep for the timeout period before trying again. In the case of 
> graceful shutdown this means at least an hour's wait before the condition is 
> checked again, even though all work is completed in just a few seconds. The 
> driver process will continue to live during this time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14846) Driver process fails to terminate when graceful shutdown is used

2016-04-22 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254628#comment-15254628
 ] 

Shixiong Zhu edited comment on SPARK-14846 at 4/22/16 8:39 PM:
---

`awaitTermination` doesn't need to wait at least one hour. 
`jobExecutor.shutdown()` is called before `jobExecutor.awaitTermination`. So 
when all tasks in `jobExecutor` are done, awaitTermination will return.


was (Author: zsxwing):
`awaitTermination` doesn't need to wait at least one hour. 
`jobExecutor.shutdown()` is called before `jobExecutor.awaitTermination`. So 
when all threads in `jobExecutor` are done, awaitTermination will return.

> Driver process fails to terminate when graceful shutdown is used
> 
>
> Key: SPARK-14846
> URL: https://issues.apache.org/jira/browse/SPARK-14846
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: Mattias Aspholm
>
> During shutdown, the job scheduler in Streaming (JobScheduler.stop) spends 
> some time waiting for all queued work to complete. If graceful shutdown is 
> used, the time is 1 hour; for non-graceful shutdown it's 2 seconds.
> The wait is implemented using the ThreadPoolExecutor.awaitTermination method 
> in java.util.concurrent. The problem is that instead of looping over the 
> method for the desired period of time, the wait period is passed in as the 
> timeout parameter to awaitTermination. 
> The result is that if the termination condition is false the first time, the 
> method will sleep for the timeout period before trying again. In the case of 
> graceful shutdown this means at least an hour's wait before the condition is 
> checked again, even though all work is completed in just a few seconds. The 
> driver process will continue to live during this time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14785) Support correlated scalar subquery

2016-04-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-14785:
--

Assignee: Davies Liu

> Support correlated scalar subquery
> --
>
> Key: SPARK-14785
> URL: https://issues.apache.org/jira/browse/SPARK-14785
> Project: Spark
>  Issue Type: New Feature
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> For example:
> SELECT a from t where b > (select c from t2 where t.id = t2.id)
> TPCDS Q92, Q81, and Q6 require this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14773) Enable the tests in HiveCompatibilitySuite for subquery

2016-04-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14773:
---
Assignee: Herman van Hovell

> Enable the tests in HiveCompatibilitySuite for subquery
> ---
>
> Key: SPARK-14773
> URL: https://issues.apache.org/jira/browse/SPARK-14773
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Herman van Hovell
>
> There are a few test cases in HiveCompatibilitySuite for subqueries; we should 
> enable them to get better coverage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14773) Enable the tests in HiveCompatibilitySuite for subquery

2016-04-22 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254622#comment-15254622
 ] 

Davies Liu commented on SPARK-14773:


[~hvanhovell] Could you take this?

> Enable the tests in HiveCompatibilitySuite for subquery
> ---
>
> Key: SPARK-14773
> URL: https://issues.apache.org/jira/browse/SPARK-14773
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Herman van Hovell
>
> There are a few test cases in HiveCompatibilitySuite for subqueries; we should 
> enable them to get better coverage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14856) Returning batch unexpected from wide table

2016-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14856:


Assignee: Davies Liu  (was: Apache Spark)

> Returning batch unexpected from wide table
> --
>
> Key: SPARK-14856
> URL: https://issues.apache.org/jira/browse/SPARK-14856
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> When the required schema supports batched reads but the full schema does not, 
> the parquet reader may return batches unexpectedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14856) Returning batch unexpected from wide table

2016-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254592#comment-15254592
 ] 

Apache Spark commented on SPARK-14856:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/12619

> Returning batch unexpected from wide table
> --
>
> Key: SPARK-14856
> URL: https://issues.apache.org/jira/browse/SPARK-14856
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> When the required schema supports batched reads but the full schema does not, 
> the parquet reader may return batches unexpectedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


