[jira] [Commented] (SPARK-20334) Return a better error message when correlated predicates contain aggregate expression that has mixture of outer and local references

2017-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968650#comment-15968650
 ] 

Apache Spark commented on SPARK-20334:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/17636

> Return a better error message when correlated predicates contain aggregate 
> expression that has mixture of outer and local references
> 
>
> Key: SPARK-20334
> URL: https://issues.apache.org/jira/browse/SPARK-20334
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Dilip Biswal
>Priority: Minor
>
> Currently, subqueries with correlated predicates containing an aggregate 
> expression that mixes outer references and local references generate a codegen 
> error like the following:
> {code:java}
> Cannot evaluate expression: min((input[0, int, false] + input[4, int, false]))
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.nullSafeCodeGen(Expression.scala:461)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.defineCodeGen(Expression.scala:443)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryComparison.doGenCode(predicates.scala:431)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
> {code}
> We should catch this situation and return a better error message to the user.
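For context, the kind of query that hits this path looks roughly like the following. 
This is a hypothetical sketch for spark-shell; the table and column names are invented 
and are not taken from the report. The aggregate {{min(t1.c1 + t2.c2)}} inside the 
correlated predicate mixes an outer reference ({{t1.c1}}) with a local reference 
({{t2.c2}}).

{code:scala}
// Hypothetical reproduction sketch (invented table/column names).
spark.range(10).selectExpr("id AS c1", "id AS c2").createOrReplaceTempView("t1")
spark.range(10).selectExpr("id AS c1", "id AS c2").createOrReplaceTempView("t2")

// min(t1.c1 + t2.c2) in the correlated predicate mixes an outer reference (t1.c1)
// with a local reference (t2.c2), the situation described above.
spark.sql("""
  SELECT t1.c1
  FROM t1
  GROUP BY t1.c1
  HAVING EXISTS (SELECT 1 FROM t2 WHERE t2.c2 > min(t1.c1 + t2.c2))
""").show()
{code}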



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20334) Return a better error message when correlated predicates contain aggregate expression that has mixture of outer and local references

2017-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20334:


Assignee: Apache Spark

> Return a better error message when correlated predicates contain aggregate 
> expression that has mixture of outer and local references
> 
>
> Key: SPARK-20334
> URL: https://issues.apache.org/jira/browse/SPARK-20334
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Dilip Biswal
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, subqueries with correlated predicates containing an aggregate 
> expression that mixes outer references and local references generate a codegen 
> error like the following:
> {code:java}
> Cannot evaluate expression: min((input[0, int, false] + input[4, int, false]))
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.nullSafeCodeGen(Expression.scala:461)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.defineCodeGen(Expression.scala:443)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryComparison.doGenCode(predicates.scala:431)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
> {code}
> We should catch this situation and return a better error message to the user.






[jira] [Assigned] (SPARK-20334) Return a better error message when correlated predicates contain aggregate expression that has mixture of outer and local references

2017-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20334:


Assignee: (was: Apache Spark)

> Return a better error message when correlated predicates contain aggregate 
> expression that has mixture of outer and local references
> 
>
> Key: SPARK-20334
> URL: https://issues.apache.org/jira/browse/SPARK-20334
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Dilip Biswal
>Priority: Minor
>
> Currently, subqueries with correlated predicates containing an aggregate 
> expression that mixes outer references and local references generate a codegen 
> error like the following:
> {code:java}
> Cannot evaluate expression: min((input[0, int, false] + input[4, int, false]))
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.nullSafeCodeGen(Expression.scala:461)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.defineCodeGen(Expression.scala:443)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryComparison.doGenCode(predicates.scala:431)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
> {code}
> We should catch this situation and return a better error message to the user.






[jira] [Assigned] (SPARK-20335) Children expressions of Hive UDF impact the determinism of Hive UDF

2017-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20335:


Assignee: Apache Spark  (was: Xiao Li)

> Children expressions of Hive UDF impact the determinism of Hive UDF
> 
>
> Key: SPARK-20335
> URL: https://issues.apache.org/jira/browse/SPARK-20335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> {noformat}
>   /**
>* Certain optimizations should not be applied if UDF is not deterministic.
>* Deterministic UDF returns same result each time it is invoked with a
>* particular input. This determinism just needs to hold within the context 
> of
>* a query.
>*
>* @return true if the UDF is deterministic
>*/
>   boolean deterministic() default true;
> {noformat}
> Based on the definition of UDFType, when a Hive UDF's children are 
> non-deterministic, the Hive UDF is also non-deterministic.
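In Catalyst terms, the point is that the expression wrapping a Hive UDF should only 
report itself as deterministic when both the UDF's annotation and all of its child 
expressions are deterministic. A minimal sketch of that rule (the helper below is 
hypothetical, not the actual Spark code; {{udfDeterministic}} stands in for the value 
read from the {{@UDFType}} annotation):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.Expression

// Sketch only: deterministic iff the Hive annotation says so AND every child
// expression is itself deterministic.
def hiveUdfIsDeterministic(udfDeterministic: Boolean, children: Seq[Expression]): Boolean =
  udfDeterministic && children.forall(_.deterministic)
{code}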






[jira] [Assigned] (SPARK-20335) Children expressions of Hive UDF impact the determinism of Hive UDF

2017-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20335:


Assignee: Xiao Li  (was: Apache Spark)

> Children expressions of Hive UDF impact the determinism of Hive UDF
> 
>
> Key: SPARK-20335
> URL: https://issues.apache.org/jira/browse/SPARK-20335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> {noformat}
>   /**
>* Certain optimizations should not be applied if UDF is not deterministic.
>* Deterministic UDF returns same result each time it is invoked with a
>* particular input. This determinism just needs to hold within the context 
> of
>* a query.
>*
>* @return true if the UDF is deterministic
>*/
>   boolean deterministic() default true;
> {noformat}
> Based on the definition of UDFType, when a Hive UDF's children are 
> non-deterministic, the Hive UDF is also non-deterministic.






[jira] [Commented] (SPARK-20335) Children expressions of Hive UDF impact the determinism of Hive UDF

2017-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968639#comment-15968639
 ] 

Apache Spark commented on SPARK-20335:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17635

> Children expressions of Hive UDF impact the determinism of Hive UDF
> 
>
> Key: SPARK-20335
> URL: https://issues.apache.org/jira/browse/SPARK-20335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> {noformat}
>   /**
>* Certain optimizations should not be applied if UDF is not deterministic.
>* Deterministic UDF returns same result each time it is invoked with a
>* particular input. This determinism just needs to hold within the context 
> of
>* a query.
>*
>* @return true if the UDF is deterministic
>*/
>   boolean deterministic() default true;
> {noformat}
> Based on the definition of UDFType, when a Hive UDF's children are 
> non-deterministic, the Hive UDF is also non-deterministic.






[jira] [Created] (SPARK-20335) Children expressions of Hive UDF impact the determinism of Hive UDF

2017-04-13 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20335:
---

 Summary: Children expressions of Hive UDF impact the determinism 
of Hive UDF
 Key: SPARK-20335
 URL: https://issues.apache.org/jira/browse/SPARK-20335
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2
Reporter: Xiao Li
Assignee: Xiao Li


{noformat}
  /**
   * Certain optimizations should not be applied if UDF is not deterministic.
   * Deterministic UDF returns same result each time it is invoked with a
   * particular input. This determinism just needs to hold within the context of
   * a query.
   *
   * @return true if the UDF is deterministic
   */
  boolean deterministic() default true;
{noformat}

Based on the definition of UDFType, when a Hive UDF's children are 
non-deterministic, the Hive UDF is also non-deterministic.







[jira] [Created] (SPARK-20334) Return a better error message when correlated predicates contain aggregate expression that has mixture of outer and local references

2017-04-13 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-20334:


 Summary: Return a better error message when correlated predicates 
contain aggregate expression that has mixture of outer and local references
 Key: SPARK-20334
 URL: https://issues.apache.org/jira/browse/SPARK-20334
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.1.0
Reporter: Dilip Biswal
Priority: Minor


Currently, subqueries with correlated predicates containing an aggregate expression 
that mixes outer references and local references generate a codegen error like the 
following:

{code:java}
Cannot evaluate expression: min((input[0, int, false] + input[4, int, false]))
at 
org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226)
at 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.nullSafeCodeGen(Expression.scala:461)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.defineCodeGen(Expression.scala:443)
at 
org.apache.spark.sql.catalyst.expressions.BinaryComparison.doGenCode(predicates.scala:431)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
{code}

We should catch this situation and return a better error message to the user.






[jira] [Updated] (SPARK-20333) Fix HashPartitioner in DAGSchedulerSuite

2017-04-13 Thread jin xing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jin xing updated SPARK-20333:
-
Description: 
In the tests 
"don't submit stage until its dependencies map outputs are registered (SPARK-5259)", 
"run trivial shuffle with out-of-band executor failure and retry", and 
"reduce tasks should be placed locally with map output", 
the HashPartitioner should be compatible with the number of partitions of the child 
RDD. 

  was:In test "don't submit stage until its dependencies map outputs are 
registered (SPARK-5259)", HashPartitioner should be compatible with num of 
child RDD's partitions. 


> Fix HashPartitioner in DAGSchedulerSuite
> 
>
> Key: SPARK-20333
> URL: https://issues.apache.org/jira/browse/SPARK-20333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: jin xing
>Priority: Minor
>
> In the tests 
> "don't submit stage until its dependencies map outputs are registered (SPARK-5259)", 
> "run trivial shuffle with out-of-band executor failure and retry", and 
> "reduce tasks should be placed locally with map output", 
> the HashPartitioner should be compatible with the number of partitions of the 
> child RDD. 
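To make the constraint concrete, the shape of the requirement is roughly the 
following (a schematic sketch, not the DAGSchedulerSuite code itself):

{code:scala}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Schematic sketch: the partition count passed to the HashPartitioner backing a
// shuffle must equal the partition count of the shuffled (child) RDD, otherwise
// map-output registration and reduce-task locality are checked against the wrong
// number of partitions.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sketch"))
val numReducePartitions = 2
val shuffled = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
  .partitionBy(new HashPartitioner(numReducePartitions))
assert(shuffled.partitions.length == numReducePartitions)
sc.stop()
{code}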






[jira] [Assigned] (SPARK-20333) Fix HashPartitioner in DAGSchedulerSuite

2017-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20333:


Assignee: Apache Spark

> Fix HashPartitioner in DAGSchedulerSuite
> 
>
> Key: SPARK-20333
> URL: https://issues.apache.org/jira/browse/SPARK-20333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: Apache Spark
>Priority: Minor
>
> In the test "don't submit stage until its dependencies map outputs are registered 
> (SPARK-5259)", the HashPartitioner should be compatible with the number of 
> partitions of the child RDD. 






[jira] [Assigned] (SPARK-20333) Fix HashPartitioner in DAGSchedulerSuite

2017-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20333:


Assignee: (was: Apache Spark)

> Fix HashPartitioner in DAGSchedulerSuite
> 
>
> Key: SPARK-20333
> URL: https://issues.apache.org/jira/browse/SPARK-20333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: jin xing
>Priority: Minor
>
> In the test "don't submit stage until its dependencies map outputs are registered 
> (SPARK-5259)", the HashPartitioner should be compatible with the number of 
> partitions of the child RDD. 






[jira] [Commented] (SPARK-20333) Fix HashPartitioner in DAGSchedulerSuite

2017-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968576#comment-15968576
 ] 

Apache Spark commented on SPARK-20333:
--

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/17634

> Fix HashPartitioner in DAGSchedulerSuite
> 
>
> Key: SPARK-20333
> URL: https://issues.apache.org/jira/browse/SPARK-20333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: jin xing
>Priority: Minor
>
> In the test "don't submit stage until its dependencies map outputs are registered 
> (SPARK-5259)", the HashPartitioner should be compatible with the number of 
> partitions of the child RDD. 






[jira] [Updated] (SPARK-20333) Fix HashPartitioner in DAGSchedulerSuite

2017-04-13 Thread jin xing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jin xing updated SPARK-20333:
-
Description: In the test "don't submit stage until its dependencies map outputs 
are registered (SPARK-5259)", the HashPartitioner should be compatible with the 
number of partitions of the child RDD. 

> Fix HashPartitioner in DAGSchedulerSuite
> 
>
> Key: SPARK-20333
> URL: https://issues.apache.org/jira/browse/SPARK-20333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: jin xing
>Priority: Minor
>
> In the test "don't submit stage until its dependencies map outputs are registered 
> (SPARK-5259)", the HashPartitioner should be compatible with the number of 
> partitions of the child RDD. 






[jira] [Created] (SPARK-20333) Fix HashPartitioner in DAGSchedulerSuite

2017-04-13 Thread jin xing (JIRA)
jin xing created SPARK-20333:


 Summary: Fix HashPartitioner in DAGSchedulerSuite
 Key: SPARK-20333
 URL: https://issues.apache.org/jira/browse/SPARK-20333
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: jin xing
Priority: Minor









[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos

2017-04-13 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968545#comment-15968545
 ] 

Saisai Shao commented on SPARK-16742:
-

[~mgummelt], do you have a design doc for the Kerberos support for Spark on 
Mesos, so that my work on SPARK-19143 can be based on yours?

> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa






[jira] [Created] (SPARK-20332) Avro/Parquet GenericFixed decimal is not read into Spark correctly

2017-04-13 Thread Justin Pihony (JIRA)
Justin Pihony created SPARK-20332:
-

 Summary: Avro/Parquet GenericFixed decimal is not read into Spark 
correctly
 Key: SPARK-20332
 URL: https://issues.apache.org/jira/browse/SPARK-20332
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Justin Pihony
Priority: Minor


Take the following code, run in a spark-shell started with 
{{spark-shell --packages org.apache.avro:avro:1.8.1}}:

{code:scala}
import java.math.BigDecimal

import org.apache.avro.{Conversions, LogicalTypes, Schema}

val dc = new Conversions.DecimalConversion()
val javaBD = BigDecimal.valueOf(643.85924958)
// A record schema with a single nullable fixed(19) decimal(17, 8) column.
val schema = Schema.parse(
  "{\"type\":\"record\",\"name\":\"Header\",\"namespace\":\"org.apache.avro.file\",\"fields\":[" +
  "{\"name\":\"COLUMN\",\"type\":[\"null\",{\"type\":\"fixed\",\"name\":\"COLUMN\"," +
  "\"size\":19,\"precision\":17,\"scale\":8,\"logicalType\":\"decimal\"}]}]}")
val schemaDec = schema.getField("COLUMN").schema()
// Unwrap the nullable union to get the fixed decimal schema itself.
val fieldSchema =
  if (schemaDec.getType() == Schema.Type.UNION) schemaDec.getTypes.get(1) else schemaDec
val converted = dc.toFixed(javaBD, fieldSchema,
  LogicalTypes.decimal(javaBD.precision, javaBD.scale))
sqlContext.createDataFrame(List(("value", converted)))
{code}

and you'll get this error:

{noformat}
java.lang.UnsupportedOperationException: Schema for type
org.apache.avro.generic.GenericFixed is not supported
{noformat}

However, if you write out a Parquet file using the AvroParquetWriter with the
above GenericFixed value ({{converted}}) and then read it back in via the
DataFrameReader, the decimal value that is retrieved is not accurate (i.e.
643... above is read back as -0.5...).

Even if this is not supported, is there any way to at least have it throw an
UnsupportedOperationException, as it does when you try to create the DataFrame
directly, instead of silently returning a wrong value when reading from a file?
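The write/read path described above looks roughly like the following. This is a sketch 
of the reproduction rather than the reporter's exact code; it continues from the 
snippet above, assumes {{parquet-avro}} is on the classpath, and uses a made-up output 
path.

{code:scala}
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// Build a record holding the GenericFixed decimal and write it with parquet-avro.
val record: GenericRecord = new GenericData.Record(schema)
record.put("COLUMN", converted)

val writer = new AvroParquetWriter[GenericRecord](new Path("/tmp/spark-20332.parquet"), schema)
writer.write(record)
writer.close()

// Per the report, the decimal read back here is inaccurate rather than erroring out.
sqlContext.read.parquet("/tmp/spark-20332.parquet").show()
{code}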






[jira] [Resolved] (SPARK-20293) In the 'jobs' or 'stages' page of the history server web UI, clicking the 'Go' button to query paging data causes a page error

2017-04-13 Thread Alex Bozarth (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Bozarth resolved SPARK-20293.
--
Resolution: Duplicate

Already fixed in master and branch-2.1

> In the 'jobs' or 'stages' page of the history server web UI, clicking the 'Go' 
> button to query paging data causes a page error
> -
>
> Key: SPARK-20293
> URL: https://issues.apache.org/jira/browse/SPARK-20293
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
> Attachments: error1.png, error2.png, jobs.png, stages.png
>
>
> On the 'jobs' or 'stages' page of the history server web UI, clicking the 'Go' 
> button to query a page of data produces a page error, so the paging function 
> cannot be used.
> The reason is as follows:
> The '#' character is escaped by the browser as %23, so a URL ending in 
> &completedStage.desc=true#completed reaches the server as ...desc=true%23completed 
> and the desc parameter value becomes "true#completed", causing the page to report 
> an error:
> HTTP ERROR 400
> Problem accessing /history/app-20170411132432-0004/stages/. Reason:
>  For input string: "true # completed"
> Powered by Jetty://
> The proposed fix is as follows:
> Escape the generated URL so that it is not escaped again by the browser.
> Please see the attachments.
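The intent of the fix, expressed as a sketch rather than the actual UI code (the 
helper below is hypothetical), is to build the paging URL so that parameter values 
are encoded explicitly and the fragment stays outside the query string, leaving 
nothing for the browser to re-escape:

{code:scala}
import java.net.URLEncoder

// Hypothetical sketch: encode each query-parameter value on its own and append
// the fragment after the query string, so '#' never ends up as %23 inside a
// parameter value.
def pageUrl(base: String, params: Map[String, String], fragment: String): String = {
  val query = params
    .map { case (k, v) => s"${URLEncoder.encode(k, "UTF-8")}=${URLEncoder.encode(v, "UTF-8")}" }
    .mkString("&")
  s"$base?$query#$fragment"
}

// Yields /history/app-20170411132432-0004/stages/?completedStage.desc=true#completed
pageUrl("/history/app-20170411132432-0004/stages/", Map("completedStage.desc" -> "true"), "completed")
{code}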






[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968474#comment-15968474
 ] 

Marcelo Vanzin commented on SPARK-20328:


bq. Since the driver is authenticated, it can request further delegation tokens

No. To create a delegation token you need a TGT. You can't create a delegation 
token just with an existing delegation token. If that were possible, all the 
shenanigans to distribute the user's keytab for long running applications 
wouldn't be needed.

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
> principal for use as renewer
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to 
> trick {{TokenCache}} into thinking YARN is configured, but this is obviously 
> suboptimal.
> The proper fix to this would likely require significant {{hadoop}} 
> refactoring to make split information available without going through 
> {{JobConf}}, so I'm not yet sure what the best course of action is.






[jira] [Commented] (SPARK-20293) In the 'jobs' or 'stages' page of the history server web UI, clicking the 'Go' button to query paging data causes a page error

2017-04-13 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968473#comment-15968473
 ] 

Hyukjin Kwon commented on SPARK-20293:
--

I can't reproduce in the current master. See - 
https://github.com/apache/spark/pull/17608#issuecomment-294059200

> In the 'jobs' or 'stages' page of the history server web UI, clicking the 'Go' 
> button to query paging data causes a page error
> -
>
> Key: SPARK-20293
> URL: https://issues.apache.org/jira/browse/SPARK-20293
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
> Attachments: error1.png, error2.png, jobs.png, stages.png
>
>
> On the 'jobs' or 'stages' page of the history server web UI, clicking the 'Go' 
> button to query a page of data produces a page error, so the paging function 
> cannot be used.
> The reason is as follows:
> The '#' character is escaped by the browser as %23, so a URL ending in 
> &completedStage.desc=true#completed reaches the server as ...desc=true%23completed 
> and the desc parameter value becomes "true#completed", causing the page to report 
> an error:
> HTTP ERROR 400
> Problem accessing /history/app-20170411132432-0004/stages/. Reason:
>  For input string: "true # completed"
> Powered by Jetty://
> The proposed fix is as follows:
> Escape the generated URL so that it is not escaped again by the browser.
> Please see the attachments.






[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968469#comment-15968469
 ] 

Michael Gummelt commented on SPARK-20328:
-

bq. I have no idea what that means.

I'm pretty sure a delegation token is just another way for a subject to 
authenticate.  So the driver uses the delegation token provided to it by 
{{spark-submit}} to authenticate.  This is what I mean by "driver is already 
logged in via the delegation token".  Since the driver is authenticated, it can 
request further delegation tokens.  But my point is that it shouldn't need to, 
because that code is not "delegating" the tokens to any other process, which is 
the only thing delegation tokens are needed for.

But this is neither here nor there.  I think I know what I have to do.

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
> principal for use as renewer
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to 
> trick {{TokenCache}} into thinking YARN is configured, but this is obviously 
> suboptimal.
> The proper fix to this would likely require significant {{hadoop}} 
> refactoring to make split information available without going through 
> {{JobConf}}, so I'm not yet sure what the best course of action is.






[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968440#comment-15968440
 ] 

Marcelo Vanzin commented on SPARK-20328:


bq. that the driver is already logged in via the delegation token

I have no idea what that means.

Yes, "spark-submit" has a TGT. It uses it to login to e.g. HDFS and generate 
delegation tokens.

But in this situation, the *driver* has no TGT; it only has the delegation 
token generated by the spark-submit process, which has no communication with 
the driver. So when the driver calls into that code you linked that tries to 
fetch delegation tokens, it should fail. But it doesn't. Which tells me the 
code detects whether there is already a valid delegation token and, in that 
case, doesn't try to create a new one.

So what I'm saying is that if the above is correct, all you have to do is 
create delegation tokens yourself when the Mesos backend initializes (i.e. 
before any HadoopRDD code is run), and you'll avoid the issue with setting the 
configuration options.

That's something you'd have to do anyway, because delegation tokens are needed 
by the executors to talk to the data nodes.
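As a rough illustration of that suggestion, using plain Hadoop APIs (a sketch, not 
actual Spark code; the renewer value is a placeholder):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

// While a TGT is still available, fetch HDFS delegation tokens eagerly and attach
// them to the current UGI, before any HadoopRDD code runs.
val hadoopConf = new Configuration()
val creds = new Credentials()
FileSystem.get(hadoopConf).addDelegationTokens("renewer-placeholder", creds)
UserGroupInformation.getCurrentUser.addCredentials(creds)
{code}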

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
> principal for use as renewer
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to 
> trick {{TokenCache}} into thinking YARN is configured, but this is obviously 
> suboptimal.
> The proper fix to this would likely require significant {{hadoop}} 
> refactoring to make split information available without going through 
> {{JobConf}}, so I'm not yet sure what the best course of action is.






[jira] [Assigned] (SPARK-20331) Broaden support for Hive partition pruning predicate pushdown

2017-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20331:


Assignee: Apache Spark

> Broaden support for Hive partition pruning predicate pushdown
> -
>
> Key: SPARK-20331
> URL: https://issues.apache.org/jira/browse/SPARK-20331
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Michael Allman
>Assignee: Apache Spark
>
> Spark 2.1 introduced scalable support for Hive tables with huge numbers of 
> partitions. Key to leveraging this support is the ability to prune 
> unnecessary table partitions to answer queries. Spark supports a subset of 
> the class of partition pruning predicates that the Hive metastore supports. 
> If a user writes a query with a partition pruning predicate that is *not* 
> supported by Spark, Spark falls back to loading all partitions and pruning 
> client-side. We want to broaden Spark's current partition pruning predicate 
> pushdown capabilities.
> One of the key missing capabilities is support for disjunctions. For example, 
> for a table partitioned by date, specifying with a predicate like
> {code}date = 20161011 or date = 20161014{code}
> will result in Spark fetching all partitions. For a table partitioned by date 
> and hour, querying a range of hours across dates can be quite difficult to 
> accomplish without fetching all partition metadata.
> The current partition pruning support supports only comparisons against 
> literals. We can expand that to foldable expressions by evaluating them at 
> planning time.
> We can also implement support for the "IN" comparison by expanding it to a 
> sequence of "OR"s.
> This ticket covers those enhancements.
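For the "IN" case mentioned above, the rewrite is conceptually a fold over equality 
comparisons; a minimal Catalyst-flavoured sketch (hypothetical helper, not the actual 
implementation):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Attribute, EqualTo, Expression, Literal, Or}

// Sketch only: push an IN over a partition column down as a chain of OR'ed
// equality comparisons. Assumes `values` is non-empty.
def expandInToOrs(partitionCol: Attribute, values: Seq[Literal]): Expression =
  values.map(v => EqualTo(partitionCol, v): Expression).reduce(Or(_, _))
{code}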






[jira] [Assigned] (SPARK-20331) Broaden support for Hive partition pruning predicate pushdown

2017-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20331:


Assignee: (was: Apache Spark)

> Broaden support for Hive partition pruning predicate pushdown
> -
>
> Key: SPARK-20331
> URL: https://issues.apache.org/jira/browse/SPARK-20331
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Michael Allman
>
> Spark 2.1 introduced scalable support for Hive tables with huge numbers of 
> partitions. Key to leveraging this support is the ability to prune 
> unnecessary table partitions to answer queries. Spark supports a subset of 
> the class of partition pruning predicates that the Hive metastore supports. 
> If a user writes a query with a partition pruning predicate that is *not* 
> supported by Spark, Spark falls back to loading all partitions and pruning 
> client-side. We want to broaden Spark's current partition pruning predicate 
> pushdown capabilities.
> One of the key missing capabilities is support for disjunctions. For example, 
> for a table partitioned by date, specifying with a predicate like
> {code}date = 20161011 or date = 20161014{code}
> will result in Spark fetching all partitions. For a table partitioned by date 
> and hour, querying a range of hours across dates can be quite difficult to 
> accomplish without fetching all partition metadata.
> The current partition pruning support supports only comparisons against 
> literals. We can expand that to foldable expressions by evaluating them at 
> planning time.
> We can also implement support for the "IN" comparison by expanding it to a 
> sequence of "OR"s.
> This ticket covers those enhancements.






[jira] [Commented] (SPARK-20331) Broaden support for Hive partition pruning predicate pushdown

2017-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968439#comment-15968439
 ] 

Apache Spark commented on SPARK-20331:
--

User 'mallman' has created a pull request for this issue:
https://github.com/apache/spark/pull/17633

> Broaden support for Hive partition pruning predicate pushdown
> -
>
> Key: SPARK-20331
> URL: https://issues.apache.org/jira/browse/SPARK-20331
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Michael Allman
>
> Spark 2.1 introduced scalable support for Hive tables with huge numbers of 
> partitions. Key to leveraging this support is the ability to prune 
> unnecessary table partitions to answer queries. Spark supports a subset of 
> the class of partition pruning predicates that the Hive metastore supports. 
> If a user writes a query with a partition pruning predicate that is *not* 
> supported by Spark, Spark falls back to loading all partitions and pruning 
> client-side. We want to broaden Spark's current partition pruning predicate 
> pushdown capabilities.
> One of the key missing capabilities is support for disjunctions. For example, 
> for a table partitioned by date, specifying with a predicate like
> {code}date = 20161011 or date = 20161014{code}
> will result in Spark fetching all partitions. For a table partitioned by date 
> and hour, querying a range of hours across dates can be quite difficult to 
> accomplish without fetching all partition metadata.
> The current partition pruning support supports only comparisons against 
> literals. We can expand that to foldable expressions by evaluating them at 
> planning time.
> We can also implement support for the "IN" comparison by expanding it to a 
> sequence of "OR"s.
> This ticket covers those enhancements.






[jira] [Updated] (SPARK-20331) Broaden support for Hive partition pruning predicate pushdown

2017-04-13 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-20331:
---
Description: 
Spark 2.1 introduced scalable support for Hive tables with huge numbers of 
partitions. Key to leveraging this support is the ability to prune unnecessary 
table partitions to answer queries. Spark supports a subset of the class of 
partition pruning predicates that the Hive metastore supports. If a user writes 
a query with a partition pruning predicate that is *not* supported by Spark, 
Spark falls back to loading all partitions and pruning client-side. We want to 
broaden Spark's current partition pruning predicate pushdown capabilities.

One of the key missing capabilities is support for disjunctions. For example, 
for a table partitioned by date, specifying with a predicate like

{code}date = 20161011 or date = 20161014{code}

will result in Spark fetching all partitions. For a table partitioned by date 
and hour, querying a range of hours across dates can be quite difficult to 
accomplish without fetching all partition metadata.

The current partition pruning support supports only comparisons against 
literals. We can expand that to foldable expressions by evaluating them at 
planning time.

We can also implement support for the "IN" comparison by expanding it to a 
sequence of "OR"s.

This ticket covers those enhancements.

  was:
Spark 2.1 introduced scalable support for Hive tables with huge numbers of 
partitions. Key to leveraging this support is the ability to prune unnecessary 
table partitions to answer queries. Spark supports a subset of the class of 
partition pruning predicates that the Hive metastore supports. If a user writes 
a query with a partition pruning predicate that is *not* supported by Spark, 
Spark falls back to loading all partitions and pruning client-side. We want to 
broaden Spark's current partition pruning predicate pushdown capabilities.

One of the key missing capabilities is support for disjunctions. For example, 
for a table partitioned by date, specifying with a predicate like

{code}date = 20161011 or date = 20161014{code}

will result in Spark fetching all partitions. For a table partitioned by date 
and hour, querying a range of hours across dates can be quite difficult to 
accomplish without fetching all partition metadata.

The current partition pruning support supports only comparisons against 
literals. We can expand that to foldable expressions by evaluating them at 
planning time.

We can also implement support for the "IN" comparison by expanding it to a 
sequence of "OR"s.

This ticket covers those enhancements.

A PR will follow.


> Broaden support for Hive partition pruning predicate pushdown
> -
>
> Key: SPARK-20331
> URL: https://issues.apache.org/jira/browse/SPARK-20331
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Michael Allman
>
> Spark 2.1 introduced scalable support for Hive tables with huge numbers of 
> partitions. Key to leveraging this support is the ability to prune 
> unnecessary table partitions to answer queries. Spark supports a subset of 
> the class of partition pruning predicates that the Hive metastore supports. 
> If a user writes a query with a partition pruning predicate that is *not* 
> supported by Spark, Spark falls back to loading all partitions and pruning 
> client-side. We want to broaden Spark's current partition pruning predicate 
> pushdown capabilities.
> One of the key missing capabilities is support for disjunctions. For example, 
> for a table partitioned by date, specifying with a predicate like
> {code}date = 20161011 or date = 20161014{code}
> will result in Spark fetching all partitions. For a table partitioned by date 
> and hour, querying a range of hours across dates can be quite difficult to 
> accomplish without fetching all partition metadata.
> The current partition pruning support supports only comparisons against 
> literals. We can expand that to foldable expressions by evaluating them at 
> planning time.
> We can also implement support for the "IN" comparison by expanding it to a 
> sequence of "OR"s.
> This ticket covers those enhancements.






[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968432#comment-15968432
 ] 

Michael Gummelt commented on SPARK-20328:
-

bq. It depends. e.g. on YARN, when you submit in cluster mode, the driver is 
running in the cluster and all it has are delegation tokens. (The TGT is only 
available to the launcher process.)

Right, but my understanding is that the driver is already logged in via the 
delegation token provided to it by the {{spark-submit}} process (via 
{{amContainer.setTokens}}), so it wouldn't need to then fetch further 
delegation tokens.

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
> principal for use as renewer
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to 
> trick {{TokenCache}} into thinking YARN is configured, but this is obviously 
> suboptimal.
> The proper fix to this would likely require significant {{hadoop}} 
> refactoring to make split information available without going through 
> {{JobConf}}, so I'm not yet sure what the best course of action is.






[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968423#comment-15968423
 ] 

Marcelo Vanzin commented on SPARK-20328:


bq. But it shouldn't need delegation tokens at all, right?

It depends. e.g. on YARN, when you submit in cluster mode, the driver is 
running in the cluster and all it has are delegation tokens. (The TGT is only 
available to the launcher process.)

Actually it would be interesting to understand how that case works internally; 
because if that code is trying to generate delegation tokens, it should 
theoretically fail in the above scenario. So maybe it doesn't generate tokens 
if they're already there, and that could be a workaround for your case too.

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
> principal for use as renewer
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to 
> trick {{TokenCache}} into thinking YARN is configured, but this is obviously 
> suboptimal.
> The proper fix to this would likely require significant {{hadoop}} 
> refactoring to make split information available without going through 
> {{JobConf}}, so I'm not yet sure what the best course of action is.






[jira] [Comment Edited] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968416#comment-15968416
 ] 

Michael Gummelt edited comment on SPARK-20328 at 4/14/17 12:02 AM:
---

bq. It shouldn't need to do it, not for the reasons you mention, but because 
Spark already has the necessary credentials available (either a TGT, or a valid 
delegation token for HDFS).

But it shouldn't need delegation tokens at all, right?  The authentication of 
the currently logged in user, whether it be through the OS or through Kerberos, 
should be sufficient.


was (Author: mgummelt):
bq. It shouldn't need to do it, not for the reasons you mention, but because 
Spark already has the necessary credentials available (either a TGT, or a valid 
delegation token for HDFS).

But it shouldn't need delegation tokens at all, right?

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
> principal for use as renewer
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to 
> trick {{TokenCache}} into thinking YARN is configured, but this is obviously 
> suboptimal.
> The proper fix to this would likely require significant {{hadoop}} 
> refactoring to make split information available without going through 
> {{JobConf}}, so I'm not yet sure what the best course of action is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968411#comment-15968411
 ] 

Michael Gummelt edited comment on SPARK-20328 at 4/13/17 11:59 PM:
---

bq. The Mesos backend (I mean the code in Spark, not the Mesos service) can set 
the configs in the SparkContext's "hadoopConfiguration" object, can't it?

I suppose this would work.  It would rely on the assumption that the Mesos 
scheduler backend is started before the HadoopRDD is created, which happens to 
be true, but ideally we wouldn't have to rely on that ordering.  Right now I'm 
just setting it in {{SparkSubmit}}, but that's not great either.

I filed a Hadoop ticket for the {{FileInputFormat}} issue and linked it here.


was (Author: mgummelt):
> The Mesos backend (I mean the code in Spark, not the Mesos service) can set 
> the configs in the SparkContext's "hadoopConfiguration" object, can't it?

I suppose this would work.  It would rely on the assumption that the Mesos 
scheduler backend is started before the HadoopRDD is created, which happens to 
be true, but ideally we wouldn't have to rely on that ordering.  Right now I'm 
just setting it in {{SparkSubmit}}, but that's not great either.

I filed a Hadoop ticket for the {{FileInputFormat}} issue and linked it here.

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
> principal for use as renewer
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to 
> trick {{TokenCache}} into thinking YARN is configured, but this is obviously 
> suboptimal.
> The proper fix to this would likely require significant {{hadoop}} 
> refactoring to make split information available without going through 
> {{JobConf}}, so I'm not yet sure what the best course of action is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968416#comment-15968416
 ] 

Michael Gummelt edited comment on SPARK-20328 at 4/13/17 11:59 PM:
---

bq. It shouldn't need to do it, not for the reasons you mention, but because 
Spark already has the necessary credentials available (either a TGT, or a valid 
delegation token for HDFS).

But it shouldn't need delegation tokens at all, right?


was (Author: mgummelt):
> It shouldn't need to do it, not for the reasons you mention, but because Spark 
> already has the necessary credentials available (either a TGT, or a valid 
> delegation token for HDFS).

But it shouldn't need delegation tokens at all, right?

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
> principal for use as renewer
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to 
> trick {{TokenCache}} into thinking YARN is configured, but this is obviously 
> suboptimal.
> The proper fix to this would likely require significant {{hadoop}} 
> refactoring to make split information available without going through 
> {{JobConf}}, so I'm not yet sure what the best course of action is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968416#comment-15968416
 ] 

Michael Gummelt commented on SPARK-20328:
-

> It shouldn't need to do it, not for the reasons you mention, but because Spark 
> already has the necessary credentials available (either a TGT, or a valid 
> delegation token for HDFS).

But it shouldn't need delegation tokens at all, right?

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
> principal for use as renewer
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to 
> trick {{TokenCache}} into thinking YARN is configured, but this is obviously 
> suboptimal.
> The proper fix to this would likely require significant {{hadoop}} 
> refactoring to make split information available without going through 
> {{JobConf}}, so I'm not yet sure what the best course of action is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968411#comment-15968411
 ] 

Michael Gummelt commented on SPARK-20328:
-

> The Mesos backend (I mean the code in Spark, not the Mesos service) can set 
> the configs in the SparkContext's "hadoopConfiguration" object, can't it?

I suppose this would work.  It would rely on the assumption that the Mesos 
scheduler backend is started before the HadoopRDD is created, which happens to 
be true, but ideally we wouldn't have to rely on that ordering.  Right now I'm 
just setting it in {{SparkSubmit}}, but that's not great either.

I filed a Hadoop ticket for the {{FileInputFormat}} issue and linked it here.

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
> principal for use as renewer
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to 
> trick {{TokenCache}} into thinking YARN is configured, but this is obviously 
> suboptimal.
> The proper fix to this would likely require significant {{hadoop}} 
> refactoring to make split information available without going through 
> {{JobConf}}, so I'm not yet sure what the best course of action is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968396#comment-15968396
 ] 

Marcelo Vanzin commented on SPARK-20328:


bq. The problem can't be solved in the Mesos backend

I meant setting the configs. The Mesos backend (I mean the code in Spark, not 
the Mesos service) can set the configs in the SparkContext's 
"hadoopConfiguration" object, can't it? Otherwise you'd be putting a burden on 
the user to have a proper Hadoop config around with those properties set.

bq. is why in the world is FileInputFormat fetching delegation tokens

That's actually a good question. It shouldn't need to do it, not for the reasons 
you mention, but because Spark already has the necessary credentials available 
(either a TGT, or a valid delegation token for HDFS).
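
For illustration, a sketch of the "hadoopConfiguration" suggestion (a hypothetical 
helper, not the Mesos backend's actual code; the property name below is an 
assumption about what the MapReduce TokenCache renewer lookup expects, since the 
thread does not name the exact key):

{code}
import org.apache.spark.SparkContext

// Set the renewer-related property on the SparkContext's Hadoop configuration
// before any HadoopRDD computes its partitions. "yarn.resourcemanager.principal"
// is an assumed key, used here only to show where such a setting would live.
def setRenewerConfig(sc: SparkContext, principal: String): Unit = {
  sc.hadoopConfiguration.set("yarn.resourcemanager.principal", principal)
}
{code}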

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
> principal for use as renewer
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to 
> trick {{TokenCache}} into thinking YARN is configured, but this is obviously 
> suboptimal.
> The proper fix to this would likely require significant {{hadoop}} 
> refactoring to make split information available without going through 
> {{JobConf}}, so I'm not yet sure what the best course of action is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968388#comment-15968388
 ] 

Michael Gummelt edited comment on SPARK-20328 at 4/13/17 11:27 PM:
---

Hey [~vanzin], thanks for the response.

Everything you said is correct, but I want to clarify one thing:

> You just need to make the Mesos backend in Spark do that automatically for 
> the submitting user.

The problem can't be solved in the Mesos backend.  When I fetch delegation 
tokens for transmission to Executors in the Mesos backend, there's no problem.  
I can set whatever renewer I want.

The problem is that there's a second location where delegation tokens are 
fetched: {{HadoopRDD}}.  This is entirely separate from the fetching that the 
scheduler backends do (either Mesos or YARN).  {{HadoopRDD}} tries to fetch 
split data, and ultimately calls into {{TokenCache}} in the hadoop library, 
which fetches delegation tokens with the renewer set to the YARN 
ResourceManager's principal: 
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L213.
  I'm currently solving this by setting that config var in {{SparkSubmit}}.

The big question I have, which I suppose is more for the {{hadoop}} team, is 
why in the world is {{FileInputFormat}} fetching delegation tokens?  AFAICT, 
they're not sending those tokens to any other process.  They're just fetching 
split data directly from the Name Nodes, and there should be no delegation 
required.


was (Author: mgummelt):
Hey [~vanzin], thanks for the response.

Everything you said is correct, but I want to clarify one thing:

> You just need to make the Mesos backend in Spark do that automatically for 
> the submitting user.

The problem can't be solved in the Mesos backend.  When I fetch delegation 
tokens for transmission to Executors in the Mesos backend, there's no problem.  
I can set whatever renewer I want.

The problem is that there's a second location where delegation tokens are 
fetched: {{HadoopRDD}}.  This is entirely separate from the fetching that the 
scheduler backends do (either Mesos or YARN).  {{HadoopRDD}} tries to fetch 
split data, and ultimately calls into {{TokenCache}} in the hadoop library, 
which fetches delegation tokens with the renewer set to the YARN 
ResourceManager's principal: 
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L213.
  I'm currently solving this by setting that config var in {{SparkSubmit}}.

The big question I have, which I suppose is more for the {{hadoop}} team, is 
why in the world is {{FileInputFormat}} fetching delegation tokens.  AFAICT, 
they're not sending those tokens to any other process.  They're just fetching 
split data directly from the Name Nodes, and there should be no delegation 
required.

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
> principal for use as renewer
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>  

[jira] [Comment Edited] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968388#comment-15968388
 ] 

Michael Gummelt edited comment on SPARK-20328 at 4/13/17 11:27 PM:
---

Hey [~vanzin], thanks for the response.

Everything you said is correct, but I want to clarify one thing:

> You just need to make the Mesos backend in Spark do that automatically for 
> the submitting user.

The problem can't be solved in the Mesos backend.  When I fetch delegation 
tokens for transmission to Executors in the Mesos backend, there's no problem.  
I can set whatever renewer I want.

The problem is that there's a second location where delegation tokens are 
fetched: {{HadoopRDD}}.  This is entirely separate from the fetching that the 
scheduler backends do (either Mesos or YARN).  {{HadoopRDD}} tries to fetch 
split data, and ultimately calls into {{TokenCache}} in the hadoop library, 
which fetches delegation tokens with the renewer set to the YARN 
ResourceManager's principal: 
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L213.
  I'm currently solving this by setting that config var in {{SparkSubmit}}.

The big question I have, which I suppose is more for the {{hadoop}} team, is 
why in the world is {{FileInputFormat}} fetching delegation tokens.  AFAICT, 
they're not sending those tokens to any other process.  They're just fetching 
split data directly from the Name Nodes, and there should be no delegation 
required.


was (Author: mgummelt):
Hey [~vanzin], thanks for the response.

Everything you said is correct, but I want to clarify one thing:

> You just need to make the Mesos backend in Spark do that automatically for 
> the submitting user.

The problem can't be solved in the Mesos backend.  When I fetch delegation 
tokens for transmission to Executors in the Mesos backend, there's no problem.  
I can set whatever renewer I want.

The problem is that there's a second location where delegation tokens are 
fetched: {{HadoopRDD}}.  This is entirely separate from the fetching that the 
scheduler backends do (either Mesos or YARN).  {{HadoopRDD}} tries to fetch 
split data, and ultimately calls into {{TokenCache}} in the hadoop library, 
which fetches delegation tokens with the renewer set to the YARN 
ResourceManager's principal: 
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L213

The big question I have, which I suppose is more for the {{hadoop}} team, is 
why in the world is {{FileInputFormat}} fetching delegation tokens.  AFAICT, 
they're not sending those tokens to any other process.  They're just fetching 
split data directly from the Name Nodes, and there should be no delegation 
required.

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
> principal for use as renewer
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>   at 
> 

[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968388#comment-15968388
 ] 

Michael Gummelt commented on SPARK-20328:
-

Hey [~vanzin], thanks for the response.

Everything you said is correct, but I want to clarify one thing:

> You just need to make the Mesos backend in Spark do that automatically for 
> the submitting user.

The problem can't be solved in the Mesos backend.  When I fetch delegation 
tokens for transmission to Executors in the Mesos backend, there's no problem.  
I can set whatever renewer I want.

The problem is that there's a second location where delegation tokens are 
fetched: {{HadoopRDD}}.  This is entirely separate from the fetching that the 
scheduler backends do (either Mesos or YARN).  {{HadoopRDD}} tries to fetch 
split data, and ultimately calls into {{TokenCache}} in the hadoop library, 
which fetches delegation tokens with the renewer set to the YARN 
ResourceManager's principal: 
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L213

The big question I have, which I suppose is more for the {{hadoop}} team, is 
why in the world is {{FileInputFormat}} fetching delegation tokens.  AFAICT, 
they're not sending those tokens to any other process.  They're just fetching 
split data directly from the Name Nodes, and there should be no delegation 
required.
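
For reference, a rough sketch of the code path the stack trace in the issue 
corresponds to (a standalone illustration with a placeholder path, not Spark's code 
verbatim): building a {{JobConf}} from a plain {{Configuration}} and asking a mapred 
{{FileInputFormat}} for splits, which goes through 
{{TokenCache.obtainTokensForNamenodes}}:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

// On a kerberized HDFS with no MapReduce/YARN principal configured, getSplits ->
// listStatus -> TokenCache.obtainTokensForNamenodes fails with
// "Can't get Master Kerberos principal for use as renewer".
object SplitsOnly {
  def main(args: Array[String]): Unit = {
    val jobConf = new JobConf(new Configuration())                    // what HadoopRDD does
    FileInputFormat.setInputPaths(jobConf, "hdfs:///tmp/some-input")  // placeholder path
    val fmt = new TextInputFormat()
    fmt.configure(jobConf)
    val splits = fmt.getSplits(jobConf, 2)                            // minSplits is arbitrary
    println(s"got ${splits.length} splits")
  }
}
{code}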

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
> principal for use as renewer
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to 
> trick {{TokenCache}} into thinking YARN is configured, but this is obviously 
> suboptimal.
> The proper fix to this would likely require significant {{hadoop}} 
> refactoring to make split information available without going through 
> {{JobConf}}, so I'm not yet sure what the best course of action is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20330) CLONE - SparkContext.localProperties leaked

2017-04-13 Thread Oleg White (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg White closed SPARK-20330.
--
Resolution: Duplicate

> CLONE - SparkContext.localProperties leaked
> ---
>
> Key: SPARK-20330
> URL: https://issues.apache.org/jira/browse/SPARK-20330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Oleg White
>Priority: Minor
>
> I have a non-deterministic but quite reliable reproduction for a case where 
> {{spark.sql.execution.id}} is leaked. Operations then die with 
> {{spark.sql.execution.id is already set}}. These threads never recover 
> because there is nothing to unset {{spark.sql.execution.id}}. (It's not a 
> case of nested {{withNewExecutionId}} calls.)
> I have figured out why this happens. We are within a {{withNewExecutionId}} 
> block. At some point we call back to user code. (In our case this is a custom 
> data source's {{buildScan}} method.) The user code calls 
> {{scala.concurrent.Await.result}}. Because our thread is a member of a 
> {{ForkJoinPool}} (this is a Play HTTP serving thread) {{Await.result}} starts 
> a new thread. {{SparkContext.localProperties}} is cloned for this thread and 
> then it's ready to serve an HTTP request.
> The first thread then finishes waiting, finishes {{buildScan}}, and leaves 
> {{withNewExecutionId}}, clearing {{spark.sql.execution.id}} in the 
> {{finally}} block. All good. But some time later another HTTP request will be 
> served by the second thread. This thread is "poisoned" with a 
> {{spark.sql.execution.id}}. When it tries to use {{withNewExecutionId}} it 
> fails.
> 
> I don't know who's at fault here. 
>  - I don't like the {{ThreadLocal}} properties anyway. Why not create an 
> Execution object and let it wrap the operation? Then you could have two 
> executions in parallel on the same thread, and other stuff like that. It 
> would be much clearer than storing the execution ID in a kind-of-global 
> variable.
>  - Why do we have to inherit the {{ThreadLocal}} properties? I'm sure there 
> is a good reason, but this is essentially a bug-generator in my view. (It has 
> already generated https://issues.apache.org/jira/browse/SPARK-10563.)
>  - {{Await.result}} --- I never would have thought it starts threads.
>  - We probably shouldn't be calling {{Await.result}} inside {{buildScan}}.
>  - We probably shouldn't call Spark things from HTTP serving threads.
> I'm not sure what could be done on the Spark side, but I thought I should 
> mention this interesting issue. For supporting evidence, here is the stack 
> trace captured when {{localProperties}} is getting cloned. Its contents at that 
> point are:
> {noformat}
> {spark.sql.execution.id=0, spark.rdd.scope.noOverride=true, 
> spark.rdd.scope={"id":"4","name":"ExecutedCommand"}}
> {noformat}
> {noformat}
>   at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:364) 
> [spark-assembly-1.6.0-hadoop2.4.0.jar:1.6.0]
>   at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:362) 
> [spark-assembly-1.6.0-hadoop2.4.0.jar:1.6.0]
>   at java.lang.ThreadLocal$ThreadLocalMap.<init>(ThreadLocal.java:353) 
> [na:1.7.0_91]
>   at java.lang.ThreadLocal$ThreadLocalMap.<init>(ThreadLocal.java:261) 
> [na:1.7.0_91]
>   at java.lang.ThreadLocal.createInheritedMap(ThreadLocal.java:236) 
> [na:1.7.0_91]
>   at java.lang.Thread.init(Thread.java:416) [na:1.7.0_91]
>   at java.lang.Thread.init(Thread.java:349) [na:1.7.0_91]
>   at java.lang.Thread.<init>(Thread.java:508) [na:1.7.0_91]
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.<init>(ForkJoinWorkerThread.java:48)
>  [org.scala-lang.scala-library-2.10.5.jar:na]
>   at 
> scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.<init>(ExecutionContextImpl.scala:42)
>  [org.scala-lang.scala-library-2.10.5.jar:na]
>   at 
> scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory.newThread(ExecutionContextImpl.scala:42)
>  [org.scala-lang.scala-library-2.10.5.jar:na]
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.tryCompensate(ForkJoinPool.java:2341) 
> [org.scala-lang.scala-library-2.10.5.jar:na]
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3638) 
> [org.scala-lang.scala-library-2.10.5.jar:na]
>   at 
> scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.blockOn(ExecutionContextImpl.scala:45)
>  [org.scala-lang.scala-library-2.10.5.jar:na]
>   at scala.concurrent.Await$.result(package.scala:107) 
> [org.scala-lang.scala-library-2.10.5.jar:na] 
>   at 
> 

[jira] [Created] (SPARK-20331) Broaden support for Hive partition pruning predicate pushdown

2017-04-13 Thread Michael Allman (JIRA)
Michael Allman created SPARK-20331:
--

 Summary: Broaden support for Hive partition pruning predicate 
pushdown
 Key: SPARK-20331
 URL: https://issues.apache.org/jira/browse/SPARK-20331
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Michael Allman


Spark 2.1 introduced scalable support for Hive tables with huge numbers of 
partitions. Key to leveraging this support is the ability to prune unnecessary 
table partitions to answer queries. Spark supports a subset of the class of 
partition pruning predicates that the Hive metastore supports. If a user writes 
a query with a partition pruning predicate that is *not* supported by Spark, 
Spark falls back to loading all partitions and pruning client-side. We want to 
broaden Spark's current partition pruning predicate pushdown capabilities.

One of the key missing capabilities is support for disjunctions. For example, 
for a table partitioned by date, specifying a predicate like

{code}date = 20161011 or date = 20161014{code}

will result in Spark fetching all partitions. For a table partitioned by date 
and hour, querying a range of hours across dates can be quite difficult to 
accomplish without fetching all partition metadata.

The current partition pruning pushdown supports only comparisons against 
literals. We can expand that to foldable expressions by evaluating them at 
planning time.

We can also implement support for the "IN" comparison by expanding it to a 
sequence of "OR"s.
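
As a sketch of that "IN" expansion (an illustration of the idea only, not the 
actual rule the PR will add):

{code}
import org.apache.spark.sql.catalyst.expressions.{Attribute, EqualTo, Expression, In, Literal, Or}

// Illustrative rewrite: turn `part IN (v1, v2, ...)` into `part = v1 OR part = v2 OR ...`,
// which is the comparison-against-literal shape the metastore pushdown already handles.
def expandInToOrs(pred: Expression): Expression = pred match {
  case In(attr: Attribute, values) if values.nonEmpty && values.forall(_.isInstanceOf[Literal]) =>
    val equalities: Seq[Expression] = values.map(v => EqualTo(attr, v))
    equalities.reduceLeft(Or(_, _))
  case other => other
}
{code}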

This ticket covers those enhancements.

A PR will follow.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968343#comment-15968343
 ] 

Marcelo Vanzin commented on SPARK-20328:


Hmm... that seems related to delegation token support. Delegation tokens need a 
"renewer", and generally in YARN applications the renewer is the YARN service 
(IIRC), since it takes care of renewing delegation tokens submitted with your 
application (and cancelling them after the application is done).

In your case Mesos doesn't know about Kerberos, so the user submitting the app 
needs to be the renewer; and, aside from this particular issue, you may need to 
add code to actually renew those tokens periodically (which is different from 
creating new tokens after their max lifetime).

I don't think you'll find a different way around this from the one you have 
(setting the YARN configs). You just need to make the Mesos backend in Spark do 
that automatically for the submitting user. As for the Hadoop library, you could 
open a bug asking for an explicit option so that non-MR, non-YARN applications 
can set the renewer more easily.
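
To illustrate the renewer point, a minimal sketch (not the Mesos backend's actual 
code) of a backend fetching HDFS delegation tokens itself and naming the 
submitting user as the renewer:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

// Sketch only: fetch HDFS delegation tokens with the submitting user as renewer.
val hadoopConf = new Configuration()
val creds = new Credentials()
val renewer = UserGroupInformation.getCurrentUser.getShortUserName
FileSystem.get(hadoopConf).addDelegationTokens(renewer, creds)
// `creds` would then be serialized and shipped to the executors by the backend,
// and renewed periodically as noted above.
{code}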

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
> principal for use as renewer
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to 
> trick {{TokenCache}} into thinking YARN is configured, but this is obviously 
> suboptimal.
> The proper fix to this would likely require significant {{hadoop}} 
> refactoring to make split information available without going through 
> {{JobConf}}, so I'm not yet sure what the best course of action is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Michael Gummelt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gummelt updated SPARK-20328:

Description: 
In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138

Semantically, this is a problem because a HadoopRDD does not represent a Hadoop 
MapReduce job.  Practically, this is a problem because this line: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
 results in this MapReduce-specific security code being called: 
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
 which assumes the MapReduce master is configured (e.g. via 
{{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.

So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
the Spark Mesos scheduler:

{code}
Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
principal for use as renewer
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
{code}

I have a workaround where I set a YARN-specific configuration variable to trick 
{{TokenCache}} into thinking YARN is configured, but this is obviously 
suboptimal.

The proper fix to this would likely require significant {{hadoop}} refactoring 
to make split information available without going through {{JobConf}}, so I'm 
not yet sure what the best course of action is.

  was:
In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138

Semantically, this is a problem because a HadoopRDD does not represent a Hadoop 
MapReduce job.  Practically, this is a problem because this line: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
 results in this MapReduce-specific security code being called: 
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
 which assumes the MapReduce master is configured.  If it isn't, an exception 
is thrown.

So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
the Spark Mesos scheduler:

{code}
Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
principal for use as renewer
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
{code}

I have a workaround where I set a YARN-specific configuration variable to trick 
{{TokenCache}} into thinking YARN is configured, but this is obviously 
suboptimal.

The proper fix to this would likely require significant {{hadoop}} refactoring 
to make split information available without going through {{JobConf}}, so I'm 
not yet sure what the best course of action is.


> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> 

[jira] [Created] (SPARK-20330) CLONE - SparkContext.localProperties leaked

2017-04-13 Thread Oleg White (JIRA)
Oleg White created SPARK-20330:
--

 Summary: CLONE - SparkContext.localProperties leaked
 Key: SPARK-20330
 URL: https://issues.apache.org/jira/browse/SPARK-20330
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Oleg White
Priority: Minor


I have a non-deterministic but quite reliable reproduction for a case where 
{{spark.sql.execution.id}} is leaked. Operations then die with 
{{spark.sql.execution.id is already set}}. These threads never recover because 
there is nothing to unset {{spark.sql.execution.id}}. (It's not a case of 
nested {{withNewExecutionId}} calls.)

I have figured out why this happens. We are within a {{withNewExecutionId}} 
block. At some point we call back to user code. (In our case this is a custom 
data source's {{buildScan}} method.) The user code calls 
{{scala.concurrent.Await.result}}. Because our thread is a member of a 
{{ForkJoinPool}} (this is a Play HTTP serving thread), {{Await.result}} starts a 
new thread. {{SparkContext.localProperties}} is cloned for this thread and then 
it's ready to serve an HTTP request.

The first thread then finishes waiting, finishes {{buildScan}}, and leaves 
{{withNewExecutionId}}, clearing {{spark.sql.execution.id}} in the {{finally}} 
block. All good. But some time later another HTTP request will be served by the 
second thread. This thread is "poisoned" with a {{spark.sql.execution.id}}. 
When it tries to use {{withNewExecutionId}} it fails.
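
The cloning step can be demonstrated outside Spark. The sketch below (my 
illustration, using a plain thread instead of the ForkJoinPool compensation thread 
that {{Await.result}} creates) shows that an {{InheritableThreadLocal}} value set in 
the parent is copied into any thread constructed at that moment, and that clearing 
it in the parent afterwards does not touch the child's copy, which is how 
{{SparkContext.localProperties}} behaves:

{code}
// Standalone sketch (no Spark): SparkContext.localProperties is an InheritableThreadLocal,
// so a thread constructed while a value is set gets its own copy, and clearing the value
// in the parent afterwards does not affect the child.
object InheritDemo {
  val executionId = new InheritableThreadLocal[String]

  def main(args: Array[String]): Unit = {
    executionId.set("0")                        // stand-in for spark.sql.execution.id
    val child = new Thread(new Runnable {
      def run(): Unit = {
        Thread.sleep(100)                       // the parent has cleared its value by now
        println(s"child still sees execution id = ${executionId.get}")  // prints 0
      }
    })
    child.start()                               // value was inherited at construction time
    executionId.remove()                        // the parent's "finally" block
    child.join()
  }
}
{code}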



I don't know who's at fault here. 

 - I don't like the {{ThreadLocal}} properties anyway. Why not create an 
Execution object and let it wrap the operation? Then you could have two 
executions in parallel on the same thread, and other stuff like that. It would 
be much clearer than storing the execution ID in a kind-of-global variable.
 - Why do we have to inherit the {{ThreadLocal}} properties? I'm sure there is 
a good reason, but this is essentially a bug-generator in my view. (It has 
already generated https://issues.apache.org/jira/browse/SPARK-10563.)
 - {{Await.result}} --- I never would have thought it starts threads.
 - We probably shouldn't be calling {{Await.result}} inside {{buildScan}}.
 - We probably shouldn't call Spark things from HTTP serving threads.

I'm not sure what could be done on the Spark side, but I thought I should 
mention this interesting issue. For supporting evidence, here is the stack trace 
captured when {{localProperties}} is getting cloned. Its contents at that point are:

{noformat}
{spark.sql.execution.id=0, spark.rdd.scope.noOverride=true, 
spark.rdd.scope={"id":"4","name":"ExecutedCommand"}}
{noformat}

{noformat}
  at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:364) 
[spark-assembly-1.6.0-hadoop2.4.0.jar:1.6.0]
  at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:362) 
[spark-assembly-1.6.0-hadoop2.4.0.jar:1.6.0]
  at java.lang.ThreadLocal$ThreadLocalMap.<init>(ThreadLocal.java:353) 
[na:1.7.0_91]
  at java.lang.ThreadLocal$ThreadLocalMap.<init>(ThreadLocal.java:261) 
[na:1.7.0_91]
  at java.lang.ThreadLocal.createInheritedMap(ThreadLocal.java:236) 
[na:1.7.0_91]
  at java.lang.Thread.init(Thread.java:416) [na:1.7.0_91]
  at java.lang.Thread.init(Thread.java:349) [na:1.7.0_91]
  at java.lang.Thread.<init>(Thread.java:508) [na:1.7.0_91]
  at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.<init>(ForkJoinWorkerThread.java:48)
 [org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.<init>(ExecutionContextImpl.scala:42)
 [org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory.newThread(ExecutionContextImpl.scala:42)
 [org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.forkjoin.ForkJoinPool.tryCompensate(ForkJoinPool.java:2341) 
[org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3638) 
[org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.blockOn(ExecutionContextImpl.scala:45)
 [org.scala-lang.scala-library-2.10.5.jar:na]
  at scala.concurrent.Await$.result(package.scala:107) 
[org.scala-lang.scala-library-2.10.5.jar:na] 
  at 
com.lynxanalytics.biggraph.graph_api.SafeFuture.awaitResult(SafeFuture.scala:50)
 [biggraph.jar]
  at 
com.lynxanalytics.biggraph.graph_api.DataManager.get(DataManager.scala:315) 
[biggraph.jar] 
  at 
com.lynxanalytics.biggraph.graph_api.Scripting$.getData(Scripting.scala:87) 
[biggraph.jar] 
  at 
com.lynxanalytics.biggraph.table.TableRelation$$anonfun$1.apply(DefaultSource.scala:46)
 

[jira] [Created] (SPARK-20329) Resolution error when HAVING clause uses GROUP BY expression that involves implicit type coercion

2017-04-13 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-20329:
--

 Summary: Resolution error when HAVING clause uses GROUP BY 
expression that involves implicit type coercion
 Key: SPARK-20329
 URL: https://issues.apache.org/jira/browse/SPARK-20329
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Josh Rosen
Priority: Blocker


The following example runs without error on Spark 2.0.x and 2.1.x but fails in 
the current Spark master:

{code}
create temporary view foo (a, b) as values (cast(1 as bigint), 2), (cast(3 as 
bigint), 4);

select a + b from foo group by a + b having (a + b) > 1 
{code}

The error is

{code}
Error in SQL statement: AnalysisException: cannot resolve '`a`' given input 
columns: [(a + CAST(b AS BIGINT))]; line 1 pos 45;
'Filter (('a + 'b) > 1)
+- Aggregate [(a#249243L + cast(b#249244 as bigint))], [(a#249243L + 
cast(b#249244 as bigint)) AS (a + CAST(b AS BIGINT))#249246L]
   +- SubqueryAlias foo
  +- Project [col1#249241L AS a#249243L, col2#249242 AS b#249244]
 +- LocalRelation [col1#249241L, col2#249242]
{code}

I think what's happening here is that the implicit cast is breaking things: if 
we change the types so that both columns are integers then the analysis error 
disappears. Similarly, adding explicit casts, as in

{code}
select a + cast(b as bigint) from foo group by a + cast(b as bigint) having (a 
+ cast(b as bigint)) > 1 
{code}

works, so I'm pretty sure the resolution problem is introduced when the casts 
are automatically added by the type coercion rule.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16900) Complete-mode output for file sinks

2017-04-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-16900:
-
Component/s: (was: DStreams)
 Structured Streaming

> Complete-mode output for file sinks
> ---
>
> Key: SPARK-16900
> URL: https://issues.apache.org/jira/browse/SPARK-16900
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Vladimir Feinberg
>
> Currently there is no way to checkpoint aggregations (see SPARK-16899), 
> except by using a custom foreach-based sink, which is pretty difficult and 
> requires that the user deal with ensuring idempotency, versioning, etc.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20312) query optimizer calls udf with null values when it doesn't expect them

2017-04-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20312:
-
Component/s: (was: Spark Core)
 SQL

> query optimizer calls udf with null values when it doesn't expect them
> --
>
> Key: SPARK-20312
> URL: https://issues.apache.org/jira/browse/SPARK-20312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Albert Meltzer
>
> When optimizing an outer join, spark passes an empty row to both sides to see 
> if nulls would be ignored (side comment: for half-outer joins it subsequently 
> ignores the assessment on the dominant side).
> For some reason, a condition such as {{xx IS NOT NULL && udf(xx) IS NOT 
> NULL}} might result in checking the right side first, and an exception if the 
> udf doesn't expect a null input (given the left side first).
> An example is SIMILAR to the following (see actual query plans separately):
> {noformat}
> def func(value: Any): Int = ... // returns AnyVal, which probably causes an IS 
> NOT NULL filter to be added on the result
> val df1 = sparkSession
>   .table(...)
>   .select("col1", "col2") // LongType both
> val df11 = df1
>   .filter(df1("col1").isNotNull)
>   .withColumn("col3", functions.udf(func)(df1("col1"))
>   .repartition(df1("col2"))
>   .sortWithinPartitions(df1("col2"))
> val df2 = ... // load other data containing col2, similarly repartition and 
> sort
> val df3 =
>   df1.join(df2, Seq("col2"), "left_outer")
> {noformat}
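
A possible defensive workaround, sketched below on the assumption that the UDF can simply tolerate the null probe row (the body of the function is hypothetical, and {{df1}} is the DataFrame from the snippet above); whether this avoids the exception in every plan is not guaranteed:

{code}
import org.apache.spark.sql.functions.udf

// Hypothetical null-tolerant variant of the report's `func`: boxed types are used
// so a null input can actually be observed instead of being unboxed to 0.
def nullSafeFunc(value: java.lang.Long): java.lang.Integer =
  if (value == null) null else Integer.valueOf(value.intValue)

val safeUdf = udf(nullSafeFunc _)
val df11 = df1
  .filter(df1("col1").isNotNull)
  .withColumn("col3", safeUdf(df1("col1")))
{code}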



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18127) Add hooks and extension points to Spark

2017-04-13 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-18127:
-

Assignee: Sameer Agarwal  (was: Herman van Hovell)

> Add hooks and extension points to Spark
> ---
>
> Key: SPARK-18127
> URL: https://issues.apache.org/jira/browse/SPARK-18127
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Srinath
>Assignee: Sameer Agarwal
>
> As a Spark user I want to be able to customize my spark session. I currently 
> want to be able to do the following things:
> # I want to be able to add custom analyzer rules. This allows me to implement 
> my own logical constructs; an example of this could be a recursive operator.
> # I want to be able to add my own analysis checks. This allows me to catch 
> problems with spark plans early on. An example of this can be some datasource 
> specific checks.
> # I want to be able to add my own optimizations. This allows me to optimize 
> plans in different ways, for instance when you use a very different cluster 
> (for example a one-node X1 instance). This supersedes the current 
> {{spark.experimental}} methods (see the sketch below).
> # I want to be able to add my own planning strategies. This supersedes the 
> current {{spark.experimental}} methods. This allows me to plan my own 
> physical plan; an example of this would be to plan my own heavily integrated 
> data source (CarbonData for example).
> # I want to be able to use my own customized SQL constructs. An example of 
> this would be supporting my own dialect, or being able to add constructs to the 
> current SQL language. I should not have to implement a complete parser, and 
> should be able to delegate to an underlying parser.
> # I want to be able to track modifications and calls to the external catalog. 
> I want this API to be stable. This allows me to synchronize with other 
> systems.
> This API should modify the SparkSession when the session gets started, and it 
> should NOT change the session in flight.
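
For reference, a minimal sketch of the existing {{spark.experimental}} hooks that items 3 and 4 refer to; the no-op rule here is made up purely for illustration:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A do-nothing optimizer rule standing in for a real custom optimization.
object MyNoopRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.experimental.extraOptimizations = Seq(MyNoopRule)
// spark.experimental.extraStrategies accepts custom planning strategies in the same way.
{code}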



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20321) Spark UI cannot be shutdown in spark streaming app

2017-04-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968290#comment-15968290
 ] 

Shixiong Zhu commented on SPARK-20321:
--

You cannot stop a StreamingContext in foreachRDD. For safety, I would only stop 
the SparkContext/StreamingContext in the main thread, or always launch a new 
thread to stop it. Stopping the SparkContext/StreamingContext from Spark's own 
threads is not supported and can easily cause a deadlock, because stop() usually 
waits until Spark's threads die. 
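
A minimal sketch of the suggested new-thread approach, assuming {{ssc}} is the running StreamingContext:

{code}
// Stop asynchronously from a dedicated thread rather than from a Spark thread.
new Thread("streaming-context-stopper") {
  override def run(): Unit = {
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}.start()
{code}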

> Spark UI cannot be shutdown in spark streaming app
> --
>
> Key: SPARK-20321
> URL: https://issues.apache.org/jira/browse/SPARK-20321
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrei Taleanu
>
> When an exception thrown in the transform stage is handled in foreachRDD and the 
> streaming context is forced to stop, the SparkUI appears to hang and 
> continually dumps the following logs in an infinite loop:
> {noformat}
> ...
> 2017-04-12 14:11:13,470 
> [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG 
> org.spark_project.jetty.io.SelectorManager - Selector loop woken up from 
> select, 0/0 selected
> 2017-04-12 14:11:13,470 
> [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG 
> org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select
> 2017-04-12 14:11:13,470 
> [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG 
> org.spark_project.jetty.io.SelectorManager - Selector loop woken up from 
> select, 0/0 selected
> 2017-04-12 14:11:13,470 
> [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG 
> org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select
> ...
> {noformat}
> Unfortunately I don't have a minimal example that reproduces this issue but 
> here is what I can share:
> {noformat}
> val dstream = pull data from kafka
> val mapped = dstream.transform { rdd =>
>   val data = getData // Perform a call that potentially throws an exception
>   // broadcast the data
>   // flatMap the RDD using the data
> }
> mapped.foreachRDD {
>   try {
> // write some data in a DB
>   } catch {
> case t: Throwable =>
>   dstream.context.stop(stopSparkContext = true, stopGracefully = false)
>   }
> }
> mapped.foreachRDD {
>   try {
> // write data to Kafka
> // manually checkpoint the Kafka offsets (because I need them in JSON 
> format)
>   } catch {
> case t: Throwable =>
>   dstream.context.stop(stopSparkContext = true, stopGracefully = false)
>   }
> }
> {noformat}
> The issue appears when stop is invoked. At the point when SparkUI is stopped, 
> it enters that infinite loop. Initially I thought it relates to Jetty, as the 
> version used in SparkUI had some bugs (e.g. [this 
> one|https://bugs.eclipse.org/bugs/show_bug.cgi?id=452465]). I bumped Jetty to 
> a more recent version (March 2017) and built Spark 2.1.0 with that one but 
> still got the error.
> I encountered this issue with Spark 2.1.0 built with Hadoop 2.6 on top of 
> Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Michael Gummelt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gummelt updated SPARK-20328:

Description: 
In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138

Semantically, this is a problem because a HadoopRDD does not represent a Hadoop 
MapReduce job.  Practically, this is a problem because this line: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
 results in this MapReduce-specific security code being called: 
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
 which assumes the MapReduce master is configured.  If it isn't, an exception 
is thrown.

So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
the Spark Mesos scheduler:

{code}
Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
principal for use as renewer
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
{code}

I have a workaround where I set a YARN-specific configuration variable to trick 
{{TokenCache}} into thinking YARN is configured, but this is obviously 
suboptimal.

The proper fix to this would likely require significant {{hadoop}} refactoring 
to make split information available without going through {{JobConf}}, so I'm 
not yet sure what the best course of action is.

  was:
In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138

Semantically, this is a problem because a HadoopRDD does not represent a Hadoop 
MapReduce job.  Practically, this is a problem because this line: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
 results in this MapReduce-specific security code being called: 
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
 which assumes the MapReduce master is configured.  If it isn't, an exception 
is thrown.

So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
the Spark Mesos scheduler.  I have a workaround where I set a YARN-specific 
configuration variable to trick {{TokenCache}} into thinking YARN is 
configured, but this is obviously suboptimal.

The proper fix to this would likely require significant {{hadoop}} refactoring 
to make split information available without going through {{JobConf}}, so I'm 
not yet sure what the best course of action is.


> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured.  If it isn't, an exception 
> is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> 

[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968232#comment-15968232
 ] 

Michael Gummelt commented on SPARK-20328:
-

cc [~colorant] [~hfeng] [~vanzin]

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured.  If it isn't, an exception 
> is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler.  I have a workaround where I set a YARN-specific 
> configuration variable to trick {{TokenCache}} into thinking YARN is 
> configured, but this is obviously suboptimal.
> The proper fix to this would likely require significant {{hadoop}} 
> refactoring to make split information available without going through 
> {{JobConf}}, so I'm not yet sure what the best course of action is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-13 Thread Michael Gummelt (JIRA)
Michael Gummelt created SPARK-20328:
---

 Summary: HadoopRDDs create a MapReduce JobConf, but are not 
MapReduce jobs
 Key: SPARK-20328
 URL: https://issues.apache.org/jira/browse/SPARK-20328
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0, 2.1.1, 2.1.2
Reporter: Michael Gummelt


In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138

Semantically, this is a problem because a HadoopRDD does not represent a Hadoop 
MapReduce job.  Practically, this is a problem because this line: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
 results in this MapReduce-specific security code being called: 
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
 which assumes the MapReduce master is configured.  If it isn't, an exception 
is thrown.

So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
the Spark Mesos scheduler.  I have a workaround where I set a YARN-specific 
configuration variable to trick {{TokenCache}} into thinking YARN is 
configured, but this is obviously suboptimal.

The proper fix to this would likely require significant {{hadoop}} refactoring 
to make split information available without going through {{JobConf}}, so I'm 
not yet sure what the best course of action is.
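
A simplified illustration of the pattern the description refers to (this is not Spark's actual {{HadoopRDD}} code, and the input path is made up): a plain Hadoop {{Configuration}} is wrapped in a MapReduce {{JobConf}} solely to ask an {{InputFormat}} for its splits, which is what pulls in the MapReduce-specific {{TokenCache}} path.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

val hadoopConf = new Configuration()
val jobConf = new JobConf(hadoopConf)   // MapReduce-specific wrapper around the plain conf
FileInputFormat.setInputPaths(jobConf, "/some/input/path")

val inputFormat = new TextInputFormat()
inputFormat.configure(jobConf)
// getSplits -> FileInputFormat.listStatus -> TokenCache.obtainTokensForNamenodes,
// which is where "Can't get Master Kerberos principal" surfaces per the trace above.
val splits = inputFormat.getSplits(jobConf, 2)
{code}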



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20038) FileFormatWriter.ExecuteWriteTask.releaseResources() implementations to be re-entrant

2017-04-13 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-20038:


Assignee: Steve Loughran

> FileFormatWriter.ExecuteWriteTask.releaseResources() implementations to be 
> re-entrant
> -
>
> Key: SPARK-20038
> URL: https://issues.apache.org/jira/browse/SPARK-20038
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 2.2.0
>
>
> Both {{FileFormatWriter.ExecuteWriteTask.releaseResources()}} implementations 
> {{close()}} any non-null {{currentWriter}}, then set the field to null.
> However, if the close() call throws an exception during the execution of 
> {{FileFormatWriter.executeTask}}, the exception handler will attempt to 
> clean up by calling {{releaseResources()}} again. Looking at the codepath, 
> this may cause {{committer.abortTask()}} to get skipped on failure.
> This surfaces in SPARK-10109 (and I've just seen it in HADOOP-14204); Parquet 
> seems to be in the trace as it NPEs the second time its {{close()}} method 
> is called.
> Fix: always set {{currentWriter}} to null.
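
A hedged sketch of a re-entrant {{releaseResources()}} along the lines of the proposed fix (the field and writer types are simplified for illustration, not taken from Spark's code):

{code}
import java.io.Closeable

class ExampleWriteTask {
  // stands in for the task's output writer; Closeable keeps the sketch generic
  private var currentWriter: Closeable = _

  def releaseResources(): Unit = {
    if (currentWriter != null) {
      try {
        currentWriter.close()
      } finally {
        // Null the field even if close() throws, so a second call is a harmless
        // no-op and the caller's cleanup (e.g. committer.abortTask()) still runs.
        currentWriter = null
      }
    }
  }
}
{code}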



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20038) FileFormatWriter.ExecuteWriteTask.releaseResources() implementations to be re-entrant

2017-04-13 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-20038.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17364
[https://github.com/apache/spark/pull/17364]

> FileFormatWriter.ExecuteWriteTask.releaseResources() implementations to be 
> re-entrant
> -
>
> Key: SPARK-20038
> URL: https://issues.apache.org/jira/browse/SPARK-20038
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Steve Loughran
>Priority: Minor
> Fix For: 2.2.0
>
>
> Both {{FileFormatWriter.ExecuteWriteTask.releaseResources()}} implementations 
> {{close()}} any non-null {{currentWriter}}, then set the field to null.
> However, if the close() call throws an exception during the execution of 
> {{FileFormatWriter.executeTask}}, the exception handler will attempt to 
> clean up by calling {{releaseResources()}} again. Looking at the codepath, 
> this may cause {{committer.abortTask()}} to get skipped on failure.
> This surfaces in SPARK-10109 (and I've just seen it in HADOOP-14204); Parquet 
> seems to be in the trace as it NPEs the second time its {{close()}} method 
> is called.
> Fix: always set {{currentWriter}} to null.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20327) Add CLI support for YARN-3926

2017-04-13 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968149#comment-15968149
 ] 

Mark Grover commented on SPARK-20327:
-

Daniel, we don't assign JIRAs in Spark. Folks issue a PR and once the PR gets 
merged, the committer will assign the JIRA to the contributor.

> Add CLI support for YARN-3926
> -
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>  Labels: newbie
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20232) Better combineByKey documentation: clarify memory allocation, better example

2017-04-13 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-20232:
---

Assignee: David Gingrich

> Better combineByKey documentation: clarify memory allocation, better example
> 
>
> Key: SPARK-20232
> URL: https://issues.apache.org/jira/browse/SPARK-20232
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
> Environment: macOS Sierra 10.12.4
> Spark 2.1.0 installed via Homebrew
>Reporter: David Gingrich
>Assignee: David Gingrich
>Priority: Trivial
> Fix For: 2.2.0
>
>
> The combineByKey docs have a few flaws:
> - They don't include the note about memory allocation (present on aggregateByKey)
> - The example doesn't show the difference between mergeValue and mergeCombiners (both 
> are add)
> I have a trivial patch, will attach momentarily.
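
For illustration, a minimal sketch of a combineByKey call in which the three functions are deliberately different, so their roles show up in the result (assumes a running SparkContext {{sc}}; this is not the patch itself):

{code}
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// createCombiner, mergeValue and mergeCombiners are intentionally distinct here:
// mergeValue folds a single value into a partition-local combiner, while
// mergeCombiners merges two combiners produced by different partitions.
val grouped = pairs.combineByKey(
  (v: Int) => List(v),                       // createCombiner
  (acc: List[Int], v: Int) => v :: acc,      // mergeValue (within one partition)
  (a: List[Int], b: List[Int]) => a ::: b    // mergeCombiners (across partitions)
)
grouped.collect()  // e.g. Array(("a", List(2, 1)), ("b", List(3)))
{code}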



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20232) Better combineByKey documentation: clarify memory allocation, better example

2017-04-13 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-20232.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17545
[https://github.com/apache/spark/pull/17545]

> Better combineByKey documentation: clarify memory allocation, better example
> 
>
> Key: SPARK-20232
> URL: https://issues.apache.org/jira/browse/SPARK-20232
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
> Environment: macOS Sierra 10.12.4
> Spark 2.1.0 installed via Homebrew
>Reporter: David Gingrich
>Priority: Trivial
> Fix For: 2.2.0
>
>
> The combineByKey docs have a few flaws:
> - They don't include the note about memory allocation (present on aggregateByKey)
> - The example doesn't show the difference between mergeValue and mergeCombiners (both 
> are add)
> I have a trivial patch, will attach momentarily.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19909) Batches will fail in case that temporary checkpoint dir is on local file system while metadata dir is on HDFS

2017-04-13 Thread Rob (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968092#comment-15968092
 ] 

Rob commented on SPARK-19909:
-

Is there an easy way to avoid this issue while waiting for it to be resolved?  
Perhaps tweaking a setting?

> Batches will fail in case that temporary checkpoint dir is on local file 
> system while metadata dir is on HDFS
> -
>
> Key: SPARK-19909
> URL: https://issues.apache.org/jira/browse/SPARK-19909
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> When we try to run Structured Streaming in local mode but use HDFS for the 
> storage, batches will fail because of an error like the following.
> {code}
> val handle = stream.writeStream.format("console").start()
> 17/03/09 16:54:45 ERROR StreamMetadata: Error writing stream metadata 
> StreamMetadata(fc07a0b1-5423-483e-a59d-b2206a49491e) to 
> /private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=kou, access=WRITE, 
> inode="/private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata":hdfs:supergroup:drwxr-xr-x
> {code}
> This is because a temporary checkpoint directory is created on the local file 
> system, but the metadata, whose path is based on that checkpoint directory, is 
> created on HDFS.
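
Regarding the question above about avoiding this while it is unresolved: a possible workaround, sketched here under the assumption that an explicit checkpoint location is acceptable, is to point the query at an HDFS checkpoint directory instead of relying on the temporary local one (the path below is made up):

{code}
// Same query as in the description, but with an explicit HDFS checkpoint location
// so no local temporary directory is involved.
val handle = stream.writeStream
  .format("console")
  .option("checkpointLocation", "hdfs:///user/kou/checkpoints/console-query")
  .start()
{code}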



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19946) DebugFilesystem.assertNoOpenStreams should report the open streams to help debugging

2017-04-13 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-19946:
--
Fix Version/s: 2.1.1

> DebugFilesystem.assertNoOpenStreams should report the open streams to help 
> debugging
> 
>
> Key: SPARK-19946
> URL: https://issues.apache.org/jira/browse/SPARK-19946
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> In DebugFilesystem.assertNoOpenStreams if there are open streams an exception 
> is thrown showing the number of open streams. This doesn't help much to debug 
> where the open streams were leaked.
> The exception should also report where the stream was leaked. This can be 
> done through a cause exception.
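
A minimal sketch of the cause-exception idea (not the actual DebugFilesystem code; the names are made up): remember where each stream was opened and surface that location as the cause when a leak is detected.

{code}
import scala.collection.mutable

object OpenStreamTracker {
  private val openStreams = mutable.Map.empty[AnyRef, Throwable]

  def recordOpen(stream: AnyRef): Unit = synchronized {
    openStreams(stream) = new Throwable("stream opened here")  // captures the opening stack trace
  }

  def recordClose(stream: AnyRef): Unit = synchronized { openStreams -= stream }

  def assertNoOpenStreams(): Unit = synchronized {
    if (openStreams.nonEmpty) {
      // The cause shows where the first leaked stream was opened.
      throw new IllegalStateException(
        s"There are ${openStreams.size} possibly leaked file streams", openStreams.values.head)
    }
  }
}
{code}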



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20327) Add CLI support for YARN-3926

2017-04-13 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967913#comment-15967913
 ] 

Daniel Templeton commented on SPARK-20327:
--

YARN-3926 won't go in until Hadoop 3.0.0, so this one is definitely 
forward-looking.

> Add CLI support for YARN-3926
> -
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>  Labels: newbie
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14519) Cross-publish Kafka for Scala 2.12

2017-04-13 Thread Fabio Pinheiro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967897#comment-15967897
 ] 

Fabio Pinheiro edited comment on SPARK-14519 at 4/13/17 5:12 PM:
-

Kafka officially supports Scala 2.12 since '0.10.1.1', and it is the recommended 
version of Scala for the latest release '0.10.2.0' (KAFKA-4376 - Add scala 
2.12 support).
I have experience (as a user) with Scala, Spark and Kafka, but I am new to the 
community. Can I help with this ticket?


was (Author: fmgp):
Kafka official supports Scala 2.12 since '0.10.1.1' and it is the recommend 
version of Scala for the latest release '0.10.2.0'.
I have experience (as a user) with Scala, Spark and Kafka, but I new in the 
community. Can I help with this ticket?

> Cross-publish Kafka for Scala 2.12
> --
>
> Key: SPARK-14519
> URL: https://issues.apache.org/jira/browse/SPARK-14519
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>
> In order to build the streaming Kafka connector, we need to publish Kafka for 
> Scala 2.12.0-M4. Someone should file an issue against the Kafka project and 
> work with their developers to figure out what will block their upgrade / 
> release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14519) Cross-publish Kafka for Scala 2.12

2017-04-13 Thread Fabio Pinheiro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967897#comment-15967897
 ] 

Fabio Pinheiro commented on SPARK-14519:


Kafka officially supports Scala 2.12 since '0.10.1.1', and it is the recommended 
version of Scala for the latest release '0.10.2.0'.
I have experience (as a user) with Scala, Spark and Kafka, but I am new to the 
community. Can I help with this ticket?

> Cross-publish Kafka for Scala 2.12
> --
>
> Key: SPARK-14519
> URL: https://issues.apache.org/jira/browse/SPARK-14519
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>
> In order to build the streaming Kafka connector, we need to publish Kafka for 
> Scala 2.12.0-M4. Someone should file an issue against the Kafka project and 
> work with their developers to figure out what will block their upgrade / 
> release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20327) Add CLI support for YARN-3926

2017-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967886#comment-15967886
 ] 

Sean Owen commented on SPARK-20327:
---

What version of YARN would this require? Spark is still supporting 2.6.

> Add CLI support for YARN-3926
> -
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>  Labels: newbie
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20327) Add CLI support for YARN-3926

2017-04-13 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated SPARK-20327:
-
Labels: newbie  (was: )

> Add CLI support for YARN-3926
> -
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>  Labels: newbie
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20327) Add CLI support for YARN-3926

2017-04-13 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967855#comment-15967855
 ] 

Daniel Templeton commented on SPARK-20327:
--

[~mgrover], would you mind assigning this one to me?

> Add CLI support for YARN-3926
> -
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20327) Add CLI support for YARN-3926

2017-04-13 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated SPARK-20327:
-
Description: YARN-3926 adds the ability for administrators to configure 
custom resources, like GPUs.  This JIRA is to add support to Spark for 
requesting resources other than CPU virtual cores and memory.  See YARN-3926.

> Add CLI support for YARN-3926
> -
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20327) Add CLI support for YARN-3926

2017-04-13 Thread Daniel Templeton (JIRA)
Daniel Templeton created SPARK-20327:


 Summary: Add CLI support for YARN-3926
 Key: SPARK-20327
 URL: https://issues.apache.org/jira/browse/SPARK-20327
 Project: Spark
  Issue Type: Improvement
  Components: Spark Shell, Spark Submit
Affects Versions: 2.1.0
Reporter: Daniel Templeton






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20326) Run Spark in local mode from IntelliJ (or other IDE)

2017-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20326.
---
Resolution: Invalid

Questions belong on the mailing list u...@spark.apache.org

> Run Spark in local mode from IntelliJ (or other IDE)
> 
>
> Key: SPARK-20326
> URL: https://issues.apache.org/jira/browse/SPARK-20326
> Project: Spark
>  Issue Type: Question
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Marcello Oliva
>
> I wrote a simple Spark test in IntelliJ and it fails when creating the 
> SparkSession:
> val spark = SparkSession.builder().master("local[2]").getOrCreate()
> Error:
> java.lang.IllegalArgumentException: Can't get Kerberos realm
> ...
> Caused by: java.lang.reflect.InvocationTargetException
> ...
> Caused by: KrbException: Cannot locate default realm
> How do I run tests from an IDE? 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20326) Run Spark in local mode from IntelliJ (or other IDE)

2017-04-13 Thread Marcello Oliva (JIRA)
Marcello Oliva created SPARK-20326:
--

 Summary: Run Spark in local mode from IntelliJ (or other IDE)
 Key: SPARK-20326
 URL: https://issues.apache.org/jira/browse/SPARK-20326
 Project: Spark
  Issue Type: Question
  Components: Documentation
Affects Versions: 2.1.0
Reporter: Marcello Oliva


I wrote a simple Spark test in IntelliJ and it fails when creating the 
SparkSession:

val spark = SparkSession.builder().master("local[2]").getOrCreate()

Error:
java.lang.IllegalArgumentException: Can't get Kerberos realm
...
Caused by: java.lang.reflect.InvocationTargetException
...
Caused by: KrbException: Cannot locate default realm

How do I run tests from an IDE? 




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8586) SQL add jar command does not work well with Scala REPL

2017-04-13 Thread wasif masood (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967689#comment-15967689
 ] 

wasif masood commented on SPARK-8586:
-

I am using spark in Zeppelin and facing the same error. I am trying to add the 
jar using the following:
sqlContext.sql("ADD JAR /user/hive/udf/myLib.jar")
Please advise.

> SQL add jar command does not work well with Scala REPL
> --
>
> Key: SPARK-8586
> URL: https://issues.apache.org/jira/browse/SPARK-8586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 2.1.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> Seems SparkIMain always resets the context class loader in {{loadAndRunReq}}. 
> So, SerDe added through add jar command may not be loaded in the context 
> class loader when we lookup the table.
> For example, the following code will fail when we try to show the table. 
> {code}
> hive.sql("add jar sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar")
> hive.sql("drop table if exists jsonTable")
> hive.sql("CREATE TABLE jsonTable(key int, val string) ROW FORMAT SERDE 
> 'org.apache.hive.hcatalog.data.JsonSerDe'")
> hive.createDataFrame((1 to 100).map(i => (i, s"str$i"))).toDF("key", 
> "val").insertInto("jsonTable")
> hive.table("jsonTable").show
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-13 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-20287:
---

What you're describing is closer to the receiver-based implementation,
which had a number of issues.  What I tried to achieve with the direct
stream implementation was to have the driver figure out offset ranges
for the next batch, then have executors deterministically consume
exactly those messages with a 1:1 mapping between kafka partition and
spark partition.

If you have a single consumer subscribed to multiple topicpartitions,
you'll get intermingled messages for all of those partitions.  With
the new consumer api subscribed to multiple partitions, there isn't a
way to say "get topicpartition A until offset 1234", which is what we
need.
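
For illustration, a minimal sketch (not Spark's actual {{CachedKafkaConsumer}}) of consuming an exact offset range [fromOffset, untilOffset) from a single TopicPartition, which is the "get topicpartition A until offset 1234" capability described above; it assumes the range is actually available on the broker:

{code}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

def readRange(consumer: KafkaConsumer[String, String],
              tp: TopicPartition,
              fromOffset: Long,
              untilOffset: Long): Vector[String] = {
  consumer.assign(java.util.Arrays.asList(tp))  // explicit assignment, no group management
  consumer.seek(tp, fromOffset)
  var out = Vector.empty[String]
  var next = fromOffset
  while (next < untilOffset) {
    // poll and keep only records below the requested end offset
    for (record <- consumer.poll(1000L).records(tp).asScala
         if record.offset() < untilOffset) {
      out :+= record.value()
      next = record.offset() + 1
    }
  }
  out
}
{code}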




-- 
Cody Koeninger


> Kafka Consumer should be able to subscribe to more than one topic partition
> ---
>
> Key: SPARK-20287
> URL: https://issues.apache.org/jira/browse/SPARK-20287
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Stephane Maarek
>
> As I understand and as it stands, one Kafka Consumer is created for each 
> topic partition in the source Kafka topics, and they're cached.
> cf 
> https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly 
> inefficient:
> - Each Kafka consumer creates a connection to Kafka
> - Spark doesn't leverage the power of the Kafka consumers, which is that it 
> automatically assigns and balances partitions amongst all the consumers that 
> share the same group.id
> - You can still cache your Kafka consumer even if it has multiple partitions.
> I'm not sure about how that translates to the spark underlying RDD 
> architecture, but from a Kafka standpoint, I believe creating one consumer 
> per partition is a big overhead, and a risk as the user may have to increase 
> the spark.streaming.kafka.consumer.cache.maxCapacity parameter. 
> Happy to discuss to understand the rationale



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20233) Apply star-join filter heuristics to dynamic programming join enumeration

2017-04-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20233:
---

Assignee: Ioana Delaney

> Apply star-join filter heuristics to dynamic programming join enumeration
> -
>
> Key: SPARK-20233
> URL: https://issues.apache.org/jira/browse/SPARK-20233
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
>Assignee: Ioana Delaney
>Priority: Critical
> Fix For: 2.2.0
>
>
> This JIRA integrates star-join detection with the cost-based optimizer. 
> The join enumeration using dynamic programming generates a set of feasible 
> joins. The sub-optimal plans can be eliminated by a sequence of independent, 
> optional filters. The optional filters include heuristics for reducing the 
> search space. For example,
> # Star-join: Tables in a star schema relationship are planned together since 
> they are assumed to have an optimal execution.
> # Cartesian products: Cartesian products are deferred as late as possible to 
> avoid large intermediate results (expanding joins, in general).
> # Composite inners: “Bushy tree” plans are not generated to avoid 
> materializing intermediate result.
> For reference, see “Measuring the Complexity of Join Enumeration in Query 
> Optimization” by Ono et al.
> This JIRA implements the star join filter. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20233) Apply star-join filter heuristics to dynamic programming join enumeration

2017-04-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20233.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17546
[https://github.com/apache/spark/pull/17546]

> Apply star-join filter heuristics to dynamic programming join enumeration
> -
>
> Key: SPARK-20233
> URL: https://issues.apache.org/jira/browse/SPARK-20233
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
>Priority: Critical
> Fix For: 2.2.0
>
>
> This JIRA integrates star-join detection with the cost-based optimizer. 
> The join enumeration using dynamic programming generates a set of feasible 
> joins. The sub-optimal plans can be eliminated by a sequence of independent, 
> optional filters. The optional filters include heuristics for reducing the 
> search space. For example,
> # Star-join: Tables in a star schema relationship are planned together since 
> they are assumed to have an optimal execution.
> # Cartesian products: Cartesian products are deferred as late as possible to 
> avoid large intermediate results (expanding joins, in general).
> # Composite inners: “Bushy tree” plans are not generated to avoid 
> materializing intermediate result.
> For reference, see “Measuring the Complexity of Join Enumeration in Query 
> Optimization” by Ono et al.
> This JIRA implements the star join filter. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19946) DebugFilesystem.assertNoOpenStreams should report the open streams to help debugging

2017-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967599#comment-15967599
 ] 

Apache Spark commented on SPARK-19946:
--

User 'bogdanrdc' has created a pull request for this issue:
https://github.com/apache/spark/pull/17632

> DebugFilesystem.assertNoOpenStreams should report the open streams to help 
> debugging
> 
>
> Key: SPARK-19946
> URL: https://issues.apache.org/jira/browse/SPARK-19946
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
>Priority: Minor
> Fix For: 2.2.0
>
>
> In DebugFilesystem.assertNoOpenStreams if there are open streams an exception 
> is thrown showing the number of open streams. This doesn't help much to debug 
> where the open streams were leaked.
> The exception should also report where the stream was leaked. This can be 
> done through a cause exception.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter

2017-04-13 Thread pralabhkumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967574#comment-15967574
 ] 

pralabhkumar commented on SPARK-20199:
--

Shouldn't there be a parameter in GBMParameters() 

val gbmParams = new GBMParameters()
gbmParams._featuresubsetStrategy="auto"

and then in 

private[ml] def train(data: RDD[LabeledPoint],
  oldStrategy: OldStrategy): DecisionTreeRegressionModel = {

we should have
oldStrategy.getStrategy() and set it in

val trees = RandomForest.run(data, oldStrategy, numTrees = 1, 
featureSubsetStrategy = oldStrategy.getStrategy(), seed = $(seed), instr = 
Some(instr), parentUID = Some(uid))

> GradientBoostedTreesModel doesn't have  Column Sampling Rate Paramenter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark GradientBoostedTreesModel doesn't have a column sampling rate parameter. 
> This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai: 
> gbmParams._col_sample_rate
> Please provide the parameter.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20325) Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-13 Thread Kate Eri (JIRA)
Kate Eri created SPARK-20325:


 Summary: Spark Structured Streaming documentation Update: 
checkpoint configuration
 Key: SPARK-20325
 URL: https://issues.apache.org/jira/browse/SPARK-20325
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Kate Eri
Priority: Minor


I have configured the following stream outputting to Kafka:

{code}
map.foreach(metric => {
  streamToProcess
.groupBy(metric)
.agg(count(metric))
.writeStream
.outputMode("complete")
.option("checkpointLocation", checkpointDir)
.foreach(kafkaWriter)
.start()
})
{code}

And I configured the checkpoint dir for each of the output sinks like 
.option("checkpointLocation", checkpointDir), according to the documentation => 
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing
 
As a result I've got the following exception: 
Cannot start query with id bf6a1003-6252-4c62-8249-c6a189701255 as another 
query with same id is already active. Perhaps you are attempting to restart a 
query from checkpoint that is already active.
java.lang.IllegalStateException: Cannot start query with id 
bf6a1003-6252-4c62-8249-c6a189701255 as another query with same id is already 
active. Perhaps you are attempting to restart a query from checkpoint that is 
already active.
at 
org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:291)

So according to the current Spark logic for the “foreach” sink, the checkpoint 
configuration is loaded in the following way: 
{code:title=StreamingQueryManager.scala}
   val checkpointLocation = userSpecifiedCheckpointLocation.map { userSpecified 
=>
  new Path(userSpecified).toUri.toString
}.orElse {
  df.sparkSession.sessionState.conf.checkpointLocation.map { location =>
new Path(location, 
userSpecifiedName.getOrElse(UUID.randomUUID().toString)).toUri.toString
  }
}.getOrElse {
  if (useTempCheckpointLocation) {
Utils.createTempDir(namePrefix = s"temporary").getCanonicalPath
  } else {
throw new AnalysisException(
  "checkpointLocation must be specified either " +
"""through option("checkpointLocation", ...) or """ +
s"""SparkSession.conf.set("${SQLConf.CHECKPOINT_LOCATION.key}", 
...)""")
  }
}
{code}

So Spark first takes the checkpoint dir from the query, then from the SparkSession 
(spark.sql.streaming.checkpointLocation), and so on. 
But this behavior is not documented, hence two questions:
1) Could we update the documentation for Structured Streaming to describe this 
behavior?
2) Do we really need to specify the checkpoint dir per query? What is the reason 
for this? Otherwise we will be forced to write some checkpointDir name generator, 
for example associating it with a particular named query (see the sketch below).
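
A hedged sketch of one such naming scheme, reusing the variables from the snippet above ({{map}}, {{streamToProcess}}, {{checkpointDir}}, {{kafkaWriter}}); the "counts-$metric" convention is made up for illustration:

{code}
import org.apache.spark.sql.functions.count

map.foreach { metric =>
  streamToProcess
    .groupBy(metric)
    .agg(count(metric))
    .writeStream
    .outputMode("complete")
    .queryName(s"counts-$metric")                                   // distinct query name per metric
    .option("checkpointLocation", s"$checkpointDir/counts-$metric") // distinct checkpoint dir per query
    .foreach(kafkaWriter)
    .start()
}
{code}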



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2017-04-13 Thread Mathieu Boespflug (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967554#comment-15967554
 ] 

Mathieu Boespflug commented on SPARK-650:
-

[~Skamandros] how did you manage to hook `JavaSerializer`? I tried doing so 
myself, by defining a new subclass, but then I need to make sure that the new class 
is installed on all executors, meaning I have to copy a .jar to all my nodes 
manually. For some reason Spark won't look for the serializer inside my 
application JAR.

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20323) Calling stop in a transform stage causes the app to hang

2017-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967535#comment-15967535
 ] 

Sean Owen commented on SPARK-20323:
---

Yeah, I don't think it would do anything because that's not even the same copy 
of the context. If it happens to execute locally it might work. It might cause 
undefined behavior. If there's an easy way to explicitly error in that case, 
great, but otherwise I would just avoid it.

> Calling stop in a transform stage causes the app to hang
> 
>
> Key: SPARK-20323
> URL: https://issues.apache.org/jira/browse/SPARK-20323
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrei Taleanu
>
> I'm not sure if this is a bug or just the way it needs to happen, but I've run 
> into this issue with the following code:
> {noformat}
> object ImmortalStreamingJob extends App {
>   val conf = new SparkConf().setAppName("fun-spark").setMaster("local[*]")
>   val ssc  = new StreamingContext(conf, Seconds(1))
>   val elems = (1 to 1000).grouped(10)
> .map(seq => ssc.sparkContext.parallelize(seq))
> .toSeq
>   val stream = ssc.queueStream(mutable.Queue[RDD[Int]](elems: _*))
>   val transformed = stream.transform { rdd =>
> try {
>   if (Random.nextInt(6) == 5) throw new RuntimeException("boom")
>   else println("lucky bastard")
>   rdd
> } catch {
>   case e: Throwable =>
> println("stopping streaming context", e)
> ssc.stop(stopSparkContext = true, stopGracefully = false)
> throw e
> }
>   }
>   transformed.foreachRDD { rdd =>
> println(rdd.collect().mkString(","))
>   }
>   ssc.start()
>   ssc.awaitTermination()
> }
> {noformat}
> There are two things I can note here:
> * if the exception is thrown in the first transformation (when the first RDD 
> is processed), the spark context is stopped and the app dies
> * if the exception is thrown after at least one RDD has been processed, the 
> app hangs after printing the error message and never stops
> I think there's some sort of deadlock in the second case, is that normal? I 
> also asked this 
> [here|http://stackoverflow.com/questions/43273783/immortal-spark-streaming-job/43373624#43373624]
>  but up to this point there's no answer pointing exactly to what happens, 
> only guidelines.
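
In line with the advice in SPARK-20321 to stop only from the main thread, a hedged sketch of an alternative structure for the snippet above (reusing its {{ssc}}, {{stream}} and {{Random}}): the transform closure, which runs on the driver, only records the failure, and the main thread performs the stop.

{code}
import java.util.concurrent.atomic.AtomicReference

val failure = new AtomicReference[Throwable](null)

val transformed = stream.transform { rdd =>
  try {
    if (Random.nextInt(6) == 5) throw new RuntimeException("boom")
    rdd
  } catch {
    case e: Throwable =>
      failure.set(e)   // remember the error instead of stopping from here
      throw e
  }
}
transformed.foreachRDD(rdd => println(rdd.collect().mkString(",")))

ssc.start()
// Poll from the main thread and stop there once a failure has been recorded.
while (failure.get() == null && !ssc.awaitTerminationOrTimeout(1000L)) {}
if (failure.get() != null) ssc.stop(stopSparkContext = true, stopGracefully = false)
{code}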



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20323) Calling stop in a transform stage causes the app to hang

2017-04-13 Thread Andrei Taleanu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967516#comment-15967516
 ] 

Andrei Taleanu commented on SPARK-20323:


All right, so the behaviour is undefined in that case?

> Calling stop in a transform stage causes the app to hang
> 
>
> Key: SPARK-20323
> URL: https://issues.apache.org/jira/browse/SPARK-20323
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrei Taleanu
>
> I'm not sure if this is a bug or just the way it needs to happen, but I've run 
> into this issue with the following code:
> {noformat}
> object ImmortalStreamingJob extends App {
>   val conf = new SparkConf().setAppName("fun-spark").setMaster("local[*]")
>   val ssc  = new StreamingContext(conf, Seconds(1))
>   val elems = (1 to 1000).grouped(10)
> .map(seq => ssc.sparkContext.parallelize(seq))
> .toSeq
>   val stream = ssc.queueStream(mutable.Queue[RDD[Int]](elems: _*))
>   val transformed = stream.transform { rdd =>
> try {
>   if (Random.nextInt(6) == 5) throw new RuntimeException("boom")
>   else println("lucky bastard")
>   rdd
> } catch {
>   case e: Throwable =>
> println("stopping streaming context", e)
> ssc.stop(stopSparkContext = true, stopGracefully = false)
> throw e
> }
>   }
>   transformed.foreachRDD { rdd =>
> println(rdd.collect().mkString(","))
>   }
>   ssc.start()
>   ssc.awaitTermination()
> }
> {noformat}
> There are two things I can note here:
> * if the exception is thrown in the first transformation (when the first RDD 
> is processed), the Spark context is stopped and the app dies
> * if the exception is thrown after at least one RDD has been processed, the 
> app hangs after printing the error message and never stops
> I think there's some sort of deadlock in the second case; is that normal? I 
> also asked this 
> [here|http://stackoverflow.com/questions/43273783/immortal-spark-streaming-job/43373624#43373624]
>  but up to this point there's no answer pointing exactly to what happens, 
> only guidelines.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20321) Spark UI cannot be shutdown in spark streaming app

2017-04-13 Thread Andrei Taleanu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967514#comment-15967514
 ] 

Andrei Taleanu commented on SPARK-20321:


I'm not sure what would help. I could provide more logs, but I'm not sure how 
that helps as they're typical stopping logs. Also, the SparkUI remains 
accessible when this happens and you can see a single job that runs forever. 
I've tried recreating a minimal example, but somehow I can't reproduce it in 
another minimal streaming app. In the app where I've encountered this issue it 
happens every time, so it's really easy to reproduce. I'm not sure what I should 
be looking at so I can provide more details.

> Spark UI cannot be shutdown in spark streaming app
> --
>
> Key: SPARK-20321
> URL: https://issues.apache.org/jira/browse/SPARK-20321
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrei Taleanu
>
> When an exception thrown in the transform stage is handled in foreachRDD and 
> the streaming context is forced to stop, the SparkUI appears to hang and 
> continually dump the following logs in an infinite loop:
> {noformat}
> ...
> 2017-04-12 14:11:13,470 
> [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG 
> org.spark_project.jetty.io.SelectorManager - Selector loop woken up from 
> select, 0/0 selected
> 2017-04-12 14:11:13,470 
> [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG 
> org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select
> 2017-04-12 14:11:13,470 
> [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG 
> org.spark_project.jetty.io.SelectorManager - Selector loop woken up from 
> select, 0/0 selected
> 2017-04-12 14:11:13,470 
> [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG 
> org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select
> ...
> {noformat}
> Unfortunately I don't have a minimal example that reproduces this issue but 
> here is what I can share:
> {noformat}
> val dstream = pull data from kafka
> val mapped = dstream.transform { rdd =>
>   val data = getData // Perform a call that potentially throws an exception
>   // broadcast the data
>   // flatMap the RDD using the data
> }
> mapped.foreachRDD {
>   try {
> // write some data in a DB
>   } catch {
> case t: Throwable =>
>   dstream.context.stop(stopSparkContext = true, stopGracefully = false)
>   }
> }
> mapped.foreachRDD {
>   try {
> // write data to Kafka
> // manually checkpoint the Kafka offsets (because I need them in JSON 
> format)
>   } catch {
> case t: Throwable =>
>   dstream.context.stop(stopSparkContext = true, stopGracefully = false)
>   }
> }
> {noformat}
> The issue appears when stop is invoked. At the point when SparkUI is stopped, 
> it enters that infinite loop. Initially I thought it relates to Jetty, as the 
> version used in SparkUI had some bugs (e.g. [this 
> one|https://bugs.eclipse.org/bugs/show_bug.cgi?id=452465]). I bumped Jetty to 
> a more recent version (March 2017) and built Spark 2.1.0 with that one but 
> still got the error.
> I encountered this issue with Spark 2.1.0 built with Hadoop 2.6 on top of 
> Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20321) Spark UI cannot be shutdown in spark streaming app

2017-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967502#comment-15967502
 ] 

Sean Owen commented on SPARK-20321:
---

I don't recall any other reports like this. Yeah, it's hard to action this 
without a reproduction. Unless you have more detail about what's executing, 
what's stuck where, I am not sure how this would proceed.

> Spark UI cannot be shutdown in spark streaming app
> --
>
> Key: SPARK-20321
> URL: https://issues.apache.org/jira/browse/SPARK-20321
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrei Taleanu
>
> When an exception thrown in the transform stage is handled in foreachRDD and 
> the streaming context is forced to stop, the SparkUI appears to hang and 
> continually dump the following logs in an infinite loop:
> {noformat}
> ...
> 2017-04-12 14:11:13,470 
> [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG 
> org.spark_project.jetty.io.SelectorManager - Selector loop woken up from 
> select, 0/0 selected
> 2017-04-12 14:11:13,470 
> [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG 
> org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select
> 2017-04-12 14:11:13,470 
> [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG 
> org.spark_project.jetty.io.SelectorManager - Selector loop woken up from 
> select, 0/0 selected
> 2017-04-12 14:11:13,470 
> [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG 
> org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select
> ...
> {noformat}
> Unfortunately I don't have a minimal example that reproduces this issue but 
> here is what I can share:
> {noformat}
> val dstream = pull data from kafka
> val mapped = dstream.transform { rdd =>
>   val data = getData // Perform a call that potentially throws an exception
>   // broadcast the data
>   // flatMap the RDD using the data
> }
> mapped.foreachRDD {
>   try {
> // write some data in a DB
>   } catch {
> case t: Throwable =>
>   dstream.context.stop(stopSparkContext = true, stopGracefully = false)
>   }
> }
> mapped.foreachRDD {
>   try {
> // write data to Kafka
> // manually checkpoint the Kafka offsets (because I need them in JSON 
> format)
>   } catch {
> case t: Throwable =>
>   dstream.context.stop(stopSparkContext = true, stopGracefully = false)
>   }
> }
> {noformat}
> The issue appears when stop is invoked. At the point when SparkUI is stopped, 
> it enters that infinite loop. Initially I thought it relates to Jetty, as the 
> version used in SparkUI had some bugs (e.g. [this 
> one|https://bugs.eclipse.org/bugs/show_bug.cgi?id=452465]). I bumped Jetty to 
> a more recent version (March 2017) and built Spark 2.1.0 with that one but 
> still got the error.
> I encountered this issue with Spark 2.1.0 built with Hadoop 2.6 on top of 
> Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20323) Calling stop in a transform stage causes the app to hang

2017-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967500#comment-15967500
 ] 

Sean Owen commented on SPARK-20323:
---

You can't manipulate the context in a transform.
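For what it's worth, a minimal sketch of the restructuring that implies, reusing the names from the example above: let the exception escape the transform and handle it on the driver thread around awaitTermination() (which rethrows the error that stopped the execution), instead of calling stop() inside the DStream operation. This is a sketch under those assumptions, not a verified fix for the reported hang.

{code:language=scala}
// Sketch only: rethrow from the transform, stop the context from the driver thread.
val transformed = stream.transform { rdd =>
  if (Random.nextInt(6) == 5) throw new RuntimeException("boom")
  rdd
}
transformed.foreachRDD(rdd => println(rdd.collect().mkString(",")))

ssc.start()
try {
  ssc.awaitTermination()          // rethrows the error that stopped the streaming job
} catch {
  case e: Throwable => println(s"streaming failed: $e")
} finally {
  ssc.stop(stopSparkContext = true, stopGracefully = false)  // safe here: driver thread
}
{code}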

> Calling stop in a transform stage causes the app to hang
> 
>
> Key: SPARK-20323
> URL: https://issues.apache.org/jira/browse/SPARK-20323
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrei Taleanu
>
> I'm not sure if this is a bug or just the way it needs to happen, but I've run 
> into this issue with the following code:
> {noformat}
> object ImmortalStreamingJob extends App {
>   val conf = new SparkConf().setAppName("fun-spark").setMaster("local[*]")
>   val ssc  = new StreamingContext(conf, Seconds(1))
>   val elems = (1 to 1000).grouped(10)
> .map(seq => ssc.sparkContext.parallelize(seq))
> .toSeq
>   val stream = ssc.queueStream(mutable.Queue[RDD[Int]](elems: _*))
>   val transformed = stream.transform { rdd =>
> try {
>   if (Random.nextInt(6) == 5) throw new RuntimeException("boom")
>   else println("lucky bastard")
>   rdd
> } catch {
>   case e: Throwable =>
> println("stopping streaming context", e)
> ssc.stop(stopSparkContext = true, stopGracefully = false)
> throw e
> }
>   }
>   transformed.foreachRDD { rdd =>
> println(rdd.collect().mkString(","))
>   }
>   ssc.start()
>   ssc.awaitTermination()
> }
> {noformat}
> There are two things I can note here:
> * if the exception is thrown in the first transformation (when the first RDD 
> is processed), the Spark context is stopped and the app dies
> * if the exception is thrown after at least one RDD has been processed, the 
> app hangs after printing the error message and never stops
> I think there's some sort of deadlock in the second case; is that normal? I 
> also asked this 
> [here|http://stackoverflow.com/questions/43273783/immortal-spark-streaming-job/43373624#43373624]
>  but up to this point there's no answer pointing exactly to what happens, 
> only guidelines.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20324) Control itemSets length in PrefixSpan

2017-04-13 Thread Cyril de Vogelaere (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cyril de Vogelaere updated SPARK-20324:
---
Description: 
The idea behind this improvement would be to allow better control over the size 
of itemSets in solution patterns.

For example, assume you possess a huge dataset of series of products bought 
together, one sequence per client, and you want to find items frequently bought 
in pairs, so as to make interesting promotions to your clients or boost certain 
sales.

In the current implementation, all solutions would have to be calculated before 
the user can sort through them and select only the interesting ones.

What I'm proposing here is the addition of two parameters:

First, a maxItemPerItemset parameter, which would limit the maximum number of 
items per itemset to a certain size X. This allows a potentially important 
reduction of the search space, hastening the process of finding these specific 
solutions.

Second, a tandem minItemPerItemset parameter, which would limit the minimum 
number of items per itemset, discarding solutions that do not fit this 
constraint. Although this wouldn't entail a reduction of the search space, it 
should still allow interested users to reduce the number of solutions collected 
by the driver.

If this improvement proposition seems interesting to the community, I will 
implement a solution along with tests to guarantee the correctness of its 
implementation.



  was:
The idea behind this improvement would be to allow better control over the size 
of itemSets in solution patterns.

For example, assume you possess a huge dataset of series of products bought 
together, one sequence per client, and you want to find items frequently bought 
in pairs, so as to make interesting promotions to your clients or boost certain 
sales.

In the current implementation, all solutions would have to be calculated before 
the user can sort through them and select only the interesting ones.

What I'm proposing here is the addition of two parameters:

First, a maxItemPerItemset parameter, which would limit the maximum number of 
items per itemset to a certain size X. This allows a potentially important 
reduction of the search space, hastening the process of finding these specific 
solutions.

Second, a tandem minItemPerItemset parameter, which would limit the minimum 
number of items per itemset, discarding solutions that do not fit this 
constraint. Although this wouldn't entail a reduction of the search space, it 
should still allow interested users to reduce the number of solutions collected 
by the driver.

If this solution seems interesting to the community, I will implement a 
solution along with tests to guarantee the correctness of its implementation.




> Control itemSets length in PrefixSpan
> -
>
> Key: SPARK-20324
> URL: https://issues.apache.org/jira/browse/SPARK-20324
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Cyril de Vogelaere
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The idea behind this improvement would be to allow better control over the 
> size of itemSets in solution patterns.
> For example, assume you possess a huge dataset of series of products bought 
> together, one sequence per client, and you want to find items frequently 
> bought in pairs, so as to make interesting promotions to your clients or boost 
> certain sales.
> In the current implementation, all solutions would have to be calculated 
> before the user can sort through them and select only the interesting ones.
> What I'm proposing here is the addition of two parameters:
> First, a maxItemPerItemset parameter, which would limit the maximum number of 
> items per itemset to a certain size X. This allows a potentially important 
> reduction of the search space, hastening the process of finding these specific 
> solutions.
> Second, a tandem minItemPerItemset parameter, which would limit the minimum 
> number of items per itemset, discarding solutions that do not fit this 
> constraint. Although this wouldn't entail a reduction of the search space, it 
> should still allow interested users to reduce the number of solutions 
> collected by the driver.
> If this improvement proposition seems interesting to the community, I will 
> implement a solution along with tests to guarantee the correctness of its 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20324) Control itemSets length in PrefixSpan

2017-04-13 Thread Cyril de Vogelaere (JIRA)
Cyril de Vogelaere created SPARK-20324:
--

 Summary: Control itemSets length in PrefixSpan
 Key: SPARK-20324
 URL: https://issues.apache.org/jira/browse/SPARK-20324
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.1.0
Reporter: Cyril de Vogelaere
Priority: Minor


The idea behind this improvement would be to allow better control over the size 
of itemSets in solution patterns.

For example, assume you possess a huge dataset of series of products bought 
together, one sequence per client, and you want to find items frequently bought 
in pairs, so as to make interesting promotions to your clients or boost certain 
sales.

In the current implementation, all solutions would have to be calculated before 
the user can sort through them and select only the interesting ones.

What I'm proposing here is the addition of two parameters:

First, a maxItemPerItemset parameter, which would limit the maximum number of 
items per itemset to a certain size X. This allows a potentially important 
reduction of the search space, hastening the process of finding these specific 
solutions.

Second, a tandem minItemPerItemset parameter, which would limit the minimum 
number of items per itemset, discarding solutions that do not fit this 
constraint. Although this wouldn't entail a reduction of the search space, it 
should still allow interested users to reduce the number of solutions collected 
by the driver.

If this solution seems interesting to the community, I will implement a 
solution along with tests to guarantee the correctness of its implementation.
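For reference, a rough sketch of how the proposed parameters might sit on top of the existing MLlib API. The data and the existing setters follow the standard PrefixSpan usage; the two commented-out setters are the hypothetical additions proposed here, not existing API.

{code:language=scala}
import org.apache.spark.mllib.fpm.PrefixSpan

val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))
), 2).cache()

val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  // hypothetical setters from this proposal, not existing API:
  // .setMaxItemsPerItemset(2)  // prune the search to itemsets of at most 2 items
  // .setMinItemsPerItemset(2)  // drop patterns containing itemsets of fewer than 2 items

val model = prefixSpan.run(sequences)
model.freqSequences.collect().foreach { fs =>
  println(fs.sequence.map(_.mkString("[", ",", "]")).mkString + ", " + fs.freq)
}
{code}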





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20321) Spark UI cannot be shutdown in spark streaming app

2017-04-13 Thread Andrei Taleanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Taleanu updated SPARK-20321:
---
Description: 
When an exception thrown in the transform stage is handled in foreachRDD and the 
streaming context is forced to stop, the SparkUI appears to hang and 
continually dump the following logs in an infinite loop:
{noformat}
...
2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] 
DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop woken up from 
select, 0/0 selected
2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] 
DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop waiting on 
select
2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] 
DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop woken up from 
select, 0/0 selected
2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] 
DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop waiting on 
select
...
{noformat}

Unfortunately I don't have a minimal example that reproduces this issue but 
here is what I can share:
{noformat}
val dstream = pull data from kafka
val mapped = dstream.transform { rdd =>
  val data = getData // Perform a call that potentially throws an exception
  // broadcast the data
  // flatMap the RDD using the data
}

mapped.foreachRDD {
  try {
// write some data in a DB
  } catch {
case t: Throwable =>
  dstream.context.stop(stopSparkContext = true, stopGracefully = false)
  }
}

mapped.foreachRDD {
  try {
// write data to Kafka
// manually checkpoint the Kafka offsets (because I need them in JSON 
format)
  } catch {
case t: Throwable =>
  dstream.context.stop(stopSparkContext = true, stopGracefully = false)
  }
}
{noformat}

The issue appears when stop is invoked. At the point when SparkUI is stopped, 
it enters that infinite loop. Initially I thought it relates to Jetty, as the 
version used in SparkUI had some bugs (e.g. [this 
one|https://bugs.eclipse.org/bugs/show_bug.cgi?id=452465]). I bumped Jetty to a 
more recent version (March 2017) and built Spark 2.1.0 with that one but still 
got the error.

I encountered this issue with Spark 2.1.0 built with Hadoop 2.6 on top of Mesos.

  was:
When an exception thrown in the transform stage is handled in foreachRDD and the 
streaming context is forced to stop, the SparkUI appears to hang and 
continually dump the following logs in an infinite loop:
{noformat}
...
2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] 
DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop woken up from 
select, 0/0 selected
2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] 
DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop waiting on 
select
2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] 
DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop woken up from 
select, 0/0 selected
2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] 
DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop waiting on 
select
...
{noformat}

Unfortunately I don't have a minimal example that reproduces this issue but 
here is what I can share:
{noformat}
val dstream = pull data from kafka
val mapped = dstream transform { rdd =>
  val data = getData // Perform a call that potentially throws an exception
  // broadcast the data
  // flatMap the RDD using the data
}

mapped.foreachRDD {
  try {
// write some data in a DB
  } catch {
case t: Throwable =>
  dstream.context.stop(stopSparkContext = true, stopGracefully = false)
  }
}

mapped.foreachRDD {
  try {
// write data to Kafka
// manually checkpoint the Kafka offsets (because I need them in JSON 
format)
  } catch {
case t: Throwable =>
  dstream.context.stop(stopSparkContext = true, stopGracefully = false)
  }
}
{noformat}

The issue appears when stop is invoked. At the point when SparkUI is stopped, 
it enters that infinite loop. Initially I thought it relates to Jetty, as the 
version used in SparkUI had some bugs (e.g. [this 
one|https://bugs.eclipse.org/bugs/show_bug.cgi?id=452465]). I bumped Jetty to a 
more recent version (March 2017) but still got the error.

I encountered this issue with Spark 2.1.0 built with Hadoop 2.6 on top of Mesos.


> Spark UI cannot be shutdown in spark streaming app
> --
>
> Key: SPARK-20321
> URL: https://issues.apache.org/jira/browse/SPARK-20321
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrei Taleanu
>
> When an exception thrown in the transform stage is handled in foreachRDD and the 
> 

[jira] [Created] (SPARK-20323) Calling stop in a transform stage causes the app to hang

2017-04-13 Thread Andrei Taleanu (JIRA)
Andrei Taleanu created SPARK-20323:
--

 Summary: Calling stop in a transform stage causes the app to hang
 Key: SPARK-20323
 URL: https://issues.apache.org/jira/browse/SPARK-20323
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Andrei Taleanu


I'm not sure if this is a bug or just the way it needs to happen, but I've run 
into this issue with the following code:
{noformat}
object ImmortalStreamingJob extends App {
  val conf = new SparkConf().setAppName("fun-spark").setMaster("local[*]")
  val ssc  = new StreamingContext(conf, Seconds(1))

  val elems = (1 to 1000).grouped(10)
.map(seq => ssc.sparkContext.parallelize(seq))
.toSeq
  val stream = ssc.queueStream(mutable.Queue[RDD[Int]](elems: _*))

  val transformed = stream.transform { rdd =>
try {
  if (Random.nextInt(6) == 5) throw new RuntimeException("boom")
  else println("lucky bastard")
  rdd
} catch {
  case e: Throwable =>
println("stopping streaming context", e)
ssc.stop(stopSparkContext = true, stopGracefully = false)
throw e
}
  }

  transformed.foreachRDD { rdd =>
println(rdd.collect().mkString(","))
  }

  ssc.start()
  ssc.awaitTermination()
}
{noformat}

There are two things I can note here:
* if the exception is thrown in the first transformation (when the first RDD is 
processed), the Spark context is stopped and the app dies
* if the exception is thrown after at least one RDD has been processed, the app 
hangs after printing the error message and never stops

I think there's some sort of deadlock in the second case; is that normal? I 
also asked this 
[here|http://stackoverflow.com/questions/43273783/immortal-spark-streaming-job/43373624#43373624]
 but up to this point there's no answer pointing exactly to what happens, only 
guidelines.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20322) MinPattern length in PrefixSpan

2017-04-13 Thread Cyril de Vogelaere (JIRA)
Cyril de Vogelaere created SPARK-20322:
--

 Summary: MinPattern length in PrefixSpan
 Key: SPARK-20322
 URL: https://issues.apache.org/jira/browse/SPARK-20322
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.1.0
Reporter: Cyril de Vogelaere
Priority: Trivial


The current implementation already provides a maxPatternLength parameter. It 
might be nice to add a similar minPatternLength parameter.

This would allow users to better control the solutions received, and lower the 
amount of memory needed to store the solutions in the driver.

This can be implemented as a simple check before adding a solution to the 
solution list, so no difference in performance should be observable.

I'm proposing this functionality here; if it looks interesting to other people, 
I will implement it along with tests to verify its implementation.
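As a rough illustration, under the current API the same effect can be approximated by filtering the result after the fact; the proposal would simply apply this check inside PrefixSpan before a pattern is added to the result list, so discarded patterns never reach the driver. The helper below is a sketch of that check, not existing API.

{code:language=scala}
import org.apache.spark.mllib.fpm.PrefixSpanModel

// Sketch only: keep frequent sequences whose total item count reaches minPatternLength.
def keepLongPatterns[Item](model: PrefixSpanModel[Item], minPatternLength: Int) =
  model.freqSequences.filter(_.sequence.map(_.length).sum >= minPatternLength)
{code}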



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20322) MinPattern length in PrefixSpan

2017-04-13 Thread Cyril de Vogelaere (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cyril de Vogelaere updated SPARK-20322:
---
Description: 
The current implementation already provides a maxPatternLength parameter.
It might be nice to add a similar minPatternLength parameter.

This would allow users to better control the solutions received, and lower the 
amount of memory needed to store the solutions in the driver.

This can be implemented as a simple check before adding a solution to the 
solution list, so no difference in performance should be observable.
I'm proposing this functionality here; if it looks interesting to other people, 
I will implement it along with tests to verify its implementation.

  was:
The current implementation already provides a maxPatternLength parameter. It 
might be nice to add a similar minPatternLength parameter.

This would allow users to better control the solutions received, and lower the 
amount of memory needed to store the solutions in the driver.

This can be implemented as a simple check before adding a solution to the 
solution list, so no difference in performance should be observable.

I'm proposing this functionality here; if it looks interesting to other people, 
I will implement it along with tests to verify its implementation.


> MinPattern length in PrefixSpan
> ---
>
> Key: SPARK-20322
> URL: https://issues.apache.org/jira/browse/SPARK-20322
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Cyril de Vogelaere
>Priority: Trivial
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The current implementation already provides a maxPatternLength parameter.
> It might be nice to add a similar minPatternLength parameter. 
> This would allow users to better control the solutions received, and lower the 
> amount of memory needed to store the solutions in the driver.
> This can be implemented as a simple check before adding a solution to the 
> solution list, so no difference in performance should be observable.
> I'm proposing this functionality here; if it looks interesting to other 
> people, I will implement it along with tests to verify its implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20321) Spark UI cannot be shutdown in spark streaming app

2017-04-13 Thread Andrei Taleanu (JIRA)
Andrei Taleanu created SPARK-20321:
--

 Summary: Spark UI cannot be shutdown in spark streaming app
 Key: SPARK-20321
 URL: https://issues.apache.org/jira/browse/SPARK-20321
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Andrei Taleanu


When an exception thrown in the transform stage is handled in foreachRDD and the 
streaming context is forced to stop, the SparkUI appears to hang and 
continually dump the following logs in an infinite loop:
{noformat}
...
2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] 
DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop woken up from 
select, 0/0 selected
2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] 
DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop waiting on 
select
2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] 
DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop woken up from 
select, 0/0 selected
2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] 
DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop waiting on 
select
...
{noformat}

Unfortunately I don't have a minimal example that reproduces this issue but 
here is what I can share:
{noformat}
val dstream = pull data from kafka
val mapped = dstream transform { rdd =>
  val data = getData // Perform a call that potentially throws an exception
  // broadcast the data
  // flatMap the RDD using the data
}

mapped.foreachRDD {
  try {
// write some data in a DB
  } catch {
case t: Throwable =>
  dstream.context.stop(stopSparkContext = true, stopGracefully = false)
  }
}

mapped.foreachRDD {
  try {
// write data to Kafka
// manually checkpoint the Kafka offsets (because I need them in JSON 
format)
  } catch {
case t: Throwable =>
  dstream.context.stop(stopSparkContext = true, stopGracefully = false)
  }
}
{noformat}

The issue appears when stop is invoked. At the point when SparkUI is stopped, 
it enters that infinite loop. Initially I thought it relates to Jetty, as the 
version used in SparkUI had some bugs (e.g. [this 
one|https://bugs.eclipse.org/bugs/show_bug.cgi?id=452465]). I bumped Jetty to a 
more recent version (March 2017) but still got the error.

I encountered this issue with Spark 2.1.0 built with Hadoop 2.6 on top of Mesos.
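An untested workaround sketch, reusing the {{mapped}} DStream from the pseudocode above and assuming {{ssc}} is the StreamingContext: record the failure instead of calling stop() on the streaming scheduler's own thread, and stop from a separate driver-side thread. Whether this avoids the hang here is an assumption, not something that has been verified.

{code:language=scala}
// Untested sketch: do not call stop() from inside the output operation.
import java.util.concurrent.atomic.AtomicReference

val failure = new AtomicReference[Throwable](null)

mapped.foreachRDD { rdd =>
  try {
    // write some data in a DB
  } catch {
    case t: Throwable => failure.compareAndSet(null, t)  // just record the failure
  }
}

ssc.start()
new Thread("streaming-stopper") {
  override def run(): Unit = {
    while (failure.get() == null) Thread.sleep(1000)
    // stop is now invoked off the job/scheduler thread
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}.start()
ssc.awaitTermination()
{code}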



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19924) Handle InvocationTargetException for all Hive Shim

2017-04-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19924:

Fix Version/s: 2.1.1

> Handle InvocationTargetException for all Hive Shim
> --
>
> Key: SPARK-19924
> URL: https://issues.apache.org/jira/browse/SPARK-19924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.1, 2.2.0
>
>
> Since we are using shim for most Hive metastore APIs, the exceptions thrown 
> by the underlying method of Method.invoke() are wrapped by 
> `InvocationTargetException`. Instead of doing it one by one, we should handle 
> all of them in `withClient`. If any of them is missing, the error message 
> could look unfriendly.
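For reference, a minimal sketch of the unwrapping pattern described above ({{withClient}} is the name from the description; the body is illustrative, not the actual HiveExternalCatalog code):

{code:language=scala}
import java.lang.reflect.InvocationTargetException

// Illustrative only: unwrap reflective failures in one central place so every
// shim call surfaces the underlying Hive error instead of InvocationTargetException.
def withClient[T](body: => T): T =
  try {
    body
  } catch {
    case e: InvocationTargetException if e.getCause != null =>
      throw e.getCause  // rethrow the real cause from Method.invoke()
  }
{code}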



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels

2017-04-13 Thread Christian Reiniger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967403#comment-15967403
 ] 

Christian Reiniger commented on SPARK-20081:


Thanks for the information, but as you already suspected, it doesn't work. I'll 
look into using StringIndexer, but that change takes more time and the 
performance costs might be too high.

> RandomForestClassifier doesn't seem to support more than 100 labels
> ---
>
> Key: SPARK-20081
> URL: https://issues.apache.org/jira/browse/SPARK-20081
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
> Environment: Java
>Reporter: Christian Reiniger
>
> When feeding data with more than 100 labels into RandomForestClassifier#fit() 
> (from java code), I get the following error message:
> {code}
> Classifier inferred 143 from label values in column 
> rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) 
> allowed to be inferred from values.  
>   To avoid this error for labels with > 100 classes, specify numClasses 
> explicitly in the metadata; this can be done by applying StringIndexer to the 
> label column.
> {code}
> Setting "numClasses" in the metadata for the label column doesn't make a 
> difference. Looking at the code, this is not surprising, since 
> MetadataUtils.getNumClasses() ignores this setting:
> {code:language=scala}
>   def getNumClasses(labelSchema: StructField): Option[Int] = {
> Attribute.fromStructField(labelSchema) match {
>   case binAttr: BinaryAttribute => Some(2)
>   case nomAttr: NominalAttribute => nomAttr.getNumValues
>   case _: NumericAttribute | UnresolvedAttribute => None
> }
>   }
> {code}
> The alternative would be to pass a proper "maxNumClasses" parameter to the 
> classifier, so that Classifier#getNumClasses() allows a larger number of 
> auto-detected labels. However, RandomForestClassifier#train() calls 
> #getNumClasses without the "maxNumClasses" parameter, causing it to use the 
> default of 100:
> {code:language=scala}
>   override protected def train(dataset: Dataset[_]): 
> RandomForestClassificationModel = {
> val categoricalFeatures: Map[Int, Int] =
>   MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
> val numClasses: Int = getNumClasses(dataset)
> // ...
> {code}
> My scala skills are pretty sketchy, so please correct me if I misinterpreted 
> something. But as it seems right now, there is no way to learn from data with 
> more than 100 labels via RandomForestClassifier.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20320) AnalysisException: Columns of grouping_id (count(value#17L)) does not match grouping columns (count(value#17L))

2017-04-13 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-20320:

   Shepherd: Davies Liu  (was: Herman van Hovell)
Description: 
I'm not questioning the {{AnalysisException}} (which I don't know whether it 
should be reported or not), but the exception message, which tells...nothing 
helpful.

{code}
val records = spark.range(5).flatMap(n => Seq.fill(n.toInt)(n))
scala> 
records.cube(count("value")).agg(grouping_id(count("value"))).queryExecution.logical
org.apache.spark.sql.AnalysisException: Columns of grouping_id 
(count(value#17L)) does not match grouping columns (count(value#17L));
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$replaceGroupingFunc$1.applyOrElse(Analyzer.scala:313)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$replaceGroupingFunc$1.applyOrElse(Analyzer.scala:308)
{code}
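For contrast, a small sketch of a variant that does analyze (cube on the column itself rather than on an aggregate), assuming a spark-shell session so the implicits and {{spark}} are in scope:

{code:language=scala}
import org.apache.spark.sql.functions.{col, count, grouping_id}

val records = spark.range(5).flatMap(n => Seq.fill(n.toInt)(n))
// grouping on the plain column resolves fine; grouping_id() covers all grouping columns
records.cube(col("value")).agg(grouping_id(), count("value")).show()
{code}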

> AnalysisException: Columns of grouping_id (count(value#17L)) does not match 
> grouping columns (count(value#17L))
> ---
>
> Key: SPARK-20320
> URL: https://issues.apache.org/jira/browse/SPARK-20320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> I'm not questioning the {{AnalysisException}} (which I don't know whether it 
> should be reported or not), but the exception message, which tells...nothing 
> helpful.
> {code}
> val records = spark.range(5).flatMap(n => Seq.fill(n.toInt)(n))
> scala> 
> records.cube(count("value")).agg(grouping_id(count("value"))).queryExecution.logical
> org.apache.spark.sql.AnalysisException: Columns of grouping_id 
> (count(value#17L)) does not match grouping columns (count(value#17L));
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$replaceGroupingFunc$1.applyOrElse(Analyzer.scala:313)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$replaceGroupingFunc$1.applyOrElse(Analyzer.scala:308)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20320) AnalysisException: Columns of grouping_id (count(value#17L)) does not match grouping columns (count(value#17L))

2017-04-13 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-20320:
---

 Summary: AnalysisException: Columns of grouping_id 
(count(value#17L)) does not match grouping columns (count(value#17L))
 Key: SPARK-20320
 URL: https://issues.apache.org/jira/browse/SPARK-20320
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Jacek Laskowski
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels

2017-04-13 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967347#comment-15967347
 ] 

Yan Facai (颜发才) edited comment on SPARK-20081 at 4/13/17 9:42 AM:
--

Yes, you should use `builder.putLong("num_vals", numClasses).putString("type", 
"nominal")`. It's a little hacky, and it might not work. I am not familiar with 
the Metadata and Attribute classes at present. Some experts may have a better 
solution; unfortunately, I have no idea.

If you'd like to dig deeper, see:
org.apache.spark.ml.attribute.Attribute
org.apache.spark.ml.attribute.NominalAttribute

Using StringIndexer on your label column should work well; it takes care of 
this itself, I guess.
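A rough sketch of that StringIndexer route ({{df}} and the column names are placeholders, not anything from the report):

{code:language=scala}
import org.apache.spark.ml.feature.StringIndexer

// `df` and the column names are placeholders.
val indexed = new StringIndexer()
  .setInputCol("rawLabel")
  .setOutputCol("label")
  .fit(df)
  .transform(df)
// The "label" column now carries NominalAttribute metadata with the number of
// distinct values, so the classifier no longer has to infer numClasses from the data.
{code}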


was (Author: facai):
Yes, you should use `builder.putLong("num_vals", numClasses).putString("type", 
"nominal")`. It's a little hacky, and it might not work. I am not familiar with 
the Metadata and Attribute classes at present. Some experts may have a better 
solution; unfortunately, I have no idea.

Using StringIndexer on your label column should work well; it takes care of 
this itself, I guess.

> RandomForestClassifier doesn't seem to support more than 100 labels
> ---
>
> Key: SPARK-20081
> URL: https://issues.apache.org/jira/browse/SPARK-20081
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
> Environment: Java
>Reporter: Christian Reiniger
>
> When feeding data with more than 100 labels into RandomForestClassifier#fit() 
> (from java code), I get the following error message:
> {code}
> Classifier inferred 143 from label values in column 
> rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) 
> allowed to be inferred from values.  
>   To avoid this error for labels with > 100 classes, specify numClasses 
> explicitly in the metadata; this can be done by applying StringIndexer to the 
> label column.
> {code}
> Setting "numClasses" in the metadata for the label column doesn't make a 
> difference. Looking at the code, this is not surprising, since 
> MetadataUtils.getNumClasses() ignores this setting:
> {code:language=scala}
>   def getNumClasses(labelSchema: StructField): Option[Int] = {
> Attribute.fromStructField(labelSchema) match {
>   case binAttr: BinaryAttribute => Some(2)
>   case nomAttr: NominalAttribute => nomAttr.getNumValues
>   case _: NumericAttribute | UnresolvedAttribute => None
> }
>   }
> {code}
> The alternative would be to pass a proper "maxNumClasses" parameter to the 
> classifier, so that Classifier#getNumClasses() allows a larger number of 
> auto-detected labels. However, RandomForestClassifier#train() calls 
> #getNumClasses without the "maxNumClasses" parameter, causing it to use the 
> default of 100:
> {code:language=scala}
>   override protected def train(dataset: Dataset[_]): 
> RandomForestClassificationModel = {
> val categoricalFeatures: Map[Int, Int] =
>   MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
> val numClasses: Int = getNumClasses(dataset)
> // ...
> {code}
> My scala skills are pretty sketchy, so please correct me if I misinterpreted 
> something. But as it seems right now, there is no way to learn from data with 
> more than 100 labels via RandomForestClassifier.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels

2017-04-13 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967347#comment-15967347
 ] 

Yan Facai (颜发才) edited comment on SPARK-20081 at 4/13/17 9:40 AM:
--

Yes, you should use `builder.putLong("num_vals", numClasses).putString("type", 
"nominal")`. It's a little hacky, and it might not work. I am not familiar with 
the Metadata and Attribute classes at present. Some experts may have a better 
solution; unfortunately, I have no idea.

Using StringIndexer on your label column should work well; it takes care of 
this itself, I guess.


was (Author: facai):
Yes, you should use `builder.putLong("num_vals", numClasses).putString("type", 
"nominal")`. It's a little hacky, and it might not work. I am not familiar with 
the Metadata and Attribute classes at present. Some experts may have a better 
solution; unfortunately, I have no idea.

Using StringIndexer on your label column should work well; it takes care of 
this itself, I guess.

> RandomForestClassifier doesn't seem to support more than 100 labels
> ---
>
> Key: SPARK-20081
> URL: https://issues.apache.org/jira/browse/SPARK-20081
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
> Environment: Java
>Reporter: Christian Reiniger
>
> When feeding data with more than 100 labels into RandomForestClassifier#fit() 
> (from java code), I get the following error message:
> {code}
> Classifier inferred 143 from label values in column 
> rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) 
> allowed to be inferred from values.  
>   To avoid this error for labels with > 100 classes, specify numClasses 
> explicitly in the metadata; this can be done by applying StringIndexer to the 
> label column.
> {code}
> Setting "numClasses" in the metadata for the label column doesn't make a 
> difference. Looking at the code, this is not surprising, since 
> MetadataUtils.getNumClasses() ignores this setting:
> {code:language=scala}
>   def getNumClasses(labelSchema: StructField): Option[Int] = {
> Attribute.fromStructField(labelSchema) match {
>   case binAttr: BinaryAttribute => Some(2)
>   case nomAttr: NominalAttribute => nomAttr.getNumValues
>   case _: NumericAttribute | UnresolvedAttribute => None
> }
>   }
> {code}
> The alternative would be to pass a proper "maxNumClasses" parameter to the 
> classifier, so that Classifier#getNumClasses() allows a larger number of 
> auto-detected labels. However, RandomForestClassifier#train() calls 
> #getNumClasses without the "maxNumClasses" parameter, causing it to use the 
> default of 100:
> {code:language=scala}
>   override protected def train(dataset: Dataset[_]): 
> RandomForestClassificationModel = {
> val categoricalFeatures: Map[Int, Int] =
>   MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
> val numClasses: Int = getNumClasses(dataset)
> // ...
> {code}
> My scala skills are pretty sketchy, so please correct me if I misinterpreted 
> something. But as it seems right now, there is no way to learn from data with 
> more than 100 labels via RandomForestClassifier.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels

2017-04-13 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967347#comment-15967347
 ] 

Yan Facai (颜发才) edited comment on SPARK-20081 at 4/13/17 9:40 AM:
--

Yes, you should use `builder.putLong("num_vals", numClasses).putString("type", 
"nominal")`. It's a little hacky, and it might not work. I am not familiar with 
the Metadata and Attribute classes at present. Some experts may have a better 
solution; unfortunately, I have no idea.

Using StringIndexer on your label column should work well; it takes care of 
this itself, I guess.


was (Author: facai):
Yes, you should use `builder.putLong("num_vals", numClasses)`. It's a little 
hacky, and it might not work. I am not familiar with the Metadata and Attribute 
classes at present. Some experts may have a better solution; unfortunately, I 
have no idea.

Using StringIndexer on your label column should work well; it takes care of 
this itself, I guess.

> RandomForestClassifier doesn't seem to support more than 100 labels
> ---
>
> Key: SPARK-20081
> URL: https://issues.apache.org/jira/browse/SPARK-20081
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
> Environment: Java
>Reporter: Christian Reiniger
>
> When feeding data with more than 100 labels into RandomForestClassifier#fit() 
> (from java code), I get the following error message:
> {code}
> Classifier inferred 143 from label values in column 
> rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) 
> allowed to be inferred from values.  
>   To avoid this error for labels with > 100 classes, specify numClasses 
> explicitly in the metadata; this can be done by applying StringIndexer to the 
> label column.
> {code}
> Setting "numClasses" in the metadata for the label column doesn't make a 
> difference. Looking at the code, this is not surprising, since 
> MetadataUtils.getNumClasses() ignores this setting:
> {code:language=scala}
>   def getNumClasses(labelSchema: StructField): Option[Int] = {
> Attribute.fromStructField(labelSchema) match {
>   case binAttr: BinaryAttribute => Some(2)
>   case nomAttr: NominalAttribute => nomAttr.getNumValues
>   case _: NumericAttribute | UnresolvedAttribute => None
> }
>   }
> {code}
> The alternative would be to pass a proper "maxNumClasses" parameter to the 
> classifier, so that Classifier#getNumClasses() allows a larger number of 
> auto-detected labels. However, RandomForestClassifier#train() calls 
> #getNumClasses without the "maxNumClasses" parameter, causing it to use the 
> default of 100:
> {code:language=scala}
>   override protected def train(dataset: Dataset[_]): 
> RandomForestClassificationModel = {
> val categoricalFeatures: Map[Int, Int] =
>   MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
> val numClasses: Int = getNumClasses(dataset)
> // ...
> {code}
> My scala skills are pretty sketchy, so please correct me if I misinterpreted 
> something. But as it seems right now, there is no way to learn from data with 
> more than 100 labels via RandomForestClassifier.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels

2017-04-13 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967347#comment-15967347
 ] 

Yan Facai (颜发才) commented on SPARK-20081:
-

Yes, you should use `builder.putLong("num_vals", numClasses)`. It's a little 
hacky, and it might not work. I am not familiar with the Metadata and Attribute 
classes at present. Some experts may have a better solution; unfortunately, I 
have no idea.

Using StringIndexer on your label column should work well; it takes care of 
this itself, I guess.

> RandomForestClassifier doesn't seem to support more than 100 labels
> ---
>
> Key: SPARK-20081
> URL: https://issues.apache.org/jira/browse/SPARK-20081
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
> Environment: Java
>Reporter: Christian Reiniger
>
> When feeding data with more than 100 labels into RandomForestClassifier#fit() 
> (from java code), I get the following error message:
> {code}
> Classifier inferred 143 from label values in column 
> rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) 
> allowed to be inferred from values.  
>   To avoid this error for labels with > 100 classes, specify numClasses 
> explicitly in the metadata; this can be done by applying StringIndexer to the 
> label column.
> {code}
> Setting "numClasses" in the metadata for the label column doesn't make a 
> difference. Looking at the code, this is not surprising, since 
> MetadataUtils.getNumClasses() ignores this setting:
> {code:language=scala}
>   def getNumClasses(labelSchema: StructField): Option[Int] = {
> Attribute.fromStructField(labelSchema) match {
>   case binAttr: BinaryAttribute => Some(2)
>   case nomAttr: NominalAttribute => nomAttr.getNumValues
>   case _: NumericAttribute | UnresolvedAttribute => None
> }
>   }
> {code}
> The alternative would be to pass a proper "maxNumClasses" parameter to the 
> classifier, so that Classifier#getNumClasses() allows a larger number of 
> auto-detected labels. However, RandomForestClassifier#train() calls 
> #getNumClasses without the "maxNumClasses" parameter, causing it to use the 
> default of 100:
> {code:language=scala}
>   override protected def train(dataset: Dataset[_]): 
> RandomForestClassificationModel = {
> val categoricalFeatures: Map[Int, Int] =
>   MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
> val numClasses: Int = getNumClasses(dataset)
> // ...
> {code}
> My scala skills are pretty sketchy, so please correct me if I misinterpreted 
> something. But as it seems right now, there is no way to learn from data with 
> more than 100 labels via RandomForestClassifier.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels

2017-04-13 Thread Christian Reiniger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967328#comment-15967328
 ] 

Christian Reiniger commented on SPARK-20081:


Thanks for the feedback.

We had to implement indexing ourselves, since StringIndexer only operates on a 
single input column and we have potentially hundreds of columns that need to be 
indexed. Applying StringIndexer sequentially on them would be (and was) a 
performance disaster. We could of course use StringIndexer for the label column 
only, but if we can keep our all-in-one indexer that would be preferable.

I'm also suspicious of whether using StringIndexer would help at all -- it may 
set numClasses in the Metadata, but RandomForestClassifier doesn't seem to even 
look at it (as described above).

On the other hand -- rereading your comment I get the suspicion that I've been 
misled by the error message. Do you mean that "specify numClasses explicitly in 
the metadata" does *not* mean setting the key "numClasses" in the column 
metadata, as in this snippet:

{code:language=java}
MetadataBuilder builder = new MetadataBuilder();
builder.putLong("numClasses", numClasses);   // <--
StructField labelField = DataTypes.createStructField("label", 
DataTypes.DoubleType, true, builder.build());  // <--
StructField featuresField = DataTypes.createStructField("features", new 
VectorUDT(), true);

StructField[] fields = new StructField[] { labelField, featuresField };

StructType schema = DataTypes.createStructType(fields);
{code}

... and that instead the {{nomAttr.getNumValues}} attribute is what is meant? And that 
this value can only be set inside some internal mllib classes?
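For what it's worth, a sketch of building the label field through the public attribute classes instead of a raw "numClasses" key, since the quoted {{getNumClasses}} reads {{nomAttr.getNumValues}} from the column's ML attribute metadata. The value 143 is just the count from the error message above; the rest mirrors the Java snippet.

{code:language=scala}
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Sketch only: attach NominalAttribute metadata so that
// Attribute.fromStructField(labelField) resolves to a NominalAttribute
// and getNumValues returns the class count.
val numClasses = 143  // the count reported in the error message above
val labelMeta = NominalAttribute.defaultAttr
  .withName("label")
  .withNumValues(numClasses)
  .toMetadata()

val schema = StructType(Seq(
  StructField("label", DoubleType, nullable = true, metadata = labelMeta),
  StructField("features", VectorType, nullable = true)
))
{code}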

> RandomForestClassifier doesn't seem to support more than 100 labels
> ---
>
> Key: SPARK-20081
> URL: https://issues.apache.org/jira/browse/SPARK-20081
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
> Environment: Java
>Reporter: Christian Reiniger
>
> When feeding data with more than 100 labels into RandomForestClassifier#fit() 
> (from java code), I get the following error message:
> {code}
> Classifier inferred 143 from label values in column 
> rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) 
> allowed to be inferred from values.  
>   To avoid this error for labels with > 100 classes, specify numClasses 
> explicitly in the metadata; this can be done by applying StringIndexer to the 
> label column.
> {code}
> Setting "numClasses" in the metadata for the label column doesn't make a 
> difference. Looking at the code, this is not surprising, since 
> MetadataUtils.getNumClasses() ignores this setting:
> {code:language=scala}
>   def getNumClasses(labelSchema: StructField): Option[Int] = {
> Attribute.fromStructField(labelSchema) match {
>   case binAttr: BinaryAttribute => Some(2)
>   case nomAttr: NominalAttribute => nomAttr.getNumValues
>   case _: NumericAttribute | UnresolvedAttribute => None
> }
>   }
> {code}
> The alternative would be to pass a proper "maxNumClasses" parameter to the 
> classifier, so that Classifier#getNumClasses() allows a larger number of 
> auto-detected labels. However, RandomForestClassifier#train() calls 
> #getNumClasses without the "maxNumClasses" parameter, causing it to use the 
> default of 100:
> {code:language=scala}
>   override protected def train(dataset: Dataset[_]): 
> RandomForestClassificationModel = {
> val categoricalFeatures: Map[Int, Int] =
>   MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
> val numClasses: Int = getNumClasses(dataset)
> // ...
> {code}
> My scala skills are pretty sketchy, so please correct me if I misinterpreted 
> something. But as it seems right now, there is no way to learn from data with 
> more than 100 labels via RandomForestClassifier.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20319) Already quoted identifiers are getting wrapped with additional quotes

2017-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20319:


Assignee: Apache Spark

> Already quoted identifiers are getting wrapped with additional quotes
> -
>
> Key: SPARK-20319
> URL: https://issues.apache.org/jira/browse/SPARK-20319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Umesh Chaudhary
>Assignee: Apache Spark
>
> The issue was caused by 
> [SPARK-16387|https://issues.apache.org/jira/browse/SPARK-16387] where 
> reserved SQL words are honored by wrapping quotes on column names. 
> In our tests we found that when quotes are explicitly wrapped in column names, 
> the Oracle JDBC driver throws: 
> java.sql.BatchUpdateException: ORA-01741: illegal zero-length identifier 
> at 
> oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:12296)
>  
> at 
> oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:246)
>  
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:597)
>  
> and the Cassandra JDBC driver throws: 
> 17/04/12 19:03:48 ERROR executor.Executor: Exception in task 0.0 in stage 5.0 
> (TID 6)
> java.sql.SQLSyntaxErrorException: [FMWGEN][Cassandra JDBC 
> Driver][Cassandra]syntax error or access rule violation: base table or view 
> not found: 
>   at weblogic.jdbc.cassandrabase.ddcl.b(Unknown Source)
>   at weblogic.jdbc.cassandrabase.ddt.a(Unknown Source)
>   at weblogic.jdbc.cassandrabase.BaseConnection.prepareStatement(Unknown 
> Source)
>   at weblogic.jdbc.cassandrabase.BaseConnection.prepareStatement(Unknown 
> Source)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.insertStatement(JdbcUtils.scala:118)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:571)
> CC: [~rxin] , [~joshrosen]
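
For illustration only (this is a standalone sketch, not the change in the pull request above), the failure comes down to quoting an identifier that is already quoted, and can be avoided by making the quoting step idempotent. The helper name, default quote character, and sample identifiers below are assumptions:

{code:language=scala}
object QuoteSketch {
  // Wrap the identifier in quotes only if it is not already quoted.
  // Re-quoting a quoted name yields ""order"", which Oracle parses as a
  // zero-length identifier and rejects with ORA-01741.
  def quoteIfNeeded(colName: String, quote: String = "\""): String =
    if (colName.length >= 2 && colName.startsWith(quote) && colName.endsWith(quote)) colName
    else quote + colName + quote

  def main(args: Array[String]): Unit = {
    println(quoteIfNeeded("order"))      // not quoted yet  -> "order"
    println(quoteIfNeeded("\"order\""))  // already quoted  -> "order" (unchanged)
  }
}
{code}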



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20319) Already quoted identifiers are getting wrapped with additional quotes

2017-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967318#comment-15967318
 ] 

Apache Spark commented on SPARK-20319:
--

User 'umesh9794' has created a pull request for this issue:
https://github.com/apache/spark/pull/17631

> Already quoted identifiers are getting wrapped with additional quotes
> -
>
> Key: SPARK-20319
> URL: https://issues.apache.org/jira/browse/SPARK-20319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Umesh Chaudhary
>
> The issue was caused by 
> [SPARK-16387|https://issues.apache.org/jira/browse/SPARK-16387], where 
> reserved SQL words are honored by wrapping column names in quotes. 
> In our tests we found that when column names are already explicitly quoted, 
> the Oracle JDBC driver throws: 
> java.sql.BatchUpdateException: ORA-01741: illegal zero-length identifier 
> at 
> oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:12296)
>  
> at 
> oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:246)
>  
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:597)
>  
> and the Cassandra JDBC driver throws: 
> 17/04/12 19:03:48 ERROR executor.Executor: Exception in task 0.0 in stage 5.0 
> (TID 6)
> java.sql.SQLSyntaxErrorException: [FMWGEN][Cassandra JDBC 
> Driver][Cassandra]syntax error or access rule violation: base table or view 
> not found: 
>   at weblogic.jdbc.cassandrabase.ddcl.b(Unknown Source)
>   at weblogic.jdbc.cassandrabase.ddt.a(Unknown Source)
>   at weblogic.jdbc.cassandrabase.BaseConnection.prepareStatement(Unknown 
> Source)
>   at weblogic.jdbc.cassandrabase.BaseConnection.prepareStatement(Unknown 
> Source)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.insertStatement(JdbcUtils.scala:118)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:571)
> CC: [~rxin] , [~joshrosen]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20319) Already quoted identifiers are getting wrapped with additional quotes

2017-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20319:


Assignee: (was: Apache Spark)

> Already quoted identifiers are getting wrapped with additional quotes
> -
>
> Key: SPARK-20319
> URL: https://issues.apache.org/jira/browse/SPARK-20319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Umesh Chaudhary
>
> The issue was caused by 
> [SPARK-16387|https://issues.apache.org/jira/browse/SPARK-16387], where 
> reserved SQL words are honored by wrapping column names in quotes. 
> In our tests we found that when column names are already explicitly quoted, 
> the Oracle JDBC driver throws: 
> java.sql.BatchUpdateException: ORA-01741: illegal zero-length identifier 
> at 
> oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:12296)
>  
> at 
> oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:246)
>  
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:597)
>  
> and the Cassandra JDBC driver throws: 
> 17/04/12 19:03:48 ERROR executor.Executor: Exception in task 0.0 in stage 5.0 
> (TID 6)
> java.sql.SQLSyntaxErrorException: [FMWGEN][Cassandra JDBC 
> Driver][Cassandra]syntax error or access rule violation: base table or view 
> not found: 
>   at weblogic.jdbc.cassandrabase.ddcl.b(Unknown Source)
>   at weblogic.jdbc.cassandrabase.ddt.a(Unknown Source)
>   at weblogic.jdbc.cassandrabase.BaseConnection.prepareStatement(Unknown 
> Source)
>   at weblogic.jdbc.cassandrabase.BaseConnection.prepareStatement(Unknown 
> Source)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.insertStatement(JdbcUtils.scala:118)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:571)
> CC: [~rxin] , [~joshrosen]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable

2017-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20284.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17598
[https://github.com/apache/spark/pull/17598]

> Make SerializationStream and DeserializationStream extend Closeable
> ---
>
> Key: SPARK-20284
> URL: https://issues.apache.org/jira/browse/SPARK-20284
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Sergei Lebedev
>Priority: Trivial
> Fix For: 2.2.0
>
>
> Both {{SerializationStream}} and {{DeserializationStream}} implement 
> {{close}} but do not extend {{Closeable}}. As a result, these streams cannot 
> be used in try-with-resources.
> Was this intentional, or has nobody ever needed it?
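
For context, once the streams extend {{Closeable}} they can be handed to Java's try-with-resources or to any generic resource-management helper instead of requiring a hand-written {{finally}}. A minimal Scala sketch of such a loan-pattern helper (the helper itself is illustrative and not part of Spark):

{code:language=scala}
import java.io.Closeable

object WithResource {
  // Loan pattern: run `body` against the resource, then close it even if
  // `body` throws. Accepts anything whose type extends Closeable.
  def withResource[R <: Closeable, T](resource: R)(body: R => T): T =
    try body(resource) finally resource.close()
}
{code}

With the proposed change, a {{SerializationStream}} returned by {{SerializerInstance.serializeStream(...)}} could be passed to a helper like this directly.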



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable

2017-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20284:
-

Assignee: Sergei Lebedev

> Make SerializationStream and DeserializationStream extend Closeable
> ---
>
> Key: SPARK-20284
> URL: https://issues.apache.org/jira/browse/SPARK-20284
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Sergei Lebedev
>Assignee: Sergei Lebedev
>Priority: Trivial
> Fix For: 2.2.0
>
>
> Both {{SerializationStream}} and {{DeserializationStream}} implement 
> {{close}} but do not extend {{Closeable}}. As a result, these streams cannot 
> be used in try-with-resources.
> Was this intentional, or has nobody ever needed it?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


