[jira] [Resolved] (SPARK-22348) The table cache providing ColumnarBatch should also do partition batch pruning

2017-10-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22348.
-
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.3.0

> The table cache providing ColumnarBatch should also do partition batch pruning
> --
>
> Key: SPARK-22348
> URL: https://issues.apache.org/jira/browse/SPARK-22348
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> We now enable the table cache ({{InMemoryTableScanExec}}) to provide 
> {{ColumnarBatch}} output. However, the cached batches are currently retrieved 
> without pruning; we should still perform partition batch pruning in this path.
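For context, a minimal sketch of the kind of query this targets (spark-shell assumed; the relation and column names below are only illustrative):

{code}
// Cache a relation, materialize the cache, then scan it with a selective
// filter. With per-batch min/max statistics, InMemoryTableScanExec should
// be able to skip cached batches that cannot match the predicate.
val t = spark.range(0, 1000000).selectExpr("id", "id % 100 as key")
t.cache()
t.count()                   // materializes the in-memory columnar cache
t.filter("key = 1").count() // scan that should benefit from batch pruning
{code}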



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22344) Prevent R CMD check from using /tmp

2017-10-24 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22344:
-
Affects Version/s: 2.3.0
   1.6.3
   2.2.0

> Prevent R CMD check from using /tmp
> ---
>
> Key: SPARK-22344
> URL: https://issues.apache.org/jira/browse/SPARK-22344
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0
>Reporter: Shivaram Venkataraman
>
> When R CMD check is run on the SparkR package it leaves behind files in /tmp 
> which is a violation of CRAN policy. We should instead write to Rtmpdir. 
> Notes from CRAN are below
> {code}
> Checking this leaves behind dirs
>hive/$USER
>$USER
> and files named like
>b4f6459b-0624-4100-8358-7aa7afbda757_resources
> in /tmp, in violation of the CRAN Policy.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21616) SparkR 2.3.0 migration guide, release note

2017-10-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218104#comment-16218104
 ] 

Felix Cheung commented on SPARK-21616:
--

True, though I don't know if we are tracking changes to the programming guide 
for the "migration guide" section in that way.

> SparkR 2.3.0 migration guide, release note
> --
>
> Key: SPARK-21616
> URL: https://issues.apache.org/jira/browse/SPARK-21616
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> From looking at the changes since 2.2.0, this/these should be documented in the 
> migration guide / release notes for the 2.3.0 release, as they are behavior 
> changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe

2017-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218061#comment-16218061
 ] 

Apache Spark commented on SPARK-15474:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19571

>  ORC data source fails to write and read back empty dataframe
> -
>
> Key: SPARK-15474
> URL: https://issues.apache.org/jira/browse/SPARK-15474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.1, 2.2.0
>Reporter: Hyukjin Kwon
>
> Currently, the ORC data source fails to write and read back empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).limit(0)
> emptyDf.write
>   .format("orc")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("orc")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws an exception below:
> {code}
> Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> {code}
> Note that this is a different case with the data below
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, no writer is initialised or created (there are no calls to 
> {{WriterContainer.writeRows()}}).
> For Parquet and JSON this works, but for ORC it does not.
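As a side note, a commonly used workaround for the failure above (shown here only as a hedged sketch, reusing {{emptyDf}} and {{path}} from the snippet) is to supply the schema explicitly on read, which skips the schema inference that throws:

{code}
// With an explicit schema there is no inference step, so reading the empty
// ORC output back no longer hits the "Unable to infer schema" error.
val copyEmptyDf = spark.read
  .schema(emptyDf.schema)
  .format("orc")
  .load(path.getCanonicalPath)
copyEmptyDf.show()   // empty result instead of an AnalysisException
{code}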



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22335:


Assignee: (was: Apache Spark)

> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>
> I see that union uses column order for a DataFrame. This to me is "fine" since 
> DataFrames aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it actually 
> gives the wrong result. If you try to access the members by name, it will 
> still use the column order. Here is a reproducible case on 2.2.0:
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] 
> should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a 
> Dataset[AB] should be correctly mapped by type, not by column order
>abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though 
> columns were out of order.
>   abDs.map(_.a).show() // This is correct too
>   baDs.select("a","b").as[AB].union(abDs).show() // This is the same 
> workaround for linked issue, slightly modified.  However this seems wrong 
> since its supposed to be strongly typed
>   
>   baDs.rdd.toDF().as[AB].union(abDs).show()  // This however gives correct 
> result, which is logically inconsistent behavior
>   abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives 
> correct result
> {code}
> So it's inconsistent and a bug IMO.  And I'm not sure that the suggested 
> workaround is really fair, since I'm supposed to be getting values of type 
> `AB`.  More importantly, I think the issue is bigger when you consider that it 
> happens even if you read from parquet (as you would expect).  And it's 
> inconsistent when going to/from an RDD.
> I imagine it's just lazily converting to the typed Dataset instead of doing it 
> initially.  So either that typing could be prioritized to happen before the 
> union, or the union of DataFrames could be done with column order taken into 
> account.  Again, this is speculation..



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22335:


Assignee: Apache Spark

> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>Assignee: Apache Spark
>
> I see that union uses column order for a DataFrame. This to me is "fine" since 
> DataFrames aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it actually 
> gives the wrong result. If you try to access the members by name, it will 
> still use the column order. Here is a reproducible case on 2.2.0:
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] 
> should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a 
> Dataset[AB] should be correctly mapped by type, not by column order
>abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though 
> columns were out of order.
>   abDs.map(_.a).show() // This is correct too
>   baDs.select("a","b").as[AB].union(abDs).show() // This is the same 
> workaround for linked issue, slightly modified.  However this seems wrong 
> since its supposed to be strongly typed
>   
>   baDs.rdd.toDF().as[AB].union(abDs).show()  // This however gives correct 
> result, which is logically inconsistent behavior
>   abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives 
> correct result
> {code}
> So it's inconsistent and a bug IMO.  And I'm not sure that the suggested 
> workaround is really fair, since I'm supposed to be getting values of type 
> `AB`.  More importantly, I think the issue is bigger when you consider that it 
> happens even if you read from parquet (as you would expect).  And it's 
> inconsistent when going to/from an RDD.
> I imagine it's just lazily converting to the typed Dataset instead of doing it 
> initially.  So either that typing could be prioritized to happen before the 
> union, or the union of DataFrames could be done with column order taken into 
> account.  Again, this is speculation..



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218035#comment-16218035
 ] 

Apache Spark commented on SPARK-22335:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/19570

> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>
> I see that union uses column order for a DataFrame. This to me is "fine" since 
> DataFrames aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it actually 
> gives the wrong result. If you try to access the members by name, it will 
> still use the column order. Here is a reproducible case on 2.2.0:
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] 
> should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a 
> Dataset[AB] should be correctly mapped by type, not by column order
>abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though 
> columns were out of order.
>   abDs.map(_.a).show() // This is correct too
>   baDs.select("a","b").as[AB].union(abDs).show() // This is the same 
> workaround for linked issue, slightly modified.  However this seems wrong 
> since its supposed to be strongly typed
>   
>   baDs.rdd.toDF().as[AB].union(abDs).show()  // This however gives correct 
> result, which is logically inconsistent behavior
>   abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives 
> correct result
> {code}
> So it's inconsistent and a bug IMO.  And I'm not sure that the suggested 
> workaround is really fair, since I'm supposed to be getting values of type 
> `AB`.  More importantly, I think the issue is bigger when you consider that it 
> happens even if you read from parquet (as you would expect).  And it's 
> inconsistent when going to/from an RDD.
> I imagine it's just lazily converting to the typed Dataset instead of doing it 
> initially.  So either that typing could be prioritized to happen before the 
> union, or the union of DataFrames could be done with column order taken into 
> account.  Again, this is speculation..



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-24 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218027#comment-16218027
 ] 

Liang-Chi Hsieh commented on SPARK-22335:
-

[~CBribiescas] The column positions in the schema of a Dataset don't 
necessarily match the fields of the typed objects. The documentation of {{as}} 
explains this:

{code}
Returns a new Dataset where each record has been mapped on to the specified 
type. The
method used to map columns depend on the type of `U`:
 - When `U` is a class, fields for the class will be mapped to columns of the 
same name
   (case sensitivity is determined by `spark.sql.caseSensitive`).
{code}

{{unionByName}} resolves columns by name. For typed objects, I think it can 
satisfy this kind of usage.

I remember that this is not the first ticket opened about union behavior on 
Datasets. A first step might be to describe this better in the documentation of 
{{union}}.
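As a hedged sketch of that suggestion (assuming Spark 2.3+, where {{Dataset.unionByName}} is available, and spark-shell implicits in scope):

{code}
case class AB(a: String, b: String)
val abDs = Seq(("aThing", "bThing")).toDF("a", "b").as[AB]
val baDs = Seq(("bThing", "aThing")).toDF("b", "a").as[AB]
// unionByName aligns the columns "a" and "b" by name rather than by position.
abDs.unionByName(baDs).show()
{code}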



> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>
> I see that union uses column order for a DataFrame. This to me is "fine" since 
> DataFrames aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it actually 
> gives the wrong result. If you try to access the members by name, it will 
> still use the column order. Here is a reproducible case on 2.2.0:
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] 
> should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a 
> Dataset[AB] should be correctly mapped by type, not by column order
>abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though 
> columns were out of order.
>   abDs.map(_.a).show() // This is correct too
>   baDs.select("a","b").as[AB].union(abDs).show() // This is the same 
> workaround for linked issue, slightly modified.  However this seems wrong 
> since its supposed to be strongly typed
>   
>   baDs.rdd.toDF().as[AB].union(abDs).show()  // This however gives correct 
> result, which is logically inconsistent behavior
>   abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives 
> correct result
> {code}
> So it's inconsistent and a bug IMO.  And I'm not sure that the suggested 
> workaround is really fair, since I'm supposed to be getting values of type 
> `AB`.  More importantly, I think the issue is bigger when you consider that it 
> happens even if you read from parquet (as you would expect).  And it's 
> inconsistent when going to/from an RDD.
> I imagine it's just lazily converting to the typed Dataset instead of doing it 
> initially.  So either that typing could be prioritized to happen before the 
> union, or the union of DataFrames could be done with column order taken into 
> account.  Again, this is speculation..



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22348) The table cache providing ColumnarBatch should also do partition batch pruning

2017-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22348:


Assignee: Apache Spark

> The table cache providing ColumnarBatch should also do partition batch pruning
> --
>
> Key: SPARK-22348
> URL: https://issues.apache.org/jira/browse/SPARK-22348
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> We now enable the table cache ({{InMemoryTableScanExec}}) to provide 
> {{ColumnarBatch}} output. However, the cached batches are currently retrieved 
> without pruning; we should still perform partition batch pruning in this path.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22348) The table cache providing ColumnarBatch should also do partition batch pruning

2017-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22348:


Assignee: (was: Apache Spark)

> The table cache providing ColumnarBatch should also do partition batch pruning
> --
>
> Key: SPARK-22348
> URL: https://issues.apache.org/jira/browse/SPARK-22348
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Liang-Chi Hsieh
>
> We now enable the table cache ({{InMemoryTableScanExec}}) to provide 
> {{ColumnarBatch}} output. However, the cached batches are currently retrieved 
> without pruning; we should still perform partition batch pruning in this path.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22348) The table cache providing ColumnarBatch should also do partition batch pruning

2017-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217992#comment-16217992
 ] 

Apache Spark commented on SPARK-22348:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/19569

> The table cache providing ColumnarBatch should also do partition batch pruning
> --
>
> Key: SPARK-22348
> URL: https://issues.apache.org/jira/browse/SPARK-22348
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Liang-Chi Hsieh
>
> We now enable the table cache ({{InMemoryTableScanExec}}) to provide 
> {{ColumnarBatch}} output. However, the cached batches are currently retrieved 
> without pruning; we should still perform partition batch pruning in this path.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22348) The table cache providing ColumnarBatch should also do partition batch pruning

2017-10-24 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-22348:
---

 Summary: The table cache providing ColumnarBatch should also do 
partition batch pruning
 Key: SPARK-22348
 URL: https://issues.apache.org/jira/browse/SPARK-22348
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Liang-Chi Hsieh


We now enable the table cache ({{InMemoryTableScanExec}}) to provide 
{{ColumnarBatch}} output. However, the cached batches are currently retrieved 
without pruning; we should still perform partition batch pruning in this path.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.

2017-10-24 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217979#comment-16217979
 ] 

Peng Meng commented on SPARK-22277:
---

For problems 1 and 2, could you please post the test code?
For problem 1, one possible case is that all of the features have the same 
chi-square statistic; in that case the result is correct no matter which 
features are selected.
The code would be helpful for analyzing the problem.

> Chi Square selector garbling Vector content.
> 
>
> Key: SPARK-22277
> URL: https://issues.apache.org/jira/browse/SPARK-22277
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> There is a difference in behavior when the chi-square selector is used versus 
> using the features directly in the decision tree classifier.
> In the code below, I have used the chi-square selector as a pass-through, but 
> the decision tree classifier is unable to process its output. It is, however, 
> able to process the features when they are used directly.
> The example is pulled directly from the Apache Spark Python documentation.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> import sys
> df = spark.createDataFrame([
> (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
> (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
> (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
> "clicked"])
> # ChiSq selector will just be a pass-through. All four features in the input 
> will be in the output also.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>  outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected" % 
> selector.getNumTopFeatures())
> from pyspark.ml.classification import DecisionTreeClassifier
> try:
> # Fails
> dt = 
> DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures")
> model = dt.fit(result)
> except:
> print(sys.exc_info())
> #Works
> dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features")
> model = dt.fit(df)
> 
> # Make predictions. Using same dataset, not splitting!!
> predictions = model.transform(result)
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
> # Select (prediction, true label) and compute test error
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> evaluator = MulticlassClassificationEvaluator(
> labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (, 
> IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but 
> it does not have the number of values specified.', 
> 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t
>  at 
> org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t 
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t 
> at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at 
> org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t
>  at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at 
> org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at 
> sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), )
> (truncated table output of {{predictions.show()}} with columns: prediction, clicked, features)

[jira] [Updated] (SPARK-17074) generate equi-height histogram for column

2017-10-24 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-17074:
-
Affects Version/s: (was: 2.0.0)
   2.3.0

> generate equi-height histogram for column
> -
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.3.0
>Reporter: Ron Hu
>
> An equi-height histogram is effective in handling skewed data distributions.
> In an equi-height histogram, the heights of all bins (intervals) are the same. 
> The default number of bins we use is 254.
> Now we use a two-step method to generate an equi-height histogram:
> 1. use percentile_approx to get percentiles (end points of the equi-height 
> bin intervals);
> 2. use a new aggregate function to get distinct counts in each of these bins.
> Note that this method takes two table scans. In the future we may provide 
> other algorithms which need only one table scan.
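A hedged sketch of the two-step idea above (using {{approxQuantile}} and a plain distinct-count aggregate as stand-ins; the column name and 4-bin setup are only illustrative):

{code}
import org.apache.spark.sql.functions._

val df = spark.range(0, 10000).selectExpr("cast(id % 997 as double) as c")

// Step 1: approximate percentiles as the equi-height bin end points.
val ends = df.stat.approxQuantile("c", Array(0.25, 0.5, 0.75, 1.0), 0.01)

// Step 2: distinct count per bin (the ticket proposes a dedicated aggregate
// function; a filter plus countDistinct stands in for it here).
var lo = Double.NegativeInfinity
for (hi <- ends) {
  val ndv = df.filter(col("c") > lo && col("c") <= hi)
    .agg(countDistinct("c")).first.getLong(0)
  println(s"($lo, $hi] -> ndv = $ndv")
  lo = hi
}
{code}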



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2017-10-24 Thread Nicolas Porter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Porter updated SPARK-22347:
---
Description: 
Here's a simple example on how to reproduce this:

{code}
from pyspark.sql import functions as F, Row, types

def Divide10():
def fn(value): return 10 / int(value)
return F.udf(fn, types.IntegerType())

df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()

x = F.col('x')
df2 = df.select(F.when((x > 0), Divide10()(x)))
df2.show(200)
{code}

This raises a division by zero error, even if `F.when` is trying to filter out 
all cases where `x <= 0`. I believe the correct behavior should be not to 
evaluate the UDF when the `F.when` condition is false.

Interestingly enough, if the `F.when` condition is set to `F.lit(False)`, then 
the error is not raised and all rows resolve to `null`, which is the expected 
result.

  was:
Here's a simple example on how to reproduce this:

{code}
from pyspark.sql import functions as F, Row, types

def Divide10():
def fn(value): return 10 / int(value)
return F.udf(fn, types.IntegerType())

df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()

x = F.col('x')
df2 = df.select(F.when((x > 0), Divide10()(x)))
df2.show(200)
{code}

This raises a division by zero error, even if `F.when` is trying to filter out 
all cases where `x <= 0`. I believe the correct behavior should be not to 
evaluate the UDF when the `F.when` condition is false.

Interestingly enough, if the `F.when` condition is set to `F.lit(False)`, then 
the error is not raised and all rows resolve to `null`.


> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Priority: Minor
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, if the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.
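As a hedged Scala analogue of the report above (spark-shell assumed): since the UDF may still be evaluated for rows where the {{when}} condition is false, a common defensive pattern is to make the UDF itself tolerate the excluded inputs:

{code}
import org.apache.spark.sql.functions._

// The UDF returns None (null) for the inputs the `when` branch is meant to
// exclude, so it is safe even if it is evaluated for those rows.
val divide10 = udf((x: Int) => if (x != 0) Some(10 / x) else None)

val df = Seq(5, 0).toDF("x")
df.select(when(col("x") > 0, divide10(col("x")))).show()
{code}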



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2017-10-24 Thread Nicolas Porter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Porter updated SPARK-22347:
---
Description: 
Here's a simple example on how to reproduce this:

{code}
from pyspark.sql import functions as F, Row, types

def Divide10():
def fn(value): return 10 / int(value)
return F.udf(fn, types.IntegerType())

df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()

x = F.col('x')
df2 = df.select(F.when((x > 0), Divide10()(x)))
df2.show(200)
{code}

This raises a division by zero error, even if `F.when` is trying to filter out 
all cases where `x <= 0`. I believe the correct behavior should be not to 
evaluate the UDF when the `F.when` condition is false.

Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
then the error is not raised and all rows resolve to `null`, which is the 
expected result.

  was:
Here's a simple example on how to reproduce this:

{code}
from pyspark.sql import functions as F, Row, types

def Divide10():
def fn(value): return 10 / int(value)
return F.udf(fn, types.IntegerType())

df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()

x = F.col('x')
df2 = df.select(F.when((x > 0), Divide10()(x)))
df2.show(200)
{code}

This raises a division by zero error, even if `F.when` is trying to filter out 
all cases where `x <= 0`. I believe the correct behavior should be not to 
evaluate the UDF when the `F.when` condition is false.

Interestingly enough, if the `F.when` condition is set to `F.lit(False)`, then 
the error is not raised and all rows resolve to `null`, which is the expected 
result.


> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Priority: Minor
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2017-10-24 Thread Nicolas Porter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Porter updated SPARK-22347:
---
Description: 
Here's a simple example on how to reproduce this:

{code}
from pyspark.sql import functions as F, Row, types

def Divide10():
def fn(value): return 10 / int(value)
return F.udf(fn, types.IntegerType())

df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()

x = F.col('x')
df2 = df.select(F.when((x > 0), Divide10()(x)))
df2.show(200)
{code}

This raises a division by zero error, even if `F.when` is trying to filter out 
all cases where `x <= 0`. I believe the correct behavior should be not to 
evaluate the UDF when the `F.when` condition is false.

Interestingly enough, if the `F.when` condition is set to `F.lit(False)`, then 
the error is not raised and all rows resolve to `null`.

  was:
Here's a simple example on how to reproduce this:

{code}
from pyspark.sql import functions as F, Row, types

def Divide10():
def fn(value): return 10 / int(value)
return F.udf(fn, types.IntegerType())

df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()

x = F.col('x')
df2 = df.select(F.when((x > 0), Divide10()(x)))
df2.show(200)
{code}

This raises a division by zero error, even if `F.when` is trying to filter out 
all cases where `x <= 0`. I believe the correct behavior should be not to 
evaluate the UDF when the `F.when` condition is false.


> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Priority: Minor
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, if the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2017-10-24 Thread Nicolas Porter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Porter updated SPARK-22347:
---
Description: 
Here's a simple example on how to reproduce this:

{code}
from pyspark.sql import functions as F, Row, types

def Divide10():
def fn(value): return 10 / int(value)
return F.udf(fn, types.IntegerType())

df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()

x = F.col('x')
df2 = df.select(F.when((x > 0), Divide10()(x)))
df2.show(200)
{code}

This raises a division by zero error, even if `F.when` is trying to filter out 
all cases where `x <= 0`. I believe the correct behavior should be not to 
evaluate the UDF when the `F.when` condition is false.

  was:
Here's a simple example on how to reproduce this:

{code}
from pyspark.sql import functions as F, Row, types

def Divide10():
def fn(value): return 10 / int(value)
return F.udf(fn, types.IntegerType())

df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()

x = F.col('x')
df2 = df.select(F.when((x > 0), Divide10()(x)))
df2.show(200)
{code}

This raises a division by zero error, even if `F.when` is trying to filter out 
all cases where `x <= 0`. I believe the correct behavior should be not to 
evaluate the UDF since the condition fails.


> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Priority: Minor
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2017-10-24 Thread Nicolas Porter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Porter updated SPARK-22347:
---
Description: 
Here's a simple example on how to reproduce this:

{code}
from pyspark.sql import functions as F, Row, types

def Divide10():
def fn(value): return 10 / int(value)
return F.udf(fn, types.IntegerType())

df = sc.parallelize([Row(x=5),   Row(x=0)]).toDF()

x = F.col('x')
df2 = df.select(F.when((x > 0), Divide10()(x)))
df2.show(200)
{code}

This raises a division by zero error, even if `F.when` is trying to filter out 
all cases where `x <= 0`. I believe the correct behavior should be not to 
evaluate the UDF since the condition fails.

  was:
Here's a simple example on how to reproduce this:

{code:python}
from pyspark.sql import functions as F, Row, types

def Divide10():
def fn(value): return 10 / int(value)
return F.udf(fn, types.IntegerType())

df = sc.parallelize([Row(x=5),   Row(x=0)]).toDF()

x = F.col('x')
df2 = df.select(F.when((x > 0), Divide10()(x)))
df2.show(200)
{code}

This raises a division by zero error, even if `F.when` is trying to filter out 
all cases where `x <= 0`. I believe the correct behavior should be not to 
evaluate the UDF since the condition fails.


> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Priority: Minor
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5),   Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF since the condition fails.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2017-10-24 Thread Nicolas Porter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Porter updated SPARK-22347:
---
Description: 
Here's a simple example on how to reproduce this:

{code}
from pyspark.sql import functions as F, Row, types

def Divide10():
def fn(value): return 10 / int(value)
return F.udf(fn, types.IntegerType())

df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()

x = F.col('x')
df2 = df.select(F.when((x > 0), Divide10()(x)))
df2.show(200)
{code}

This raises a division by zero error, even if `F.when` is trying to filter out 
all cases where `x <= 0`. I believe the correct behavior should be not to 
evaluate the UDF since the condition fails.

  was:
Here's a simple example on how to reproduce this:

{code}
from pyspark.sql import functions as F, Row, types

def Divide10():
def fn(value): return 10 / int(value)
return F.udf(fn, types.IntegerType())

df = sc.parallelize([Row(x=5),   Row(x=0)]).toDF()

x = F.col('x')
df2 = df.select(F.when((x > 0), Divide10()(x)))
df2.show(200)
{code}

This raises a division by zero error, even if `F.when` is trying to filter out 
all cases where `x <= 0`. I believe the correct behavior should be not to 
evaluate the UDF since the condition fails.


> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Priority: Minor
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF since the condition fails.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2017-10-24 Thread Nicolas Porter (JIRA)
Nicolas Porter created SPARK-22347:
--

 Summary: UDF is evaluated when 'F.when' condition is false
 Key: SPARK-22347
 URL: https://issues.apache.org/jira/browse/SPARK-22347
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Nicolas Porter
Priority: Minor


Here's a simple example on how to reproduce this:

{code:python}
from pyspark.sql import functions as F, Row, types

def Divide10():
def fn(value): return 10 / int(value)
return F.udf(fn, types.IntegerType())

df = sc.parallelize([Row(x=5),   Row(x=0)]).toDF()

x = F.col('x')
df2 = df.select(F.when((x > 0), Divide10()(x)))
df2.show(200)
{code}

This raises a division by zero error, even if `F.when` is trying to filter out 
all cases where `x <= 0`. I believe the correct behavior should be not to 
evaluate the UDF since the condition fails.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

2017-10-24 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22346:

Description: 
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, 
[SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The Estimator 
can "learn" the size of the inputs vectors during training and save it to use 
during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler
* Current Attributes implementation is also causing other issues, eg 
[SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the size of the vector in the column 
metadata. This change would be to deprecate APIs that allow creating a vector 
column of unknown length and replace those APIs with equivalents that enforce a 
fixed size.

Pros:
* We already treat vectors as fixed size, for example VectorAssembler assumes 
the input and output columns are fixed-size vectors and creates metadata 
accordingly. In the spirit of explicit is better than implicit, we would be 
codifying something we already assume.
* This could potentially enable performance optimizations that are only 
possible if the Vector size of a column is fixed & known.

Cons:
* This would require breaking changes.




  was:
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs 
vectors during training and save it to use during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler
* Current Attributes implementation is also causing other issues, eg 
[SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

2017-10-24 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22346:

Description: 
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs 
vectors during training and save it to use during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler
* Current Attributes implementation is also causing other issues, eg 
[SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the size of the vector in the column 
metadata. This change would be to deprecate APIs that allow creating a vector 
column of unknown length and replace those APIs with equivalents that enforce a 
fixed size.

Pros:
* We already treat vectors as fixed size, for example VectorAssembler assumes 
the input and output columns are fixed-size vectors and creates metadata 
accordingly. In the spirit of explicit is better than implicit, we would be 
codifying something we already assume.
* This could potentially enable performance optimizations that are only 
possible if the Vector size of a column is fixed & known.

Cons:
* This would require breaking changes.




  was:
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs 
vectors during training and save it to use during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler
* Current Attributes implementation is also causing other issues, eg 
[#SPARK-19141].

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

2017-10-24 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22346:

Description: 
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input 
vectors during training and save it to use during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler
* Current Attributes implementation is also causing other issues, eg 
[#SPARK-19141].

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the size of the vector in the column 
metadata. This change would be to deprecate APIs that allow creating a vector 
column of unknown length and replace those APIs with equivalents that enforce a 
fixed size.

Pros:
* We already treat vectors as fixed size, for example VectorAssembler assumes 
the input and output cols are fixed size vectors and creates metadata 
accordingly. In the spirit of explicit is better than implicit, we would be 
codifying something we already assume.
* This could potentially enable performance optimizations that are only 
possible if the Vector size of a column is fixed & known.

Cons:
* This would require breaking changes.




  was:
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input 
vectors during training and save it to use during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler
* Current Attributes implementation is also causing other issues.

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

3) Require Vector columns to have 

[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

2017-10-24 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22346:

Description: 
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input 
vectors during training and save it to use during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler
* Current Attributes implementation is also causing other issues.

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the size of the vector in the column 
metadata. This change would be to deprecate APIs that allow creating a vector 
column of unknown length and replace those APIs with equivalents that enforce a 
fixed size.

Pros:
* We already treat vectors as fixed size, for example VectorAssembler assumes 
the input and output cols are fixed size vectors and creates metadata 
accordingly. In the spirit of explicit is better than implicit, we would be 
codifying something we already assume.
* This could potentially enable performance optimizations that are only 
possible if the Vector size of a column is fixed & known.

Cons:
* This would require breaking changes.




  was:
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input 
vectors during training and save it to use during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the 

[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

2017-10-24 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22346:

Description: 
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input 
vectors during training and save it to use during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the size of the vector in the column 
metadata. This change would be to deprecate APIs that allow creating a vector 
column of unknown length and replace those APIs with equivalents that enforce a 
fixed size.

Pros:
* We already treat vectors as fixed size, for example VectorAssembler assumes 
the input and output cols are fixed size vectors and creates metadata 
accordingly. In the spirit of explicit is better than implicit, we would be 
codifying something we already assume.
* This could potentially enable performance optimizations that are only 
possible if the Vector size of a column is fixed & known.

Cons:
* This would require breaking changes.




  was:
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input 
vectors during training and save it to use during prediction.
Pros:
* Possibly simplest of the potential fixes
Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.
Pros:
* Potentially, easy short term fix for VectorAssembler
Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the size of the vector in the column 
metadata. This change would be to 

[jira] [Created] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

2017-10-24 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-22346:
---

 Summary: Update VectorAssembler to work with StreamingDataframes
 Key: SPARK-22346
 URL: https://issues.apache.org/jira/browse/SPARK-22346
 Project: Spark
  Issue Type: Improvement
  Components: ML, Structured Streaming
Affects Versions: 2.2.0
Reporter: Bago Amirbekian
Priority: Critical


The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input 
vectors during training and save it to use during prediction.
Pros:
* Possibly simplest of the potential fixes
Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.
Pros:
* Potentially, easy short term fix for VectorAssembler
Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the size of the vector in the column 
metadata. This change would be to deprecate APIs that allow creating a vector 
column of unknown length and replace those APIs with equivalents that enforce a 
fixed size.
Pros:
* We already treat vectors as fixed size, for example VectorAssembler assumes 
the input and output cols are fixed size vectors and creates metadata 
accordingly. In the spirit of explicit is better than implicit, we would be 
codifying something we already assume.
* This could potentially enable performance optimizations that are only 
possible if the Vector size of a column is fixed & known.
Cons:
* This would require breaking changes.
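
To make option 3 above concrete, here is a minimal Scala sketch (an illustration only, not an API proposed in this ticket; the helper names are made up) of how a fixed vector size can be attached to, and read back from, a vector column's ML attribute metadata. A transformer that can rely on such schema-level metadata would not need to look at the rows themselves, which is presumably what breaks on a streaming DataFrame.

{code}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: record a fixed vector size for `column` in its ML
// attribute metadata, so downstream stages can learn the size from the
// schema alone rather than from the data.
def withVectorSize(df: DataFrame, column: String, size: Int): DataFrame = {
  val meta = new AttributeGroup(column, size).toMetadata()
  df.withColumn(column, col(column).as(column, meta))
}

// Reading it back: AttributeGroup.size is -1 when the size is unknown.
def vectorSize(df: DataFrame, column: String): Int =
  AttributeGroup.fromStructField(df.schema(column)).size
{code}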






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22340) pyspark setJobGroup doesn't match java threads

2017-10-24 Thread Leif Walsh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217835#comment-16217835
 ] 

Leif Walsh commented on SPARK-22340:


Ok, this is fairly straightforward.  The problem is that from the Python side, 
{{setJobGroup}} isn't thread-local, it's global.  Here is a tight reproducer:

{noformat}
import concurrent.futures
import threading
import time
executor = concurrent.futures.ThreadPoolExecutor()
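
# (Assumed context: `sc` is the SparkContext of the interactive session, and
#  get_job_group() is a small helper, not shown here, that returns the job
#  group currently set on sc.)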

latch_1 = threading.Event()
latch_2 = threading.Event()

def wait(x):
time.sleep(x)
return x

def multiple_job_groups():
sc.setJobGroup('imajobgroup', 'helloitme')
groups = []
groups.append(get_job_group())
sc.parallelize([1, 1]).map(wait).collect()
latch_2.set()
latch_1.wait()
groups.append(get_job_group())
sc.parallelize([1, 1]).map(wait).collect()
groups.append(get_job_group())
return groups

def another_job_group():
latch_2.wait()
sc.setJobGroup('another', 'itnotme')
sc.parallelize([1, 1]).map(wait).collect()
latch_1.set()

future_1 = executor.submit(multiple_job_groups)
future_2 = executor.submit(another_job_group)
future_1.result()
{noformat}

The result is that {{another_job_group}} modifies the local property in between 
the first and second executions of {{multiple_job_groups}}'s jobs, and we get 
this result:

{noformat}
['imajobgroup', 'another', 'another']
{noformat}

I think I can "solve" this by wrapping {{SparkContext}} with a lock (to 
sequence the execution of {{setJobGroup}}) and something in py4j that will 
release the lock during JVM execution, which feels Very Dangerous.

Would greatly appreciate it if we could do something to really solve this 
inside pyspark, but will attempt the Dangerous on my side for now.

> pyspark setJobGroup doesn't match java threads
> --
>
> Key: SPARK-22340
> URL: https://issues.apache.org/jira/browse/SPARK-22340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: Leif Walsh
>
> With pyspark, {{sc.setJobGroup}}'s documentation says
> {quote}
> Assigns a group ID to all the jobs started by this thread until the group ID 
> is set to a different value or cleared.
> {quote}
> However, this doesn't appear to be associated with Python threads, only with 
> Java threads.  As such, a Python thread which calls this and then submits 
> multiple jobs doesn't necessarily get its jobs associated with any particular 
> spark job group.  For example:
> {code}
> def run_jobs():
> sc.setJobGroup('hello', 'hello jobs')
> x = sc.range(100).sum()
> y = sc.range(1000).sum()
> return x, y
> import concurrent.futures
> with concurrent.futures.ThreadPoolExecutor() as executor:
> future = executor.submit(run_jobs)
> sc.cancelJobGroup('hello')
> future.result()
> {code}
> In this example, depending how the action calls on the Python side are 
> allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be 
> assigned the job group {{hello}}.
> First, we should clarify the docs if this truly is the case.
> Second, it would be really helpful if we could make the job group assignment 
> reliable for a Python thread, though I’m not sure the best way to do this.  
> As it stands, job groups are pretty useless from the pyspark side, if we 
> can't rely on this fact.
> My only idea so far is to mimic the TLS behavior on the Python side and then 
> patch every point where job submission may take place to pass that in, but 
> this feels pretty brittle. In my experience with py4j, controlling threading 
> there is a challenge. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22345) Sort-merge join generates incorrect code for CodegenFallback filter conditions

2017-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217772#comment-16217772
 ] 

Apache Spark commented on SPARK-22345:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/19568

> Sort-merge join generates incorrect code for CodegenFallback filter conditions
> --
>
> Key: SPARK-22345
> URL: https://issues.apache.org/jira/browse/SPARK-22345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>
> I have a job that is producing incorrect results from a sort-merge join with 
> a filter on an expression with a Hive UDF. The results are correct when 
> codegen is turned off.
> I tracked the problem to the evaluation of the Hive UDF, here:
> {code:lang=java}
> Object smj_obj1 = ((Expression) references[2]).eval(smj_rightRow1);
> {code}
> The UDF references columns from both left and right, so this was clearly 
> generated incorrectly. I think it is that the expression used for Hive UDFs 
> is a CodegenFallback, which uses eval instead of local variables. The 
> non-codegen implementations pass a joined row into eval and creating a joined 
> row in codegen fixes the problem. I'll post a PR in a minute.
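
For readers unfamiliar with the internals, the behaviour the fix has to reproduce is roughly the sketch below (Catalyst internal classes; the helper name is made up): a CodegenFallback condition must be evaluated against a row that joins both sides, since the expression may reference columns from either.

{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, JoinedRow}

// Evaluate an interpreted (fallback) join condition the way the non-codegen
// path does: against the joined left+right row, not a single side.
def evalFallbackCondition(cond: Expression, left: InternalRow, right: InternalRow): Boolean = {
  val result = cond.eval(new JoinedRow(left, right))
  result != null && result.asInstanceOf[Boolean]
}
{code}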



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22345) Sort-merge join generates incorrect code for CodegenFallback filter conditions

2017-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22345:


Assignee: Apache Spark

> Sort-merge join generates incorrect code for CodegenFallback filter conditions
> --
>
> Key: SPARK-22345
> URL: https://issues.apache.org/jira/browse/SPARK-22345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>Assignee: Apache Spark
>
> I have a job that is producing incorrect results from a sort-merge join with 
> a filter on an expression with a Hive UDF. The results are correct when 
> codegen is turned off.
> I tracked the problem to the evaluation of the Hive UDF, here:
> {code:lang=java}
> Object smj_obj1 = ((Expression) references[2]).eval(smj_rightRow1);
> {code}
> The UDF references columns from both left and right, so this was clearly 
> generated incorrectly. I think it is that the expression used for Hive UDFs 
> is a CodegenFallback, which uses eval instead of local variables. The 
> non-codegen implementations pass a joined row into eval and creating a joined 
> row in codegen fixes the problem. I'll post a PR in a minute.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22345) Sort-merge join generates incorrect code for CodegenFallback filter conditions

2017-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22345:


Assignee: (was: Apache Spark)

> Sort-merge join generates incorrect code for CodegenFallback filter conditions
> --
>
> Key: SPARK-22345
> URL: https://issues.apache.org/jira/browse/SPARK-22345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>
> I have a job that is producing incorrect results from a sort-merge join with 
> a filter on an expression with a Hive UDF. The results are correct when 
> codegen is turned off.
> I tracked the problem to the evaluation of the Hive UDF, here:
> {code:lang=java}
> Object smj_obj1 = ((Expression) references[2]).eval(smj_rightRow1);
> {code}
> The UDF references columns from both left and right, so this was clearly 
> generated incorrectly. I think it is that the expression used for Hive UDFs 
> is a CodegenFallback, which uses eval instead of local variables. The 
> non-codegen implementations pass a joined row into eval and creating a joined 
> row in codegen fixes the problem. I'll post a PR in a minute.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22345) Sort-merge join generates incorrect code for CodegenFallback filter conditions

2017-10-24 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-22345:
--
Description: 
I have a job that is producing incorrect results from a sort-merge join with a 
filter on an expression with a Hive UDF. The results are correct when codegen is 
turned off.

I tracked the problem to the evaluation of the Hive UDF, here:

{code:lang=java}
Object smj_obj1 = ((Expression) references[2]).eval(smj_rightRow1);
{code}

The UDF references columns from both left and right, so this was clearly 
generated incorrectly. I think it is that the expression used for Hive UDFs is 
a CodegenFallback, which uses eval instead of local variables. The non-codegen 
implementations pass a joined row into eval and creating a joined row in 
codegen fixes the problem. I'll post a PR in a minute.

  was:
I have a job that is producing incorrect results from a sort-merge join with a 
filter on an expression with a Hive UDF. The results are correct when codegen is 
turned off.

I tracked the problem to the evaluation of the Hive UDF, here:

```
Object smj_obj1 = ((Expression) references[2]).eval(smj_rightRow1);
```

The UDF references columns from both left and right, so this was clearly 
generated incorrectly. I think it is that the expression used for Hive UDFs is 
a CodegenFallback, which uses eval instead of local variables. The non-codegen 
implementations pass a joined row into eval and creating a joined row in 
codegen fixes the problem. I'll post a PR in a minute.


> Sort-merge join generates incorrect code for CodegenFallback filter conditions
> --
>
> Key: SPARK-22345
> URL: https://issues.apache.org/jira/browse/SPARK-22345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>
> I have a job that is producing incorrect results from a sort-merge join with 
> a filter on an expression with a Hive UDF. The results are correct when 
> codegen is turned off.
> I tracked the problem to the evaluation of the Hive UDF, here:
> {code:lang=java}
> Object smj_obj1 = ((Expression) references[2]).eval(smj_rightRow1);
> {code}
> The UDF references columns from both left and right, so this was clearly 
> generated incorrectly. I think it is that the expression used for Hive UDFs 
> is a CodegenFallback, which uses eval instead of local variables. The 
> non-codegen implementations pass a joined row into eval and creating a joined 
> row in codegen fixes the problem. I'll post a PR in a minute.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22316) Cannot Select ReducedAggregator Column

2017-10-24 Thread Russell Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217666#comment-16217666
 ] 

Russell Spitzer commented on SPARK-22316:
-

[~hvanhovell] This was the ticket I told you about :)

> Cannot Select ReducedAggregator Column
> --
>
> Key: SPARK-22316
> URL: https://issues.apache.org/jira/browse/SPARK-22316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Russell Spitzer
>Priority: Minor
>
> Given a dataset which has been run through reduceGroups like this
> {code}
> case class Person(name: String, age: Int)
> case class Customer(id: Int, person: Person)
> val ds = spark.createDataset(Seq(Customer(1,Person("russ", 85))))
> val grouped = ds.groupByKey(c => c.id).reduceGroups( (x,y) => x )
> {code}
> We end up with a Dataset with the schema
> {code}
>  org.apache.spark.sql.types.StructType = 
> StructType(
>   StructField(value,IntegerType,false), 
>   StructField(ReduceAggregator(Customer),
> StructType(StructField(id,IntegerType,false),
> StructField(person,
>   StructType(StructField(name,StringType,true),
>   StructField(age,IntegerType,false))
>,true))
> ,true))
> {code}
> The column names are 
> {code}
> Array(value, ReduceAggregator(Customer))
> {code}
> But you cannot select the "ReduceAggregatorColumn"
> {code}
> grouped.select(grouped.columns(1))
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '`ReduceAggregator(Customer)`' given input columns: [value, 
> ReduceAggregator(Customer)];;
> 'Project ['ReduceAggregator(Customer)]
> +- Aggregate [value#338], [value#338, 
> reduceaggregator(org.apache.spark.sql.expressions.ReduceAggregator@5ada573, 
> Some(newInstance(class Customer)), Some(class Customer), 
> Some(StructType(StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true))), input[0, scala.Tuple2, true]._1 
> AS value#340, if ((isnull(input[0, scala.Tuple2, true]._2) || None.equals)) 
> null else named_struct(id, assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).id AS id#195, person, if 
> (isnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]._2)).person).age) AS person#196) AS _2#341, newInstance(class 
> scala.Tuple2), assertnotnull(assertnotnull(input[0, Customer, true])).id AS 
> id#195, if (isnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person)) null else named_struct(name, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).name, true), age, 
> assertnotnull(assertnotnull(assertnotnull(input[0, Customer, 
> true])).person).age) AS person#196, StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true), true, 0, 0) AS 
> ReduceAggregator(Customer)#346]
>+- AppendColumns , class Customer, 
> [StructField(id,IntegerType,false), 
> StructField(person,StructType(StructField(name,StringType,true), 
> StructField(age,IntegerType,false)),true)], newInstance(class Customer), 
> [input[0, int, false] AS value#338]
>   +- LocalRelation [id#197, person#198]
> {code}
> You can work around this by using "toDF" to rename the column
> {code}
> scala> grouped.toDF("key", "reduced").select("reduced")
> res55: org.apache.spark.sql.DataFrame = [reduced: struct<id: int, person: struct<name: string, age: int>>]
> {code}
> I think that all invocations of 
> {code}
> ds.select(ds.columns(i))
> {code}
> for all valid i (less than the number of columns) should work.
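
A slightly more general form of the toDF workaround above, for anyone hitting this with other generated column names (a sketch only; it sidesteps the resolution problem rather than fixing it):

{code}
import org.apache.spark.sql.{DataFrame, Dataset}

// Rename every column to a predictable name (c0, c1, ...) so that
// select-by-name cannot trip over generated names such as
// "ReduceAggregator(Customer)".
def withPositionalNames(ds: Dataset[_]): DataFrame =
  ds.toDF(ds.columns.indices.map(i => s"c$i"): _*)

// withPositionalNames(grouped).select("c1")  // the reduced column
{code}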



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22345) Sort-merge join generates incorrect code for CodegenFallback filter conditions

2017-10-24 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-22345:
-

 Summary: Sort-merge join generates incorrect code for 
CodegenFallback filter conditions
 Key: SPARK-22345
 URL: https://issues.apache.org/jira/browse/SPARK-22345
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
Reporter: Ryan Blue


I have a job that is producing incorrect results from a sort-merge join with a 
filter on an expression with a Hive UDF. The results are correct when codegen is 
turned off.

I tracked the problem to the evaluation of the Hive UDF, here:

```
Object smj_obj1 = ((Expression) references[2]).eval(smj_rightRow1);
```

The UDF references columns from both left and right, so this was clearly 
generated incorrectly. I think it is that the expression used for Hive UDFs is 
a CodegenFallback, which uses eval instead of local variables. The non-codegen 
implementations pass a joined row into eval and creating a joined row in 
codegen fixes the problem. I'll post a PR in a minute.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22340) pyspark setJobGroup doesn't match java threads

2017-10-24 Thread Leif Walsh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217535#comment-16217535
 ] 

Leif Walsh commented on SPARK-22340:


This is less spooky than I initially thought, I will explain later tonight.

> pyspark setJobGroup doesn't match java threads
> --
>
> Key: SPARK-22340
> URL: https://issues.apache.org/jira/browse/SPARK-22340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: Leif Walsh
>
> With pyspark, {{sc.setJobGroup}}'s documentation says
> {quote}
> Assigns a group ID to all the jobs started by this thread until the group ID 
> is set to a different value or cleared.
> {quote}
> However, this doesn't appear to be associated with Python threads, only with 
> Java threads.  As such, a Python thread which calls this and then submits 
> multiple jobs doesn't necessarily get its jobs associated with any particular 
> spark job group.  For example:
> {code}
> def run_jobs():
> sc.setJobGroup('hello', 'hello jobs')
> x = sc.range(100).sum()
> y = sc.range(1000).sum()
> return x, y
> import concurrent.futures
> with concurrent.futures.ThreadPoolExecutor() as executor:
> future = executor.submit(run_jobs)
> sc.cancelJobGroup('hello')
> future.result()
> {code}
> In this example, depending how the action calls on the Python side are 
> allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be 
> assigned the job group {{hello}}.
> First, we should clarify the docs if this truly is the case.
> Second, it would be really helpful if we could make the job group assignment 
> reliable for a Python thread, though I’m not sure the best way to do this.  
> As it stands, job groups are pretty useless from the pyspark side, if we 
> can't rely on this fact.
> My only idea so far is to mimic the TLS behavior on the Python side and then 
> patch every point where job submission may take place to pass that in, but 
> this feels pretty brittle. In my experience with py4j, controlling threading 
> there is a challenge. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-24 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217532#comment-16217532
 ] 

Dongjoon Hyun commented on SPARK-22335:
---

To be clear, I have no objection on your idea here now.

> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>
> I see union uses column order for a DF. This to me is "fine" since they 
> aren't typed.
> However, for a dataset which is supposed to be strongly typed it is actually 
> giving the wrong result. If you try to access the members by name, it will 
> use the order. Here is a reproducible case on 2.2.0.
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] 
> should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a 
> Dataset[AB] should be correctly mapped by type, not by column order
>abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though 
> columns were out of order.
>   abDs.map(_.a).show() // This is correct too
>   baDs.select("a","b").as[AB].union(abDs).show() // This is the same 
> workaround for linked issue, slightly modified.  However this seems wrong 
> since its supposed to be strongly typed
>   
>   baDs.rdd.toDF().as[AB].union(abDs).show()  // This however gives correct 
> result, which is logically inconsistent behavior
>   abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives 
> correct result
> {code}
> So it's inconsistent and a bug IMO.  And I'm not sure that the suggested 
> workaround is really fair, since I'm supposed to be getting something of type 
> `AB`.  More importantly, I think the issue is bigger when you consider that it 
> happens even if you read from parquet (as you would expect), and that it's 
> inconsistent when going to/from rdd.
> I imagine it's just lazily converting to the typed DS instead of doing so 
> initially.  So either that typing could be prioritized to happen before the 
> union, or unioning of DFs could be done with column order taken into account.  
> Again, this is speculation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-24 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217524#comment-16217524
 ] 

Dongjoon Hyun commented on SPARK-22335:
---

Hm. I see your point. What I meant was the following behavior in DataFrame. What 
you're asking for is 'name' matching; "a" to "a" and "b" to "b".
{code}
scala> val abDf = sc.parallelize(List(("aThing",0))).toDF("a", "b")
scala> val baDf = sc.parallelize(List((0,"aThing"))).toDF("b", "a")
scala> abDf.union(baDf).show
+--+--+
| a| b|
+--+--+
|aThing| 0|
| 0|aThing|
+--+--+
{code}

> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>
> I see union uses column order for a DF. This to me is "fine" since they 
> aren't typed.
> However, for a dataset which is supposed to be strongly typed it is actually 
> giving the wrong result. If you try to access the members by name, it will 
> use the order. Here is a reproducible case on 2.2.0.
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] 
> should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a 
> Dataset[AB] should be correctly mapped by type, not by column order
>abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though 
> columns were out of order.
>   abDs.map(_.a).show() // This is correct too
>   baDs.select("a","b").as[AB].union(abDs).show() // This is the same 
> workaround for linked issue, slightly modified.  However this seems wrong 
> since its supposed to be strongly typed
>   
>   baDs.rdd.toDF().as[AB].union(abDs).show()  // This however gives correct 
> result, which is logically inconsistent behavior
>   abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives 
> correct result
> {code}
> So it's inconsistent and a bug IMO.  And I'm not sure that the suggested 
> workaround is really fair, since I'm supposed to be getting something of type 
> `AB`.  More importantly, I think the issue is bigger when you consider that it 
> happens even if you read from parquet (as you would expect), and that it's 
> inconsistent when going to/from rdd.
> I imagine it's just lazily converting to the typed DS instead of doing so 
> initially.  So either that typing could be prioritized to happen before the 
> union, or unioning of DFs could be done with column order taken into account.  
> Again, this is speculation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-24 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217497#comment-16217497
 ] 

Carlos Bribiescas commented on SPARK-22335:
---

I'm not sure I understand what you're asking.  I do agree that DS should be 
consistent with DF when possible, but in this case the more specific 
functionality (typing) doesn't apply to DF.  Again, sorry if I didn't answer 
your question; I didn't quite get what you were asking.  Can you clarify?

Here is another example that maybe helps.

{code:java}

  case class AB(a : String, b : Int)

  val abDs = sc.parallelize(List(("aThing",0))).toDF("a", "b").as[AB]
  val baDs = sc.parallelize(List((0,"aThing"))).toDF("b", "a").as[AB]
  
  abDs.show() // works
  baDs.show() // works
  
  abDs.union(baDs).show() // Real error to do with types
  abDs.rdd.union(baDs.rdd).toDF().as[AB].show() // Works which is inconsistent 
with last statement IMO
{code}
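
Until the typed union lines columns up by field name, a small helper along these lines (a sketch; the name is made up, and it merely works around the positional behaviour) can reorder the right-hand side to the left-hand side's column order before unioning:

{code}
import org.apache.spark.sql.{Dataset, Encoder}

// Align `right`'s columns to `left`'s column order by name, then union.
// This is the select(...).as[AB] workaround from the description, wrapped
// so it works for any case class T with an Encoder.
def unionMatchingColumns[T: Encoder](left: Dataset[T], right: Dataset[T]): Dataset[T] = {
  val aligned = right.select(left.columns.head, left.columns.tail: _*).as[T]
  left.union(aligned)
}

// unionMatchingColumns(abDs, baDs).show()  // rows now line up with AB's fields
{code}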


> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>
> I see union uses column order for a DF. This to me is "fine" since they 
> aren't typed.
> However, for a dataset which is supposed to be strongly typed it is actually 
> giving the wrong result. If you try to access the members by name, it will 
> use the order. Here is a reproducible case on 2.2.0.
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] 
> should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a 
> Dataset[AB] should be correctly mapped by type, not by column order
>abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though 
> columns were out of order.
>   abDs.map(_.a).show() // This is correct too
>   baDs.select("a","b").as[AB].union(abDs).show() // This is the same 
> workaround for linked issue, slightly modified.  However this seems wrong 
> since its supposed to be strongly typed
>   
>   baDs.rdd.toDF().as[AB].union(abDs).show()  // This however gives correct 
> result, which is logically inconsistent behavior
>   abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives 
> correct result
> {code}
> So it's inconsistent and a bug IMO.  And I'm not sure that the suggested 
> workaround is really fair, since I'm supposed to be getting something of type 
> `AB`.  More importantly, I think the issue is bigger when you consider that it 
> happens even if you read from parquet (as you would expect), and that it's 
> inconsistent when going to/from rdd.
> I imagine it's just lazily converting to the typed DS instead of doing so 
> initially.  So either that typing could be prioritized to happen before the 
> union, or unioning of DFs could be done with column order taken into account.  
> Again, this is speculation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error

2017-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22291:


Assignee: Apache Spark

> Postgresql UUID[] to Cassandra: Conversion Error
> 
>
> Key: SPARK-22291
> URL: https://issues.apache.org/jira/browse/SPARK-22291
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, 
> Cassandra 3
>Reporter: Fabio J. Walter
>Assignee: Apache Spark
>  Labels: patch, postgresql, sql
> Attachments: 
> org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png
>
>
> My job reads data from a PostgreSQL table that has a user_ids column of 
> uuid[] type, so I'm getting the ClassCastException below when I try to save 
> the data to Cassandra.
> However, creating this same table on Cassandra works fine, with user_ids as a 
> list column.
> I can't change the type on the source table, because I'm reading data from a 
> legacy system.
> I've been looking at the point printed in the log, in class 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala
> Stacktrace on Spark:
> {noformat}
> Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to 
> [Ljava.lang.String;
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> 
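
Until the JDBC reader handles non-string array elements, one possible workaround (a sketch; it assumes the cast can be pushed into the source query, and the connection details and table name below are placeholders) is to cast the uuid[] column to text[] on the PostgreSQL side so Spark only ever sees a string array:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Push a server-side cast (uuid[] -> text[]) into the JDBC subquery so that
// the JDBC source only has to convert a string array. "legacy_table" and the
// connection options are placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "(SELECT id, user_ids::text[] AS user_ids FROM legacy_table) AS t")
  .option("user", "dbuser")
  .option("password", "dbpass")
  .load()
{code}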

[jira] [Assigned] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error

2017-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22291:


Assignee: (was: Apache Spark)

> Postgresql UUID[] to Cassandra: Conversion Error
> 
>
> Key: SPARK-22291
> URL: https://issues.apache.org/jira/browse/SPARK-22291
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, 
> Cassandra 3
>Reporter: Fabio J. Walter
>  Labels: patch, postgresql, sql
> Attachments: 
> org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png
>
>
> My job reads data from a PostgreSQL table that has a user_ids column of 
> uuid[] type, so I'm getting the ClassCastException below when I try to save 
> the data to Cassandra.
> However, creating this same table on Cassandra works fine, with user_ids as a 
> list column.
> I can't change the type on the source table, because I'm reading data from a 
> legacy system.
> I've been looking at the point printed in the log, in class 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala
> Stacktrace on Spark:
> {noformat}
> Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to 
> [Ljava.lang.String;
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> 

[jira] [Commented] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error

2017-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217462#comment-16217462
 ] 

Apache Spark commented on SPARK-22291:
--

User 'jmchung' has created a pull request for this issue:
https://github.com/apache/spark/pull/19567

> Postgresql UUID[] to Cassandra: Conversion Error
> 
>
> Key: SPARK-22291
> URL: https://issues.apache.org/jira/browse/SPARK-22291
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, 
> Cassandra 3
>Reporter: Fabio J. Walter
>  Labels: patch, postgresql, sql
> Attachments: 
> org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png
>
>
> My job reads data from a PostgreSQL table that contains a user_ids column of 
> uuid[] type, so I'm getting the error shown in the stacktrace below when I try 
> to save the data to Cassandra.
> However, creating the same table on Cassandra works fine, with user_ids as a 
> list.
> I can't change the type on the source table, because I'm reading data from a 
> legacy system.
> I've been looking at the point printed in the log, in class 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala
> Stacktrace on Spark:
> {noformat}
> Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to 
> [Ljava.lang.String;
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
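
A possible workaround while this conversion is unsupported -- a sketch only, under the assumption that the uuid[] column can be cast to text[] on the PostgreSQL side; the URL, credentials, table and column names below are illustrative, not taken from this ticket:
{code}
import org.apache.spark.sql.SparkSession

object UuidArrayWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("uuid-array-workaround").getOrCreate()

    // Cast uuid[] to text[] inside the dbtable subquery so the JDBC source
    // receives string arrays and never hits the
    // [Ljava.util.UUID; -> [Ljava.lang.String; cast in JdbcUtils.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/mydb")   // illustrative URL
      .option("dbtable", "(SELECT id, user_ids::text[] AS user_ids FROM legacy_table) AS t")
      .option("user", "spark_user")                        // illustrative credentials
      .option("password", "secret")
      .load()

    df.printSchema()
    df.show(5, truncate = false)
  }
}
{code}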

[jira] [Updated] (SPARK-22324) Upgrade Arrow to version 0.8.0

2017-10-24 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-22324:
-
Description: 
Arrow version 0.8.0 is slated for release in early November, but I'd like to 
start discussing to help get all the work that's being done synced up.

Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test envs 
will need to be upgraded as well, which will take a fair amount of work and 
planning.

One topic I'd like to discuss is if pyarrow should be an installation 
requirement for pyspark, i.e. when a user pip installs pyspark, it will also 
install pyarrow.  If not, then is there a minimum version that needs to be 
supported?  We currently have 0.4.1 installed on Jenkins.

There are a number of improvements and cleanups in the current code that can 
happen depending on what we decide (I'll link them all here later, but off the 
top of my head):

* Decimal bug fix and improved support
* Improved internal casting between pyarrow and pandas (can clean up some 
workarounds), this will also verify data bounds if the user specifies a type 
and data overflows.  see 
https://github.com/apache/spark/pull/19459#discussion_r146421804
* Better type checking when converting Spark types to Arrow
* Timestamp conversion to microseconds (for Spark internal format)
* Full support for using validity mask with 'object' types 
https://github.com/apache/spark/pull/18664#discussion_r146567335

  was:
Arrow version 0.8.0 is slated for release in early November, but I'd like to 
start discussing to help get all the work that's being done synced up.

Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test envs 
will need to be upgraded as well, which will take a fair amount of work and 
planning.

One topic I'd like to discuss is if pyarrow should be an installation 
requirement for pyspark, i.e. when a user pip installs pyspark, it will also 
install pyarrow.  If not, then is there a minimum version that needs to be 
supported?  We currently have 0.4.1 installed on Jenkins.

There are a number of improvements and cleanups in the current code that can 
happen depending on what we decide (I'll link them all here later, but off the 
top of my head):

* Decimal bug fix and improved support
* Improved internal casting between pyarrow and pandas (can clean up some 
workarounds), this will also verify data bounds if the user specifies a type 
and data overflows.  see 
https://github.com/apache/spark/pull/19459#discussion_r146421804
* Better type checking when converting Spark types to Arrow
* Timestamp conversion to microseconds (for Spark internal format)


> Upgrade Arrow to version 0.8.0
> --
>
> Key: SPARK-22324
> URL: https://issues.apache.org/jira/browse/SPARK-22324
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>
> Arrow version 0.8.0 is slated for release in early November, but I'd like to 
> start discussing to help get all the work that's being done synced up.
> Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test 
> envs will need to be upgraded as well, which will take a fair amount of work 
> and planning.
> One topic I'd like to discuss is if pyarrow should be an installation 
> requirement for pyspark, i.e. when a user pip installs pyspark, it will also 
> install pyarrow.  If not, then is there a minimum version that needs to be 
> supported?  We currently have 0.4.1 installed on Jenkins.
> There are a number of improvements and cleanups in the current code that can 
> happen depending on what we decide (I'll link them all here later, but off 
> the top of my head):
> * Decimal bug fix and improved support
> * Improved internal casting between pyarrow and pandas (can clean up some 
> workarounds), this will also verify data bounds if the user specifies a type 
> and data overflows.  see 
> https://github.com/apache/spark/pull/19459#discussion_r146421804
> * Better type checking when converting Spark types to Arrow
> * Timestamp conversion to microseconds (for Spark internal format)
> * Full support for using validity mask with 'object' types 
> https://github.com/apache/spark/pull/18664#discussion_r146567335



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21043) Add unionByName API to Dataset

2017-10-24 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217415#comment-16217415
 ] 

Reynold Xin commented on SPARK-21043:
-

Because some people expect union by position too.


> Add unionByName API to Dataset
> --
>
> Key: SPARK-21043
> URL: https://issues.apache.org/jira/browse/SPARK-21043
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Takeshi Yamamuro
> Fix For: 2.3.0
>
>
> It would be useful to add unionByName which resolves columns by name, in 
> addition to the existing union (which resolves by position).
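
A minimal illustration of the difference, assuming a Spark version where unionByName is available (2.3.0 per this ticket); the column names and values are made up:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("union-by-name-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq(("a1", "b1")).toDF("a", "b")
val df2 = Seq(("b2", "a2")).toDF("b", "a")

// union resolves purely by position: column "a" of the result mixes values of "a" and "b".
df1.union(df2).show()

// unionByName resolves by column name, so the values line up as expected.
df1.unionByName(df2).show()
{code}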



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22240) S3 CSV number of partitions incorrectly computed

2017-10-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217398#comment-16217398
 ] 

Steve Loughran commented on SPARK-22240:


No, Spark 2.2 doesn't fix this. I have to explicitly define the schema as the 
list of headers -> String, and then reader setup time drops to 6s. 
{code}
2017-10-24 19:19:31,593 [ScalaTest-main-running-S3ACommitBulkDataSuite] DEBUG 
fs.FsUrlStreamHandlerFactory 
(FsUrlStreamHandlerFactory.java:createURLStreamHandler(107)) - Unknown protocol 
jar, delegating to default implementation
2017-10-24 19:19:36,402 [ScalaTest-main-running-S3ACommitBulkDataSuite] INFO  
commit.S3ACommitBulkDataSuite (Logging.scala:logInfo(54)) - Duration of set up 
initial .csv load = 6,195,721,969 nS
{code}

I'm not sure if this is related to the original bug, though it is potentially 
part of the issue. What I'm seeing is that either schema inference always 
takes place, or taking the first line of a .gz file is enough to force reading 
the entire .gz source file.
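
For reference, a sketch of that explicit-schema approach; the column names below are placeholders for the scene_list.gz headers, not taken from this ticket:
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("csv-explicit-schema").getOrCreate()

// Supplying the schema up front means no inference pass, so the
// non-splittable .gz file is not scanned just to work out column types.
val schema = StructType(Seq(
  StructField("entityId", StringType),
  StructField("acquisitionDate", StringType),
  StructField("cloudCover", StringType)
))

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "false")
  .schema(schema)
  .csv("s3a://landsat-pds/scene_list.gz")
{code}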

> S3 CSV number of partitions incorrectly computed
> 
>
> Key: SPARK-22240
> URL: https://issues.apache.org/jira/browse/SPARK-22240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
>Reporter: Arthur Baudry
>
> Reading CSV out of S3 using S3A protocol does not compute the number of 
> partitions correctly in Spark 2.2.0.
> With Spark 2.2.0 I get only one partition when loading a 14GB file
> {code:java}
> scala> val input = spark.read.format("csv").option("header", 
> "true").option("delimiter", "|").option("multiLine", 
> "true").load("s3a://")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: 
> string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 1
> {code}
> While in Spark 2.0.2 I had:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", 
> "true").option("delimiter", "|").option("multiLine", 
> "true").load("s3a://")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: 
> string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 115
> {code}
> This introduces obvious performance issues in Spark 2.2.0. Maybe there is a 
> property that should be set to have the number of partitions computed 
> correctly.
> I'm aware that the .option("multiline","true") is not supported in Spark 
> 2.0.2; that's not relevant here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12359) Add showString() to DataSet API.

2017-10-24 Thread Alexandre Dupriez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217366#comment-16217366
 ] 

Alexandre Dupriez edited comment on SPARK-12359 at 10/24/17 6:08 PM:
-

Looking at [this 
source|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala]
 of the {{Dataset}} it seems that {{showString}} is still package-private. What 
is the reason not to expose a similar method to the public?

_Note_: the reason I would like to have such a method is to use the 
pretty-formatted output of the {{show}} method without being forced to inject 
it to the standard output.

_Edit_: ok, found the PR for this change which was closed: 
https://github.com/apache/spark/pull/10130


was (Author: hangleton):
Looking at [this 
source|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala]
 of the {{Dataset}} it seems that {{showString}} is still package-private. What 
is the reason not to expose a similar method to the public?

_Note_: the reason I would like to have such a method is to use the 
pretty-formatted output of the {{show}} method without being forced to inject 
it to the standard output.

> Add showString() to DataSet API.
> 
>
> Key: SPARK-12359
> URL: https://issues.apache.org/jira/browse/SPARK-12359
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dilip Biswal
>Priority: Minor
>
> JIRA 12105 exposed showString and its variants as public API. This adds the 
> two APIs into DataSet.
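
In the meantime, one workaround sketch (not an official API, and the helper name is made up) is to capture what show prints by redirecting Console.out:
{code}
import java.io.ByteArrayOutputStream
import org.apache.spark.sql.Dataset

// Captures the output of Dataset.show() as a String, since showString itself
// is package-private. Relies on show() printing through Console.out.
def showAsString[T](ds: Dataset[T], numRows: Int = 20, truncate: Boolean = true): String = {
  val buffer = new ByteArrayOutputStream()
  Console.withOut(buffer) {
    ds.show(numRows, truncate)
  }
  buffer.toString("UTF-8")
}
{code}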



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12359) Add showString() to DataSet API.

2017-10-24 Thread Alexandre Dupriez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217366#comment-16217366
 ] 

Alexandre Dupriez edited comment on SPARK-12359 at 10/24/17 6:01 PM:
-

Looking at [this 
source|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala]
 of the {{Dataset}} it seems that {{showString}} is still package-private. What 
is the reason not to expose a similar method to the public?

_Note_: the reason I would like to have such a method is to use the 
pretty-formatted output of the {{show}} method without being forced to inject 
it to the standard output.


was (Author: hangleton):
Looking at [this 
source|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala]
 of the {{Dataset}} it seems that {{showString}} is still package-private. What 
is the reason not to expose a similar method to the public?

> Add showString() to DataSet API.
> 
>
> Key: SPARK-12359
> URL: https://issues.apache.org/jira/browse/SPARK-12359
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dilip Biswal
>Priority: Minor
>
> JIRA 12105 exposed showString and its variants as public API. This adds the 
> two APIs into DataSet.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off

2017-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22341:


Assignee: Apache Spark

> [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
> --
>
> Key: SPARK-22341
> URL: https://issues.apache.org/jira/browse/SPARK-22341
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Maciej Bryński
>Assignee: Apache Spark
>
> I'm trying to run 2.3.0 (from master) on my yarn cluster.
> The result is:
> {code}
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> Permission denied: user=yarn, access=EXECUTE, 
> inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.(ApplicationMaster.scala:216)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}
> I think the problem exists because I'm not using Yarn impersonation, which 
> means that all jobs on the cluster are run as user yarn.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off

2017-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22341:


Assignee: (was: Apache Spark)

> [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
> --
>
> Key: SPARK-22341
> URL: https://issues.apache.org/jira/browse/SPARK-22341
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Maciej Bryński
>
> I'm trying to run 2.3.0 (from master) on my yarn cluster.
> The result is:
> {code}
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> Permission denied: user=yarn, access=EXECUTE, 
> inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.(ApplicationMaster.scala:216)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}
> I think the problem exists because I'm not using Yarn impersonation, which 
> means that all jobs on the cluster are run as user yarn.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12359) Add showString() to DataSet API.

2017-10-24 Thread Alexandre Dupriez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217366#comment-16217366
 ] 

Alexandre Dupriez commented on SPARK-12359:
---

Looking at [this 
source|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala]
 of the {{Dataset}} it seems that {{showString}} is still package-private. What 
is the reason not to expose a similar method to the public?

> Add showString() to DataSet API.
> 
>
> Key: SPARK-12359
> URL: https://issues.apache.org/jira/browse/SPARK-12359
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dilip Biswal
>Priority: Minor
>
> JIRA 12105 exposed showString and its variants as public API. This adds the 
> two APIs into DataSet.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off

2017-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217368#comment-16217368
 ] 

Apache Spark commented on SPARK-22341:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19566

> [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
> --
>
> Key: SPARK-22341
> URL: https://issues.apache.org/jira/browse/SPARK-22341
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Maciej Bryński
>
> I'm trying to run 2.3.0 (from master) on my yarn cluster.
> The result is:
> {code}
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> Permission denied: user=yarn, access=EXECUTE, 
> inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.(ApplicationMaster.scala:216)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}
> I think the problem exists because I'm not using Yarn impersonation, which 
> means that all jobs on the cluster are run as user yarn.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-24 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217357#comment-16217357
 ] 

Dongjoon Hyun commented on SPARK-22335:
---

[~CBribiescas], is this issue about types? To me, it looks like it is about *names*, 
not *types*. DataFrames also have names. From that point of view, Dataset had 
better be consistent with DataFrames.

> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>
> I see union uses column order for a DF. This to me is "fine" since they 
> aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it is actually 
> giving the wrong result. If you try to access the members by name, it will 
> use the order. Here is a reproducible case (Spark 2.2.0):
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] 
> should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a 
> Dataset[AB] should be correctly mapped by type, not by column order
>abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though 
> columns were out of order.
>   abDs.map(_.a).show() // This is correct too
>   baDs.select("a","b").as[AB].union(abDs).show() // This is the same 
> workaround for linked issue, slightly modified.  However this seems wrong 
> since its supposed to be strongly typed
>   
>   baDs.rdd.toDF().as[AB].union(abDs).show()  // This however gives correct 
> result, which is logically inconsistent behavior
>   abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives 
> correct result
> {code}
> So it's inconsistent and a bug IMO.  And I'm not sure that the suggested 
> workaround is really fair, since I'm supposed to be getting values of type `AB`.  More 
> importantly, I think the issue is bigger when you consider that it happens 
> even if you read from parquet (as you would expect), and that it's 
> inconsistent when going to/from RDD.
> I imagine it's just lazily converting to the typed DS instead of doing so initially.  So 
> either that typing could be prioritized to happen before the union, or 
> the union of DFs could be done with column order taken into account.  Again, 
> this is speculation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2017-10-24 Thread Dan Blanchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217354#comment-16217354
 ] 

Dan Blanchard commented on SPARK-16367:
---

Right, sorry. I just meant to point out that the proposal you have up there has 
someone installing the {{wheelhouse}} package, which is unnecessary. 

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale* 
> It is recommended, in order to deploy packages written in Scala, to 
> build big fat jar files. This allows having all dependencies in one package, 
> so the only "cost" is the copy time to deploy this file on every Spark node. 
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to mess with the IT to deploy 
> the packages on the virtualenv of each node. 
> This ticket proposes to give users the ability to deploy their job as 
> "Wheels" packages. The Python community is strongly advocating for 
> this way of packaging and distributing Python applications as the "standard way 
> of deploying Python apps". In other words, this is the "Pythonic Way of 
> Deployment".
> *Previous approaches* 
> I based the current proposal over the two following bugs related to this 
> point: 
> - SPARK-6764 ("Wheel support for PySpark") 
> - SPARK-13587("Support virtualenv in PySpark")
> First part of my proposal was to merge, in order to support wheels install 
> and virtualenv creation 
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
> In Python, the packaging standard is now the "wheels" file format, which goes 
> further than good old ".egg" files. With a wheel file (".whl"), the package 
> is already prepared for a given architecture. You can have several wheels for 
> a given package version, each specific to an architecture or environment. 
> For example, look at https://pypi.python.org/pypi/numpy to see all the different 
> versions of Wheel available. 
> The {{pip}} tool knows how to select the right wheel file matching the 
> current system, and how to install this package at light speed (without 
> compilation). Said otherwise, a package that requires compilation of a C 
> module, for instance "numpy", does *not* compile anything when installing 
> from a wheel file. 
> {{pypi.python.org}} already provides wheels for major Python versions. If the 
> wheel is not available, pip will compile it from source anyway. Mirroring of 
> Pypi is possible through projects such as http://doc.devpi.net/latest/ 
> (untested) or the Pypi mirror support on Artifactory (tested personally). 
> {{pip}} also provides the ability to easily generate all wheels of all 
> packages used for a given project inside a "virtualenv". This is 
> called a "wheelhouse". You can even skip this compilation and 
> retrieve the wheels directly from pypi.python.org. 
> *Use Case 1: no internet connectivity* 
> Here is my first proposal for a deployment workflow, in the case where the Spark 
> cluster does not have any internet connectivity or access to a Pypi mirror. 
> In this case the simplest way to deploy a project with several dependencies 
> is to build and then send the complete "wheelhouse": 
> - you are writing a PySpark script that increases in terms of size and 
> dependencies. Deploying on Spark for example requires building numpy or 
> Theano and other dependencies 
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this script 
> into a standard Python package: 
> -- write a {{requirements.txt}}. I recommend specifying all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt 
> {code} 
> astroid==1.4.6 # via pylint 
> autopep8==1.2.4 
> click==6.6 # via pip-tools 
> colorama==0.3.7 # via pylint 
> enum34==1.1.6 # via hypothesis 
> findspark==1.0.0 # via spark-testing-base 
> first==2.0.1 # via pip-tools 
> hypothesis==3.4.0 # via spark-testing-base 
> lazy-object-proxy==1.2.2 # via astroid 
> linecache2==1.0.0 # via traceback2 
> pbr==1.10.0 
> pep8==1.7.0 # via autopep8 
> pip-tools==1.6.5 
> py==1.4.31 # via pytest 
> pyflakes==1.2.3 
> pylint==1.5.6 
> pytest==2.9.2 # via spark-testing-base 
> six==1.10.0 # via astroid, pip-tools, pylint, unittest2 
> spark-testing-base==0.0.7.post2 
> traceback2==1.4.0 # via unittest2 
> unittest2==1.1.0 # via spark-testing-base 
> wheel==0.29.0 
> wrapt==1.10.8 # via astroid 
> {code} 
> -- write a setup.py with some entry points or package. Use 
> 

[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2017-10-24 Thread Ashwin Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217333#comment-16217333
 ] 

Ashwin Shankar commented on SPARK-18105:


Hi [~davies] [~cloud_fan]
We hit the same issue. What is the aforementioned workaround that went into 
2.1? Any other workaround?
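
Not the fix referenced above, but one mitigation sometimes tried while this is unresolved -- an assumption on my part, not confirmed by this thread -- is to switch the shuffle/IO compression codec away from lz4:
{code}
import org.apache.spark.sql.SparkSession

// Assumed mitigation only: avoid the lz4 codec entirely so the corrupted-stream
// path in LZ4BlockInputStream is never exercised. Trades some compression speed/ratio.
val spark = SparkSession.builder()
  .appName("lz4-workaround")
  .config("spark.io.compression.codec", "snappy")
  .getOrCreate()
{code}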

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> When lz4 is used to compress the shuffle files, it may fail to decompress it 
> as "stream is corrupt"
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>   at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> https://github.com/jpountz/lz4-java/issues/89



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22240) S3 CSV number of partitions incorrectly computed

2017-10-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217325#comment-16217325
 ] 

Steve Loughran commented on SPARK-22240:


I'm doing some testing with master & reading files off S3A, with s3a logging 
file IO in close() statements, and I'm seeing a full load of the file in the 
executor, even when schema inference is turned off

{code}
2017-10-24 18:03:20,085 [ScalaTest-main-running-S3ACommitBulkDataSuite] INFO  
codegen.CodeGenerator (Logging.scala:logInfo(54)) - Code generated in 
149.680004 ms
2017-10-24 18:03:20,423 [Executor task launch worker for task 0] INFO  
codegen.CodeGenerator (Logging.scala:logInfo(54)) - Code generated in 9.406645 
ms
2017-10-24 18:03:20,434 [Executor task launch worker for task 0] DEBUG 
s3a.S3AFileSystem (S3AFileSystem.java:open(680)) - Opening 
's3a://landsat-pds/scene_list.gz' for reading; input policy = sequential
2017-10-24 18:03:20,435 [Executor task launch worker for task 0] DEBUG 
s3a.S3AFileSystem (S3AFileSystem.java:innerGetFileStatus(2065)) - Getting path 
status for s3a://landsat-pds/scene_list.gz  (scene_list.gz)
2017-10-24 18:03:21,175 [Executor task launch worker for task 0] DEBUG 
s3a.S3AFileSystem (S3AFileSystem.java:s3GetFileStatus(2133)) - Found exact 
file: normal file
2017-10-24 18:03:21,184 [Executor task launch worker for task 0] INFO  
compress.CodecPool (CodecPool.java:getDecompressor(184)) - Got brand-new 
decompressor [.gz]
2017-10-24 18:03:21,188 [Executor task launch worker for task 0] DEBUG 
s3a.S3AInputStream (S3AInputStream.java:reopen(174)) - 
reopen(s3a://landsat-pds/scene_list.gz) for read from new offset 
range[0-45603307], length=65536, streamPosition=0, nextReadPosition=0, 
policy=sequential
2017-10-24 18:03:53,447 [Executor task launch worker for task 0] DEBUG 
s3a.S3AInputStream (S3AInputStream.java:closeStream(490)) - Closing stream 
close() operation: soft
2017-10-24 18:03:53,447 [Executor task launch worker for task 0] DEBUG 
s3a.S3AInputStream (S3AInputStream.java:closeStream(503)) - Drained stream of 0 
bytes
2017-10-24 18:03:53,448 [Executor task launch worker for task 0] DEBUG 
s3a.S3AInputStream (S3AInputStream.java:closeStream(524)) - Stream 
s3a://landsat-pds/scene_list.gz closed: close() operation; remaining=0 
streamPos=45603307, nextReadPos=45603307, request range 0-45603307 
length=45603307
2017-10-24 18:03:53,448 [Executor task launch worker for task 0] DEBUG 
s3a.S3AInputStream (S3AInputStream.java:close(463)) - Statistics of stream 
scene_list.gz
StreamStatistics{OpenOperations=1, CloseOperations=1, Closed=1, Aborted=0, 
SeekOperations=0, ReadExceptions=0, ForwardSeekOperations=0, 
BackwardSeekOperations=0, BytesSkippedOnSeek=0, BytesBackwardsOnSeek=0, 
BytesRead=45603307, BytesRead excluding skipped=45603307, ReadOperations=5240, 
ReadFullyOperations=0, ReadsIncomplete=5240, BytesReadInClose=0, 
BytesDiscardedInAbort=0, InputPolicy=1, InputPolicySetCount=1}
{code}

That is, handing in a csv.gz triggers a full read, which is "observably slow" 
when reading 400MB of data from across the planet. Load options were: 
{code}
  val csvOptions = Map(
"header" -> "true",
"ignoreLeadingWhiteSpace" -> "true",
"ignoreTrailingWhiteSpace" -> "true",
"timestampFormat" -> "-MM-dd HH:mm:ss.SSSZZZ",
"inferSchema" -> "false",
"mode" -> "DROPMALFORMED",
  )
{code}

I'm going to set a schema to make this go away (it'll become a dataframe one 
transform later anyway), but I don't believe this used to occur. I'll try 
building against other spark versions to see.


> S3 CSV number of partitions incorrectly computed
> 
>
> Key: SPARK-22240
> URL: https://issues.apache.org/jira/browse/SPARK-22240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
>Reporter: Arthur Baudry
>
> Reading CSV out of S3 using S3A protocol does not compute the number of 
> partitions correctly in Spark 2.2.0.
> With Spark 2.2.0 I get only one partition when loading a 14GB file
> {code:java}
> scala> val input = spark.read.format("csv").option("header", 
> "true").option("delimiter", "|").option("multiLine", 
> "true").load("s3a://")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: 
> string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 1
> {code}
> While in Spark 2.0.2 I had:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", 
> "true").option("delimiter", "|").option("multiLine", 
> "true").load("s3a://")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: 
> string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 115
> {code}
> This introduces obvious performance issues 

[jira] [Comment Edited] (SPARK-16367) Wheelhouse Support for PySpark

2017-10-24 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217322#comment-16217322
 ] 

Semet edited comment on SPARK-16367 at 10/24/17 5:29 PM:
-

Yes, I don't use it because it is a feature of {{pip}}:
{code}
pip wheel --wheel-dir wheelhouse .
{code}

It is described [in here|https://wheel.readthedocs.io/en/stable/].


was (Author: gae...@xeberon.net):
Yes, I don't use it because it is a feature of {{pip}}:
{{code}
pip wheel --wheel-dir wheelhouse .
{{code}}

It is described [in here|https://wheel.readthedocs.io/en/stable/].

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale* 
> It is recommended, in order to deploy packages written in Scala, to 
> build big fat jar files. This allows having all dependencies in one package, 
> so the only "cost" is the copy time to deploy this file on every Spark node. 
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to mess with the IT to deploy 
> the packages on the virtualenv of each node. 
> This ticket proposes to give users the ability to deploy their job as 
> "Wheels" packages. The Python community is strongly advocating for 
> this way of packaging and distributing Python applications as the "standard way 
> of deploying Python apps". In other words, this is the "Pythonic Way of 
> Deployment".
> *Previous approaches* 
> I based the current proposal over the two following bugs related to this 
> point: 
> - SPARK-6764 ("Wheel support for PySpark") 
> - SPARK-13587("Support virtualenv in PySpark")
> First part of my proposal was to merge, in order to support wheels install 
> and virtualenv creation 
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
> In Python, the packaging standard is now the "wheels" file format, which goes 
> further than good old ".egg" files. With a wheel file (".whl"), the package 
> is already prepared for a given architecture. You can have several wheels for 
> a given package version, each specific to an architecture or environment. 
> For example, look at https://pypi.python.org/pypi/numpy to see all the different 
> versions of Wheel available. 
> The {{pip}} tool knows how to select the right wheel file matching the 
> current system, and how to install this package at light speed (without 
> compilation). Said otherwise, a package that requires compilation of a C 
> module, for instance "numpy", does *not* compile anything when installing 
> from a wheel file. 
> {{pypi.python.org}} already provides wheels for major Python versions. If the 
> wheel is not available, pip will compile it from source anyway. Mirroring of 
> Pypi is possible through projects such as http://doc.devpi.net/latest/ 
> (untested) or the Pypi mirror support on Artifactory (tested personally). 
> {{pip}} also provides the ability to easily generate all wheels of all 
> packages used for a given project inside a "virtualenv". This is 
> called a "wheelhouse". You can even skip this compilation and 
> retrieve the wheels directly from pypi.python.org. 
> *Use Case 1: no internet connectivity* 
> Here is my first proposal for a deployment workflow, in the case where the Spark 
> cluster does not have any internet connectivity or access to a Pypi mirror. 
> In this case the simplest way to deploy a project with several dependencies 
> is to build and then send the complete "wheelhouse": 
> - you are writing a PySpark script that increases in terms of size and 
> dependencies. Deploying on Spark for example requires building numpy or 
> Theano and other dependencies 
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this script 
> into a standard Python package: 
> -- write a {{requirements.txt}}. I recommend specifying all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt 
> {code} 
> astroid==1.4.6 # via pylint 
> autopep8==1.2.4 
> click==6.6 # via pip-tools 
> colorama==0.3.7 # via pylint 
> enum34==1.1.6 # via hypothesis 
> findspark==1.0.0 # via spark-testing-base 
> first==2.0.1 # via pip-tools 
> hypothesis==3.4.0 # via spark-testing-base 
> lazy-object-proxy==1.2.2 # via astroid 
> linecache2==1.0.0 # via traceback2 
> pbr==1.10.0 
> pep8==1.7.0 # via autopep8 
> pip-tools==1.6.5 
> py==1.4.31 # via pytest 
> pyflakes==1.2.3 
> pylint==1.5.6 
> pytest==2.9.2 # via spark-testing-base 
> six==1.10.0 # via astroid, 

[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2017-10-24 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217322#comment-16217322
 ] 

Semet commented on SPARK-16367:
---

Yes, I don't use it because it is a feature of {{pip}}:
{code}
pip wheel --wheel-dir wheelhouse .
{code}

It is described [in here|https://wheel.readthedocs.io/en/stable/].

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale* 
> It is recommended, in order to deploy packages written in Scala, to 
> build big fat jar files. This allows having all dependencies in one package, 
> so the only "cost" is the copy time to deploy this file on every Spark node. 
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to mess with the IT to deploy 
> the packages on the virtualenv of each node. 
> This ticket proposes to give users the ability to deploy their job as 
> "Wheels" packages. The Python community is strongly advocating for 
> this way of packaging and distributing Python applications as the "standard way 
> of deploying Python apps". In other words, this is the "Pythonic Way of 
> Deployment".
> *Previous approaches* 
> I based the current proposal over the two following bugs related to this 
> point: 
> - SPARK-6764 ("Wheel support for PySpark") 
> - SPARK-13587("Support virtualenv in PySpark")
> First part of my proposal was to merge, in order to support wheels install 
> and virtualenv creation 
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
> In Python, the packaging standard is now the "wheels" file format, which goes 
> further than good old ".egg" files. With a wheel file (".whl"), the package 
> is already prepared for a given architecture. You can have several wheels for 
> a given package version, each specific to an architecture or environment. 
> For example, look at https://pypi.python.org/pypi/numpy to see all the different 
> versions of Wheel available. 
> The {{pip}} tool knows how to select the right wheel file matching the 
> current system, and how to install this package at light speed (without 
> compilation). Said otherwise, a package that requires compilation of a C 
> module, for instance "numpy", does *not* compile anything when installing 
> from a wheel file. 
> {{pypi.python.org}} already provides wheels for major Python versions. If the 
> wheel is not available, pip will compile it from source anyway. Mirroring of 
> Pypi is possible through projects such as http://doc.devpi.net/latest/ 
> (untested) or the Pypi mirror support on Artifactory (tested personally). 
> {{pip}} also provides the ability to easily generate all wheels of all 
> packages used for a given project inside a "virtualenv". This is 
> called a "wheelhouse". You can even skip this compilation and 
> retrieve the wheels directly from pypi.python.org. 
> *Use Case 1: no internet connectivity* 
> Here is my first proposal for a deployment workflow, in the case where the Spark 
> cluster does not have any internet connectivity or access to a Pypi mirror. 
> In this case the simplest way to deploy a project with several dependencies 
> is to build and then send the complete "wheelhouse": 
> - you are writing a PySpark script that increases in terms of size and 
> dependencies. Deploying on Spark for example requires building numpy or 
> Theano and other dependencies 
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this script 
> into a standard Python package: 
> -- write a {{requirements.txt}}. I recommend specifying all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt 
> {code} 
> astroid==1.4.6 # via pylint 
> autopep8==1.2.4 
> click==6.6 # via pip-tools 
> colorama==0.3.7 # via pylint 
> enum34==1.1.6 # via hypothesis 
> findspark==1.0.0 # via spark-testing-base 
> first==2.0.1 # via pip-tools 
> hypothesis==3.4.0 # via spark-testing-base 
> lazy-object-proxy==1.2.2 # via astroid 
> linecache2==1.0.0 # via traceback2 
> pbr==1.10.0 
> pep8==1.7.0 # via autopep8 
> pip-tools==1.6.5 
> py==1.4.31 # via pytest 
> pyflakes==1.2.3 
> pylint==1.5.6 
> pytest==2.9.2 # via spark-testing-base 
> six==1.10.0 # via astroid, pip-tools, pylint, unittest2 
> spark-testing-base==0.0.7.post2 
> traceback2==1.4.0 # via unittest2 
> unittest2==1.1.0 # via spark-testing-base 
> wheel==0.29.0 
> wrapt==1.10.8 # via astroid 
> {code} 
> -- write a setup.py with some entry points or package. Use 
> 

[jira] [Created] (SPARK-22344) Prevent R CMD check from using /tmp

2017-10-24 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-22344:
-

 Summary: Prevent R CMD check from using /tmp
 Key: SPARK-22344
 URL: https://issues.apache.org/jira/browse/SPARK-22344
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.2
Reporter: Shivaram Venkataraman


When R CMD check is run on the SparkR package it leaves behind files in /tmp 
which is a violation of CRAN policy. We should instead write to Rtmpdir. Notes 
from CRAN are below

{code}
Checking this leaves behind dirs

   hive/$USER
   $USER

and files named like

   b4f6459b-0624-4100-8358-7aa7afbda757_resources

in /tmp, in violation of the CRAN Policy.
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-10-24 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217310#comment-16217310
 ] 

Shivaram Venkataraman commented on SPARK-15799:
---

I created https://issues.apache.org/jira/browse/SPARK-22344

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>Assignee: Shivaram Venkataraman
> Fix For: 2.1.2
>
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled

2017-10-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217299#comment-16217299
 ] 

Juan Rodríguez Hortalá commented on SPARK-22148:


Hi, 

I've been working on this issue, and I would like to get your feedback on the 
following approach. The idea is that instead of failing in 
`TaskSetManager.abortIfCompletelyBlacklisted` when a task cannot be scheduled 
on any executor but dynamic allocation is enabled, we register this task with 
`ExecutorAllocationManager`. `ExecutorAllocationManager` then requests 
additional executors for these "unschedulable tasks" by increasing the value 
returned in `ExecutorAllocationManager.maxNumExecutorsNeeded`. This counts 
these tasks twice, but that makes sense because the current executors don't 
have any slot for them, so we actually want new executors that are able to run 
them. To avoid a deadlock caused by tasks being unschedulable forever, we 
store the timestamp at which a task was registered as unschedulable, and in 
`ExecutorAllocationManager.schedule` we abort the application if some task has 
been unschedulable for longer than a configurable age threshold. This gives 
dynamic allocation an opportunity to get more executors that can run the 
tasks, without making the application wait forever. 

Attached is a patch with a draft of this approach. Looking forward to your 
feedback. 
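
To make the bookkeeping concrete, here is a small, self-contained sketch of the idea (plain Python, purely illustrative; the names and the timeout value are placeholders, and the actual draft is the attached Scala diff against `ExecutorAllocationManager`):
{code:python}
import time

# Hypothetical age threshold; in the real patch this would come from a new
# Spark configuration entry.
UNSCHEDULABLE_TASK_TIMEOUT_S = 120

# task_id -> timestamp of the first time the task could not be scheduled
unschedulable_since = {}

def register_unschedulable(task_id):
    # called where abortIfCompletelyBlacklisted would currently abort
    unschedulable_since.setdefault(task_id, time.time())

def mark_scheduled(task_id):
    # called once the task finally gets a slot on some executor
    unschedulable_since.pop(task_id, None)

def extra_executors_needed():
    # added on top of the regular executor demand, so dynamic allocation
    # requests new executors that can actually run these tasks
    return len(unschedulable_since)

def check_abort():
    # called periodically from the allocation manager's schedule loop
    now = time.time()
    for task_id, since in unschedulable_since.items():
        if now - since > UNSCHEDULABLE_TASK_TIMEOUT_S:
            raise RuntimeError(
                "Task %s has been unschedulable for more than %s seconds"
                % (task_id, UNSCHEDULABLE_TASK_TIMEOUT_S))
{code}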

> TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current 
> executors are blacklisted but dynamic allocation is enabled
> -
>
> Key: SPARK-22148
> URL: https://issues.apache.org/jira/browse/SPARK-22148
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Juan Rodríguez Hortalá
> Attachments: SPARK-22148_WIP.diff
>
>
> Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and 
> the whole Spark job with `task X (partition Y) cannot run anywhere due to 
> node and executor blacklist. Blacklisting behavior can be configured via 
> spark.blacklist.*.` when all the available executors are blacklisted for a 
> pending Task or TaskSet. This makes sense for static allocation, where the 
> set of executors is fixed for the duration of the application, but this might 
> lead to unnecessary job failures when dynamic allocation is enabled. For 
> example, in a Spark application with a single job at a time, when a node 
> fails at the end of a stage attempt, all other executors will complete their 
> tasks, but the tasks running in the executors of the failing node will be 
> pending. Spark will keep waiting for those tasks for 2 minutes by default 
> (spark.network.timeout) until the heartbeat timeout is triggered, and then it 
> will blacklist those executors for that stage. At that point in time, other 
> executors would have been released after being idle for 1 minute by default 
> (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't 
> started yet and so there are no more tasks available (assuming the default of 
> spark.speculation = false). So Spark will fail because the only executors 
> available are blacklisted for that stage. 
> An alternative is requesting more executors to the cluster manager in this 
> situation. This could be retried a configurable number of times after a 
> configurable wait time between request attempts, so if the cluster manager 
> fails to provide a suitable executor then the job is aborted like in the 
> previous case. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled

2017-10-24 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juan Rodríguez Hortalá updated SPARK-22148:
---
Attachment: SPARK-22148_WIP.diff

> TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current 
> executors are blacklisted but dynamic allocation is enabled
> -
>
> Key: SPARK-22148
> URL: https://issues.apache.org/jira/browse/SPARK-22148
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Juan Rodríguez Hortalá
> Attachments: SPARK-22148_WIP.diff
>
>
> Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and 
> the whole Spark job with `task X (partition Y) cannot run anywhere due to 
> node and executor blacklist. Blacklisting behavior can be configured via 
> spark.blacklist.*.` when all the available executors are blacklisted for a 
> pending Task or TaskSet. This makes sense for static allocation, where the 
> set of executors is fixed for the duration of the application, but this might 
> lead to unnecessary job failures when dynamic allocation is enabled. For 
> example, in a Spark application with a single job at a time, when a node 
> fails at the end of a stage attempt, all other executors will complete their 
> tasks, but the tasks running in the executors of the failing node will be 
> pending. Spark will keep waiting for those tasks for 2 minutes by default 
> (spark.network.timeout) until the heartbeat timeout is triggered, and then it 
> will blacklist those executors for that stage. At that point in time, other 
> executors would have been released after being idle for 1 minute by default 
> (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't 
> started yet and so there are no more tasks available (assuming the default of 
> spark.speculation = false). So Spark will fail because the only executors 
> available are blacklisted for that stage. 
> An alternative is requesting more executors to the cluster manager in this 
> situation. This could be retried a configurable number of times after a 
> configurable wait time between request attempts, so if the cluster manager 
> fails to provide a suitable executor then the job is aborted like in the 
> previous case. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-10-24 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217279#comment-16217279
 ] 

Hossein Falaki commented on SPARK-15799:


Is there a ticket to follow up on the new policy violation issue? The package 
has been archived by CRAN.

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>Assignee: Shivaram Venkataraman
> Fix For: 2.1.2
>
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2017-10-24 Thread Dan Blanchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217253#comment-16217253
 ] 

Dan Blanchard commented on SPARK-16367:
---

You don't actually appear to use the {{wheelhouse}} Python package at all in 
this proposal. You use {{pip wheel --wheel-dir=/path/to/wheelhouse}} to create 
a directory with all the wheels for the project, and that just requires {{pip}} 
and {{wheel}}. {{wheelhouse}} is a separate package for performing operations 
on those directories, but you do not appear to use it.

The whole wheelhouse-as-a-name-for-a-directory-of-wheels thing is just a pun 
that a lot of people like to use, separate from the {{wheelhouse}} package.

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale* 
> It is recommended, in order to deploy Spark packages written in Scala, to 
> build big fat jar files. This puts all dependencies in one package, so the 
> only "cost" is the copy time to deploy this file on every Spark node. 
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to mess with the IT to deploy 
> the packages into the virtualenv of each node. 
> This ticket proposes to give users the ability to deploy their job as 
> "wheel" packages. The Python community strongly advocates this way of 
> packaging and distributing Python applications as the "standard way of 
> deploying Python apps". In other words, this is the "Pythonic Way of 
> Deployment".
> *Previous approaches* 
> I based the current proposal on the two following bugs related to this 
> point: 
> - SPARK-6764 ("Wheel support for PySpark") 
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal was to merge the two, in order to support wheel 
> installation and virtualenv creation. 
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
> In Python, the packaging standard is now the "wheel" file format, which goes 
> further than the good old ".egg" files. With a wheel file (".whl"), the package 
> is already prepared for a given architecture. You can have several wheels for 
> a given package version, each specific to an architecture or environment. 
> For example, look at https://pypi.python.org/pypi/numpy for all the different 
> versions of wheels available. 
> The {{pip}} tool knows how to select the right wheel file matching the 
> current system, and how to install this package at light speed (without 
> compilation). Said otherwise, a package that requires compilation of a C 
> module, for instance "numpy", does *not* compile anything when installing 
> from a wheel file. 
> {{pypi.python.org}} already provides wheels for major Python versions. If the 
> wheel is not available, pip will compile it from source anyway. Mirroring of 
> PyPI is possible through projects such as http://doc.devpi.net/latest/ 
> (untested) or the PyPI mirror support on Artifactory (tested personally). 
> {{pip}} also provides the ability to easily generate all wheels of all 
> packages used by a given project inside a "virtualenv". This is called a 
> "wheelhouse". You can even skip this compilation and retrieve the wheels 
> directly from pypi.python.org. 
> *Use Case 1: no internet connectivity* 
> Here is my first proposal for a deployment workflow, in the case where the 
> Spark cluster does not have any internet connectivity or access to a PyPI 
> mirror. In this case the simplest way to deploy a project with several 
> dependencies is to build and then send the complete "wheelhouse": 
> - you are writing a PySpark script that grows in size and number of 
> dependencies. Deploying on Spark, for example, requires building numpy or 
> Theano and other dependencies 
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this 
> script into a standard Python package: 
> -- write a {{requirements.txt}}. I recommend pinning all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt 
> {code} 
> astroid==1.4.6 # via pylint 
> autopep8==1.2.4 
> click==6.6 # via pip-tools 
> colorama==0.3.7 # via pylint 
> enum34==1.1.6 # via hypothesis 
> findspark==1.0.0 # via spark-testing-base 
> first==2.0.1 # via pip-tools 
> hypothesis==3.4.0 # via spark-testing-base 
> lazy-object-proxy==1.2.2 # via astroid 
> linecache2==1.0.0 # via traceback2 
> pbr==1.10.0 
> pep8==1.7.0 # via autopep8 
> pip-tools==1.6.5 
> py==1.4.31 # via pytest 
> 

[jira] [Commented] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off

2017-10-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217226#comment-16217226
 ] 

Marcelo Vanzin commented on SPARK-22341:


I was messing with this area recently, so let me take a look...

> [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
> --
>
> Key: SPARK-22341
> URL: https://issues.apache.org/jira/browse/SPARK-22341
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Maciej Bryński
>
> I'm trying to run 2.3.0 (from master) on my yarn cluster.
> The result is:
> {code}
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> Permission denied: user=yarn, access=EXECUTE, 
> inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.(ApplicationMaster.scala:216)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}
> I think the problem exists because I'm not using YARN impersonation, which 
> means that all jobs on the cluster are run as user yarn.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2017-10-24 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217221#comment-16217221
 ] 

Semet edited comment on SPARK-13587 at 10/24/17 4:46 PM:
-

Hello. For me this solution is equivalent to the "Wheelhouse" (SPARK-16367) 
proposal I made, even without having to modify PySpark at all. I even think you 
can package a wheelhouse using this {{--archive}} argument.
The drawback is indeed that your spark-submit has to send this package to each 
node (1 to n). If PySpark supported {{requirements.txt}}/{{Pipfile}} dependency 
description formats, each node would download the dependencies by itself...
The strong argument for a wheelhouse is that it only packages the libraries used 
by the project, not the complete environment. The drawback is that it may not 
work well with Anaconda.


was (Author: gae...@xeberon.net):
Hello. For me this solution is equivalent with my "Wheelhouse" (SPARK-16367) 
proposal I made, even without having to modify pyspark at all. I even think you 
can package a wheelhouse using this {{--archive}} argument.
The drawback is indeed your spark-submit has to send this package to each node 
(1 to n). If Pyspark supported {{requirements.txt}}/{{Pipfile}} dependencies 
description formats, each node would download by itself the dependencies...

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and 
> not easy when switching to a different environment)
> Python now has two different virtualenv implementations. One is the native 
> virtualenv; the other is through conda. This JIRA is about bringing these two 
> tools to the distributed environment.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2017-10-24 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217221#comment-16217221
 ] 

Semet commented on SPARK-13587:
---

Hello. For me this solution is equivalent to the "Wheelhouse" (SPARK-16367) 
proposal I made, even without having to modify PySpark at all. I even think you 
can package a wheelhouse using this {{--archive}} argument.
The drawback is indeed that your spark-submit has to send this package to each 
node (1 to n). If PySpark supported {{requirements.txt}}/{{Pipfile}} dependency 
description formats, each node would download the dependencies by itself...

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and 
> not easy when switching to a different environment)
> Python now has two different virtualenv implementations. One is the native 
> virtualenv; the other is through conda. This JIRA is about bringing these two 
> tools to the distributed environment.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata

2017-10-24 Thread Andriy Kushnir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217201#comment-16217201
 ] 

Andriy Kushnir commented on SPARK-9686:
---

[~rxin], I did a little research into this error.
To invoke {{run()}} → {{runInternal()}} on any 
{{org.apache.hive.service.cli.operation.Operation}} (for example, 
{{GetSchemasOperation}}) we need {{IMetaStoreClient}}. Currently it's taken 
from {{HiveSession}} instance:
{code:java}
public class GetSchemasOperation extends MetadataOperation {
@Override
public void runInternal() throws HiveSQLException {
IMetaStoreClient metastoreClient = 
getParentSession().getMetaStoreClient();
}
}
{code}

All opened {{HiveSession}}s are handled by the 
{{org.apache.hive.service.cli.session.SessionManager}} instance.
{{SessionManager}}, among others, implements the 
{{org.apache.hive.service.Service}} interface, and all {{Service}}s are 
initialized with the same Hive configuration:
{code:java}
public interface Service { 
void init(HiveConf conf);
}
{code}
When {{org.apache.spark.sql.hive.thriftserver.HiveThriftServer2}} initializes, 
all {{org.apache.hive.service.CompositeService}}s receive the same {{HiveConf}}:

{code:java}
private[hive] class HiveThriftServer2(sqlContext: SQLContext) extends 
HiveServer2 with ReflectedCompositeService {
override def init(hiveConf: HiveConf) {
initCompositeService(hiveConf)
}
}

object HiveThriftServer2 extends Logging {
@DeveloperApi
def startWithContext(sqlContext: SQLContext): Unit = {
val server = new HiveThriftServer2(sqlContext)

val executionHive = HiveUtils.newClientForExecution(
  sqlContext.sparkContext.conf,
  sqlContext.sessionState.newHadoopConf())

server.init(executionHive.conf)
}
}

{code}

So {{HiveUtils#newClientForExecution()}} returns an implementation of 
{{IMetaStoreClient}} which *ALWAYS* points to a Derby metastore (see the 
docstrings and comments in 
{{org.apache.spark.sql.hive.HiveUtils#newTemporaryConfiguration()}}).

IMHO, to get correct metadata we need to additionally create another 
{{IMetaStoreClient}} with {{newClientForMetadata()}}, and pass its 
{{HiveConf}} to the underlying {{Service}}s.

> Spark Thrift server doesn't return correct JDBC metadata 
> -
>
> Key: SPARK-9686
> URL: https://issues.apache.org/jira/browse/SPARK-9686
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2
>Reporter: pin_zhang
>Priority: Critical
> Attachments: SPARK-9686.1.patch.txt
>
>
> 1. Start  start-thriftserver.sh
> 2. connect with beeline
> 3. create table
> 4.show tables, the new created table returned
> 5.
>   Class.forName("org.apache.hive.jdbc.HiveDriver");
>   String URL = "jdbc:hive2://localhost:1/default";
>Properties info = new Properties();
> Connection conn = DriverManager.getConnection(URL, info);
>   ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(),
>null, null, null);
> Problem:
>No tables with returned this API, that work in spark1.3



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22343) Add support for publishing Spark metrics into Prometheus

2017-10-24 Thread Janos Matyas (JIRA)
Janos Matyas created SPARK-22343:


 Summary: Add support for publishing Spark metrics into Prometheus
 Key: SPARK-22343
 URL: https://issues.apache.org/jira/browse/SPARK-22343
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Janos Matyas


I've created a PR for supporting Prometheus as a metrics sink for Spark in the 
https://github.com/apache-spark-on-k8s/spark fork. Based on feedback from the 
maintainers of that project, I am creating an issue here as well so it is 
tracked upstream. See below the original text of the PR: 
_
Publishing Spark metrics into Prometheus - as discussed earlier in #384. 
Implemented a metrics sink that publishes Spark metrics into Prometheus via the 
[Prometheus Pushgateway](https://prometheus.io/docs/instrumenting/pushing/). 
Metrics data published by Spark is based on 
[Dropwizard](http://metrics.dropwizard.io/). The format of Spark metrics is not 
supported natively by Prometheus, so they are converted using 
[DropwizardExports](https://prometheus.io/client_java/io/prometheus/client/dropwizard/DropwizardExports.html)
 prior to pushing metrics to the pushgateway.

Also, the default Prometheus pushgateway client API implementation does not 
support metrics timestamps, so the client API has been enhanced to enrich 
metrics data with timestamps. _




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22301) Add rule to Optimizer for In with empty list of values

2017-10-24 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22301.
-
   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 2.3.0

> Add rule to Optimizer for In with empty list of values
> --
>
> Key: SPARK-22301
> URL: https://issues.apache.org/jira/browse/SPARK-22301
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
> Fix For: 2.3.0
>
>
> For performance reasons, we should resolve an In operation with an empty list 
> of values to false in the optimization phase.
> For further reference, please look at the discussion on PRs: 
> https://github.com/apache/spark/pull/19522 and 
> https://github.com/apache/spark/pull/19494.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.

2017-10-24 Thread Cheburakshu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217134#comment-16217134
 ] 

Cheburakshu commented on SPARK-22277:
-

There are two problems I faced because of which I quit using ChiSqSelector,
since it produces inconsistent results. Also, I am not able to find a single
example in the documentation that uses the selected features for training any
model. I have already opened a bug where the transform output is garbling
content.

1. If I use numFeatures=5 and numFeatures=3 and examine the
selectedFeatures indices, the three features are not a subset of the five
features.
2. If I use VectorIndexer/StringIndexer+OneHotEncoder before
ChiSqSelector, the selectedFeatures indices of the model go out of bounds.

I don't understand the comment that my code has some issue, since I took it
straight from the Spark ChiSqSelector example. Are you suggesting that the
Spark example itself is wrong?

https://spark.apache.org/docs/latest/ml-features.html#chisqselector

[~cheburakshu] Your post code has some issue:
{code}
df = spark.createDataFrame([
(7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
(8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
(9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features",
"clicked"])
{code}
Are these features categorical features ?
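
For what it's worth, a possible workaround (an untested sketch reusing the {{result}} DataFrame and column names from the example above, not a confirmed fix) is to run VectorIndexer on the selector output, so that the nominal metadata gets its number of values filled in before the tree is trained:
{code:python}
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.classification import DecisionTreeClassifier

# Features with <= maxCategories distinct values are re-marked as categorical
# with their value counts; the rest are treated as continuous.
indexer = VectorIndexer(inputCol="selectedFeatures",
                        outputCol="indexedFeatures",
                        maxCategories=4)
indexed = indexer.fit(result).transform(result)

dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="indexedFeatures")
model = dt.fit(indexed)
{code}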




> Chi Square selector garbling Vector content.
> 
>
> Key: SPARK-22277
> URL: https://issues.apache.org/jira/browse/SPARK-22277
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> There is a difference in behavior when the ChiSquare selector is used versus 
> using the features directly in the decision tree classifier. 
> In the code below, I have used the ChiSquare selector as a pass-through, but 
> the decision tree classifier is unable to process its output. It is able to 
> process the features when they are used directly.
> The example is pulled directly from the Apache Spark Python documentation.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> import sys
> df = spark.createDataFrame([
> (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
> (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
> (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
> "clicked"])
> # ChiSq selector will just be a pass-through. All four features in the input 
> will be in the output also.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>  outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected" % 
> selector.getNumTopFeatures())
> from pyspark.ml.classification import DecisionTreeClassifier
> try:
> # Fails
> dt = 
> DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures")
> model = dt.fit(result)
> except:
> print(sys.exc_info())
> #Works
> dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features")
> model = dt.fit(df)
> 
> # Make predictions. Using same dataset, not splitting!!
> predictions = model.transform(result)
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
> # Select (prediction, true label) and compute test error
> evaluator = MulticlassClassificationEvaluator(
> labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (, 
> IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but 
> it does not have the number of values specified.', 
> 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t
>  at 
> org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t 
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t 
> at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at 
> org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t
>  at 
> 

[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2017-10-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217115#comment-16217115
 ] 

Nicholas Chammas commented on SPARK-13587:
--

To follow-up on my [earlier 
comment|https://issues.apache.org/jira/browse/SPARK-13587?focusedCommentId=15740419=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15740419],
 I created a [completely self-contained sample 
repo|https://github.com/massmutual/sample-pyspark-application] demonstrating a 
technique for bundling PySpark app dependencies in an isolated way. It's the 
technique that Ben, I, and several others discussed here in this JIRA issue.

https://github.com/massmutual/sample-pyspark-application

The approach has advantages (like letting you ship a completely isolated Python 
environment, so you don't even need Python installed on the workers) and 
disadvantages (requires YARN; increases job startup time). Hope some of you 
find the sample repo useful until Spark adds more "first-class" support for 
Python dependency isolation.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and 
> not easy when switching to a different environment)
> Python now has two different virtualenv implementations. One is the native 
> virtualenv; the other is through conda. This JIRA is about bringing these two 
> tools to the distributed environment.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22331) Make MLlib string params case-insensitive

2017-10-24 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216224#comment-16216224
 ] 

Weichen Xu edited comment on SPARK-22331 at 10/24/17 2:19 PM:
--

OK, it seems this does not break current code in Spark, but it could affect 
user extension code, e.g., a user creates their own `evaluator` class and in 
the `metricName` param uses `ParamValidators.inArray(Array("aa", "AA"))` as 
the allowed values. But `Param` is a developer API, so it also doesn't matter much.
OK, I agree to make this change if we do not find that it breaks something.


was (Author: weichenxu123):
OK, it seems not breaking current code in spark, but it is possible to 
influence user extension code. e.g., some user create his own `evaluator` class 
and in `metricName` param he use `ParamValidators.inArray(Array("aa", "AA"))` 
as allowedParams. 

> Make MLlib string params case-insensitive
> -
>
> Key: SPARK-22331
> URL: https://issues.apache.org/jira/browse/SPARK-22331
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> Some String params in ML are still case-sensitive, as they are checked by 
> ParamValidators.inArray.
> For consistency in user experience, there should be some general guideline on 
> whether String params in Spark MLlib are case-insensitive or not. 
> I'm leaning towards making all String params case-insensitive where possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-22342) refactor schedulerDriver registration

2017-10-24 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-22342:

Comment: was deleted

(was: @Arthur Rand fyi)

> refactor schedulerDriver registration
> -
>
> Key: SPARK-22342
> URL: https://issues.apache.org/jira/browse/SPARK-22342
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Stavros Kontopoulos
>
> This is an umbrella issue for working on:
> https://github.com/apache/spark/pull/13143
> and handling the multiple re-registration issue, which invalidates an offer.
> To test:
>  dcos spark run --verbose --name=spark-nohive  --submit-args="--driver-cores 
> 1 --conf spark.cores.max=1 --driver-memory 512M --class 
> org.apache.spark.examples.SparkPi http://.../spark-examples_2.11-2.2.0.jar;
> master log:
> I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3085 hierarchical.cpp:303] Added framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3085 hierarchical.cpp:412] Deactivated framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3090 hierarchical.cpp:380] Activated framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [  ]
> I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00 3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00 3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10036
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10035
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10034
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating 

[jira] [Comment Edited] (SPARK-22342) refactor schedulerDriver registration

2017-10-24 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216959#comment-16216959
 ] 

Stavros Kontopoulos edited comment on SPARK-22342 at 10/24/17 2:08 PM:
---

@Arthur Rand fyi


was (Author: skonto):
@Arthur Rand 

> refactor schedulerDriver registration
> -
>
> Key: SPARK-22342
> URL: https://issues.apache.org/jira/browse/SPARK-22342
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Stavros Kontopoulos
>
> This is an umbrella issue for working on:
> https://github.com/apache/spark/pull/13143
> and handling the multiple re-registration issue, which invalidates an offer.
> To test:
>  dcos spark run --verbose --name=spark-nohive  --submit-args="--driver-cores 
> 1 --conf spark.cores.max=1 --driver-memory 512M --class 
> org.apache.spark.examples.SparkPi http://.../spark-examples_2.11-2.2.0.jar;
> master log:
> I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3085 hierarchical.cpp:303] Added framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3085 hierarchical.cpp:412] Deactivated framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3090 hierarchical.cpp:380] Activated framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [  ]
> I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00 3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00 3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10036
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10035
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10034
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with 

[jira] [Commented] (SPARK-22342) refactor schedulerDriver registration

2017-10-24 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216959#comment-16216959
 ] 

Stavros Kontopoulos commented on SPARK-22342:
-

@Arthur Rand 

> refactor schedulerDriver registration
> -
>
> Key: SPARK-22342
> URL: https://issues.apache.org/jira/browse/SPARK-22342
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Stavros Kontopoulos
>
> This is an umbrella issue for working on:
> https://github.com/apache/spark/pull/13143
> and handling the multiple re-registration issue, which invalidates an offer.
> To test:
>  dcos spark run --verbose --name=spark-nohive  --submit-args="--driver-cores 
> 1 --conf spark.cores.max=1 --driver-memory 512M --class 
> org.apache.spark.examples.SparkPi http://.../spark-examples_2.11-2.2.0.jar;
> master log:
> I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3085 hierarchical.cpp:303] Added framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3085 hierarchical.cpp:412] Deactivated framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3090 hierarchical.cpp:380] Activated framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [  ]
> I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00 3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00 3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10036
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10035
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10034
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] 

[jira] [Updated] (SPARK-22342) refactor schedulerDriver registration

2017-10-24 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-22342:

Description: 
This is an umbrella issue for working on:
https://github.com/apache/spark/pull/13143

and handling the multiple re-registration issue, which invalidates an offer.

To test:
 dcos spark run --verbose --name=spark-nohive  --submit-args="--driver-cores 1 
--conf spark.cores.max=1 --driver-memory 512M --class 
org.apache.spark.examples.SparkPi http://.../spark-examples_2.11-2.2.0.jar;

master log:

I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 13:49:05.00  3085 hierarchical.cpp:303] Added framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 13:49:05.00  3085 hierarchical.cpp:412] Deactivated framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 13:49:05.00  3090 hierarchical.cpp:380] Activated framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 13:49:05.00  3087 master.cpp:2974] Subscribing framework Spark Pi 
with checkpointing disabled and capabilities [  ]
I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 13:49:05.00  3087 master.cpp:3083] Framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) 
at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over
I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
framework 'Spark Pi' at 
scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
framework 'Spark Pi' at 
scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
framework 'Spark Pi' at 
scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
framework 'Spark Pi' at 
scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with 
checkpointing disabled and capabilities [ ]
I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 13:49:05.00 3087 master.cpp:3083] Framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) 
at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over
I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) 
at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with 
checkpointing disabled and capabilities [ ]
I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 13:49:05.00 3087 master.cpp:3083] Framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) 
at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over
I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
9764beab-c90a-4b4f-b0ff-44c187851b34-O10039
I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
9764beab-c90a-4b4f-b0ff-44c187851b34-O10038
I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
9764beab-c90a-4b4f-b0ff-44c187851b34-O10037
I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
9764beab-c90a-4b4f-b0ff-44c187851b34-O10036
I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
9764beab-c90a-4b4f-b0ff-44c187851b34-O10035
I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
9764beab-c90a-4b4f-b0ff-44c187851b34-O10034
I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
framework 'Spark Pi' at 
scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with 
checkpointing disabled and capabilities [ ]
I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 13:49:05.00 3087 master.cpp:3083] Framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) 
at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over
I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with 
checkpointing disabled and capabilities [ ]
I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 

[jira] [Created] (SPARK-22342) refactor schedulerDriver registration

2017-10-24 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-22342:
---

 Summary: refactor schedulerDriver registration
 Key: SPARK-22342
 URL: https://issues.apache.org/jira/browse/SPARK-22342
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 2.2.0
Reporter: Stavros Kontopoulos


This is an umbrella issue for working on:
https://github.com/apache/spark/pull/13143

and for handling the multiple re-registration issue, which invalidates an offer.

```
I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 13:49:05.00  3085 hierarchical.cpp:303] Added framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 13:49:05.00  3085 hierarchical.cpp:412] Deactivated framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 13:49:05.00  3090 hierarchical.cpp:380] Activated framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 13:49:05.00  3087 master.cpp:2974] Subscribing framework Spark Pi 
with checkpointing disabled and capabilities [  ]
I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 13:49:05.00  3087 master.cpp:3083] Framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) 
at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over
I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
framework 'Spark Pi' at 
scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
framework 'Spark Pi' at 
scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
framework 'Spark Pi' at 
scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
framework 'Spark Pi' at 
scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with 
checkpointing disabled and capabilities [ ]
I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 13:49:05.00 3087 master.cpp:3083] Framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) 
at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over
I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) 
at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with 
checkpointing disabled and capabilities [ ]
I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 13:49:05.00 3087 master.cpp:3083] Framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) 
at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over
I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
9764beab-c90a-4b4f-b0ff-44c187851b34-O10039
I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
9764beab-c90a-4b4f-b0ff-44c187851b34-O10038
I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
9764beab-c90a-4b4f-b0ff-44c187851b34-O10037
I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
9764beab-c90a-4b4f-b0ff-44c187851b34-O10036
I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
9764beab-c90a-4b4f-b0ff-44c187851b34-O10035
I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
9764beab-c90a-4b4f-b0ff-44c187851b34-O10034
I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
framework 'Spark Pi' at 
scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with 
checkpointing disabled and capabilities [ ]
I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 13:49:05.00 3087 master.cpp:3083] Framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) 
at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over
I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi with 
checkpointing disabled and capabilities [ ]
I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
I1020 

[jira] [Commented] (SPARK-21043) Add unionByName API to Dataset

2017-10-24 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216953#comment-16216953
 ] 

Carlos Bribiescas commented on SPARK-21043:
---

I really like this feature.  Is there a motivation not to replace union with 
this functionality, other than backwards compatibility?

> Add unionByName API to Dataset
> --
>
> Key: SPARK-21043
> URL: https://issues.apache.org/jira/browse/SPARK-21043
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Takeshi Yamamuro
> Fix For: 2.3.0
>
>
> It would be useful to add unionByName which resolves columns by name, in 
> addition to the existing union (which resolves by position).
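
A minimal Scala sketch of the difference (assuming Spark 2.3.0, where this API 
landed, and spark-shell's implicit SparkSession {{spark}}; the column names are 
illustrative):

{code}
import spark.implicits._

val df1 = Seq(("a1", "b1")).toDF("a", "b")
val df2 = Seq(("b2", "a2")).toDF("b", "a")   // same columns, different order

df1.union(df2).show()        // resolves by position: "b2" lands in column "a"
df1.unionByName(df2).show()  // resolves by name: "a2" lands in column "a"
{code}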



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-24 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216950#comment-16216950
 ] 

Carlos Bribiescas commented on SPARK-22335:
---

I think if unionByName replaced union then it would be a solution. It's 
definitely a workaround... But as the API stands it feels like a bug, since 
Dataset is supposed to be typed.

Again, I suspect it has to do with the optimizer pushing the typing to a later 
step, after the union by column order happens.  If this is the root cause of 
the bug, I worry how else it is being manifested, that is, what other bugs it may 
cause.  I'll have to think about it a bit more.



> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>
> I see that union uses column order for a DF. This to me is "fine" since they 
> aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it is actually 
> giving the wrong result. If you try to access the members by name, it will 
> use the order. Here is a reproducible case (2.2.0):
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] 
> should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a 
> Dataset[AB] should be correctly mapped by type, not by column order
>abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though 
> columns were out of order.
>   abDs.map(_.a).show() // This is correct too
>   baDs.select("a","b").as[AB].union(abDs).show() // This is the same 
> workaround for linked issue, slightly modified.  However this seems wrong 
> since its supposed to be strongly typed
>   
>   baDs.rdd.toDF().as[AB].union(abDs).show()  // This however gives correct 
> result, which is logically inconsistent behavior
>   abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives 
> correct result
> {code}
> So it's inconsistent and a bug IMO.  And I'm not sure that the suggested 
> workaround is really fair, since I'm supposed to be getting values of type `AB`.  
> More importantly, I think the issue is bigger when you consider that it happens 
> even if you read from parquet (as you would expect), and that it's 
> inconsistent when going to/from an RDD.
> I imagine it's just lazily converting to the typed DS instead of doing so 
> initially.  So either that typing could be prioritized to happen before the 
> union, or the unioning of DFs could be done with column order taken into 
> account.  Again, this is speculation..
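
For reference, a minimal sketch of how this looks once {{unionByName}} from 
SPARK-21043 is available (2.3.0+), reusing the {{AB}} case class and Datasets 
from the snippet above; on 2.2.0 the explicit select-then-union workaround from 
the description remains the option:

{code}
// name-based resolution sidesteps the positional mismatch
abDs.unionByName(baDs).map(_.a).show()   // both rows keep a = "aThing"

// 2.2.0 workaround from the description: fix column order before the positional union
baDs.select("a", "b").as[AB].union(abDs).show()
{code}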



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22111) OnlineLDAOptimizer should filter out empty documents beforehand

2017-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22111:


Assignee: (was: Apache Spark)

> OnlineLDAOptimizer should filter out empty documents beforehand 
> 
>
> Key: SPARK-22111
> URL: https://issues.apache.org/jira/browse/SPARK-22111
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Minor
>
> OnlineLDAOptimizer should filter out empty documents beforehand in order to 
> make corpusSize, batchSize, and nonEmptyDocsN all refer to the same filtered 
> corpus with all non-empty docs. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22111) OnlineLDAOptimizer should filter out empty documents beforehand

2017-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22111:


Assignee: Apache Spark

> OnlineLDAOptimizer should filter out empty documents beforehand 
> 
>
> Key: SPARK-22111
> URL: https://issues.apache.org/jira/browse/SPARK-22111
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Minor
>
> OnlineLDAOptimizer should filter out empty documents beforehand in order to 
> make corpusSize, batchSize, and nonEmptyDocsN all refer to the same filtered 
> corpus with all non-empty docs. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22111) OnlineLDAOptimizer should filter out empty documents beforehand

2017-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216948#comment-16216948
 ] 

Apache Spark commented on SPARK-22111:
--

User 'akopich' has created a pull request for this issue:
https://github.com/apache/spark/pull/19565

> OnlineLDAOptimizer should filter out empty documents beforehand 
> 
>
> Key: SPARK-22111
> URL: https://issues.apache.org/jira/browse/SPARK-22111
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Minor
>
> OnlineLDAOptimizer should filter out empty documents beforehand in order to 
> make corpusSize, batchSize, and nonEmptyDocsN all refer to the same filtered 
> corpus with all non-empty docs. 
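
As a rough illustration of the proposed pre-filtering (not the actual patch in 
the PR above), assuming a corpus of the usual {{RDD[(Long, Vector)]}} shape that 
LDA consumes:

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// drop documents whose term-count vectors are all zero, so that corpusSize,
// batchSize and nonEmptyDocsN all refer to the same filtered corpus
def filterEmptyDocs(corpus: RDD[(Long, Vector)]): RDD[(Long, Vector)] =
  corpus.filter { case (_, termCounts) => termCounts.numNonzeros > 0 }
{code}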



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22118) Should prevent change epoch in success stage while there is some running stage

2017-10-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-22118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216908#comment-16216908
 ] 

Maciej Bryński commented on SPARK-22118:


I think this problem is resolved by: SPARK-20715


> Should prevent change epoch in success stage while there is some running stage
> --
>
> Key: SPARK-22118
> URL: https://issues.apache.org/jira/browse/SPARK-22118
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: SuYan
>Priority: Critical
>
> In 2.1, the epoch is changed when a stage succeeds, and this triggers the 
> MapOutputTracker to clean the cached broadcast.
> But when other stages are still running and want to get the mapstatus broadcast 
> value... isn't there a possibility that fetching the broadcast fails?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off

2017-10-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216877#comment-16216877
 ] 

Maciej Bryński edited comment on SPARK-22341 at 10/24/17 1:29 PM:
--

Because Spark 2.2.0 is working perfectly fine with such a configuration.

The problem is that Spark 2.3.0 is using the wrong user on the AM when trying to 
read files from HDFS.


was (Author: maver1ck):
Because Spark 2.2.0 is working perfectly fine with such a configuration.

The problem is that Spark 2.3.0 is using wrong user on AM.

> [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
> --
>
> Key: SPARK-22341
> URL: https://issues.apache.org/jira/browse/SPARK-22341
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Maciej Bryński
>
> I'm trying to run 2.3.0 (from master) on my yarn cluster.
> The result is:
> {code}
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> Permission denied: user=yarn, access=EXECUTE, 
> inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.<init>(ApplicationMaster.scala:216)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}
> I think the problem exists because I'm not using YARN impersonation, which 
> means that all jobs on the cluster are run as the yarn user.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (SPARK-22340) pyspark setJobGroup doesn't match java threads

2017-10-24 Thread Leif Walsh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leif Walsh updated SPARK-22340:
---
Description: 
With pyspark, {{sc.setJobGroup}}'s documentation says

{quote}
Assigns a group ID to all the jobs started by this thread until the group ID is 
set to a different value or cleared.
{quote}

However, this doesn't appear to be associated with Python threads, only with 
Java threads.  As such, a Python thread which calls this and then submits 
multiple jobs doesn't necessarily get its jobs associated with any particular 
spark job group.  For example:

{code}
def run_jobs():
sc.setJobGroup('hello', 'hello jobs')
x = sc.range(100).sum()
y = sc.range(1000).sum()
return x, y

import concurrent.futures
with concurrent.futures.ThreadPoolExecutor() as executor:
future = executor.submit(run_jobs)
sc.cancelJobGroup('hello')
future.result()
{code}

In this example, depending how the action calls on the Python side are 
allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be 
assigned the job group {{hello}}.

First, we should clarify the docs if this truly is the case.

Second, it would be really helpful if we could make the job group assignment 
reliable for a Python thread, though I’m not sure the best way to do this.  As 
it stands, job groups are pretty useless from the pyspark side, if we can't 
rely on this fact.

My only idea so far is to mimic the TLS behavior on the Python side and then 
patch every point where job submission may take place to pass that in, but this 
feels pretty brittle. In my experience with py4j, controlling threading there 
is a challenge. 
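
For comparison, a minimal Scala sketch of the JVM-side contract the 
documentation describes, where the group really is tied to the JVM thread that 
submits the jobs ({{sc}} is a SparkContext; the thread and group names are 
illustrative):

{code}
// setJobGroup stores the group in the calling JVM thread's local properties,
// so jobs pick it up only when the actions are submitted from that same thread
val worker = new Thread {
  override def run(): Unit = {
    sc.setJobGroup("hello", "hello jobs")
    sc.range(0, 100).sum()    // submitted from this JVM thread => group "hello"
    sc.range(0, 1000).sum()
  }
}
worker.start()
sc.cancelJobGroup("hello")    // cancellation targets the group, not a thread
worker.join()
{code}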

  was:
With pyspark, {{sc.setJobGroup}}'s documentation says

{quote}
Assigns a group ID to all the jobs started by this thread until the group ID is 
set to a different value or cleared.
{quote}

However, this doesn't appear to be associated with Python threads, only with 
Java threads.  As such, a Python thread which calls this and then submits 
multiple jobs doesn't necessarily get its jobs associated with any particular 
spark job group.  For example:

{code}
def run_jobs():
sc.setJobGroup('hello', 'hello jobs')
x = sc.range(100).sum()
y = sc.range(1000).sum()
return x, y

import concurrent.futures
with concurrent.futures.ThreadPoolExecutor() as executor:
future = executor.submit(run_jobs)
sc.cancelJobGroup('hello')
future.result()
{code}

In this example, depending how the action calls on the Python side are 
allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be 
assigned the job group {{hello}}.

First, we should clarify the docs if this truly is the case.

Second, it would be really helpful if this could be made the case.  As it 
stands, job groups are pretty useless from the pyspark side, if we can't rely 
on this fact.


> pyspark setJobGroup doesn't match java threads
> --
>
> Key: SPARK-22340
> URL: https://issues.apache.org/jira/browse/SPARK-22340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: Leif Walsh
>
> With pyspark, {{sc.setJobGroup}}'s documentation says
> {quote}
> Assigns a group ID to all the jobs started by this thread until the group ID 
> is set to a different value or cleared.
> {quote}
> However, this doesn't appear to be associated with Python threads, only with 
> Java threads.  As such, a Python thread which calls this and then submits 
> multiple jobs doesn't necessarily get its jobs associated with any particular 
> spark job group.  For example:
> {code}
> def run_jobs():
> sc.setJobGroup('hello', 'hello jobs')
> x = sc.range(100).sum()
> y = sc.range(1000).sum()
> return x, y
> import concurrent.futures
> with concurrent.futures.ThreadPoolExecutor() as executor:
> future = executor.submit(run_jobs)
> sc.cancelJobGroup('hello')
> future.result()
> {code}
> In this example, depending how the action calls on the Python side are 
> allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be 
> assigned the job group {{hello}}.
> First, we should clarify the docs if this truly is the case.
> Second, it would be really helpful if we could make the job group assignment 
> reliable for a Python thread, though I’m not sure the best way to do this.  
> As it stands, job groups are pretty useless from the pyspark side, if we 
> can't rely on this fact.
> My only idea so far is to mimic the TLS behavior on the Python side and then 
> patch every point where job submission may take place to pass that in, but 
> this feels pretty brittle. In my experience with py4j, controlling threading 
> there is a challenge. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off

2017-10-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216877#comment-16216877
 ] 

Maciej Bryński edited comment on SPARK-22341 at 10/24/17 1:20 PM:
--

Because Spark 2.2.0 is working perfectly fine with such a configuration.

The problem is that Spark 2.3.0 is using the wrong user on the AM.


was (Author: maver1ck):
Because Spark 2.2.0 is working perfectly fine with such a configuration.


> [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
> --
>
> Key: SPARK-22341
> URL: https://issues.apache.org/jira/browse/SPARK-22341
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Maciej Bryński
>
> I'm trying to run 2.3.0 (from master) on my yarn cluster.
> The result is:
> {code}
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> Permission denied: user=yarn, access=EXECUTE, 
> inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.(ApplicationMaster.scala:216)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}
> I think the problem exists because I'm not using YARN impersonation, which 
> means that all jobs on the cluster are run as the yarn user.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off

2017-10-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216877#comment-16216877
 ] 

Maciej Bryński commented on SPARK-22341:


Because Spark 2.2.0 is working perfectly fine with such a configuration.


> [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
> --
>
> Key: SPARK-22341
> URL: https://issues.apache.org/jira/browse/SPARK-22341
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Maciej Bryński
>
> I'm trying to run 2.3.0 (from master) on my yarn cluster.
> The result is:
> {code}
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> Permission denied: user=yarn, access=EXECUTE, 
> inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.<init>(ApplicationMaster.scala:216)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}
> I think the problem exists because I'm not using YARN impersonation, which 
> means that all jobs on the cluster are run as the yarn user.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off

2017-10-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22341:
--
Priority: Major  (was: Blocker)

> [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
> --
>
> Key: SPARK-22341
> URL: https://issues.apache.org/jira/browse/SPARK-22341
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Maciej Bryński
>
> I'm trying to run 2.3.0 (from master) on my yarn cluster.
> The result is:
> {code}
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> Permission denied: user=yarn, access=EXECUTE, 
> inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.<init>(ApplicationMaster.scala:216)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}
> I think the problem exists because I'm not using YARN impersonation, which 
> means that all jobs on the cluster are run as the yarn user.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off

2017-10-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216868#comment-16216868
 ] 

Sean Owen commented on SPARK-22341:
---

Yes, but why is that a Spark problem?

> [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off
> --
>
> Key: SPARK-22341
> URL: https://issues.apache.org/jira/browse/SPARK-22341
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Maciej Bryński
>Priority: Blocker
>
> I'm trying to run 2.3.0 (from master) on my yarn cluster.
> The result is:
> {code}
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> Permission denied: user=yarn, access=EXECUTE, 
> inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.<init>(ApplicationMaster.scala:216)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}
> I think the problem exists because I'm not using YARN impersonation, which 
> means that all jobs on the cluster are run as the yarn user.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)

2017-10-24 Thread Pranav Singhania (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216862#comment-16216862
 ] 

Pranav Singhania commented on SPARK-16599:
--

I've been temporarily able to avoid the error by using a local SparkSession 
declared in the function, rather than declaring it as globally accessible to all 
the functions in my object/class. If it was required in other functions, I passed 
it as a parameter to the particular function. Hope this helps to resolve the bug.

Cheers!
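
A minimal sketch of that workaround (the function and path are illustrative; 
{{getOrCreate}} reuses the active session rather than building a new one on 
every call):

{code}
import org.apache.spark.sql.SparkSession

// obtain the session inside the function instead of holding it in a global field
def countLines(path: String): Long = {
  val spark = SparkSession.builder().getOrCreate()
  spark.read.textFile(path).count()
}
{code}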

> java.util.NoSuchElementException: None.get  at at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
> --
>
> Key: SPARK-16599
> URL: https://issues.apache.org/jira/browse/SPARK-16599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: centos 6.7   spark 2.0
>Reporter: binde
>
> run a spark job with spark 2.0, error message
> Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): 
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22341) [2.3.0] cannot run Spark on Yarn when Yarn impersonation is turned off

2017-10-24 Thread JIRA
Maciej Bryński created SPARK-22341:
--

 Summary: [2.3.0] cannot run Spark on Yarn when Yarn impersonation 
is turned off
 Key: SPARK-22341
 URL: https://issues.apache.org/jira/browse/SPARK-22341
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 2.3.0
Reporter: Maciej Bryński
Priority: Blocker


I'm trying to run 2.3.0 (from master) on my yarn cluster.
The result is:
{code}
Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
Permission denied: user=yarn, access=EXECUTE, 
inode="/user/bi/.sparkStaging/application_1508815646088_0164":bi:hdfs:drwx--
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4387)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:855)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at 
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1990)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:219)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$7.apply(ApplicationMaster.scala:216)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.<init>(ApplicationMaster.scala:216)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:821)
at 
org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:842)
at 
org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
{code}

I think the problem exists because I'm not using YARN impersonation, which means 
that all jobs on the cluster are run as the yarn user.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22277) Chi Square selector garbling Vector content.

2017-10-24 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216623#comment-16216623
 ] 

Weichen Xu edited comment on SPARK-22277 at 10/24/17 12:45 PM:
---

[~cheburakshu] Your posted code has an issue:
{code}
df = spark.createDataFrame([
(7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
(8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
(9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
"clicked"])
{code}
Are these features categorical features?
If yes, then you cannot train on them directly with DecisionTreeClassifier, 
because they lack metadata (Binary / Nominal) and will be regarded as 
continuous features. You need to use `VectorIndexer` to process the categorical 
features first.
If no, then you cannot use ChiSqSelector to select topFeatures, because 
ChiSqSelector is only available for categorical features.
But your code above compares the two cases, which are contrary, so I am confused.

If these features are categorical, you can process them first with VectorIndexer; 
it will fill in metadata for the features, and then they can be processed by 
ChiSqSelector/DecisionTreeClassifier. Then I think no error will occur.
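
A sketch of that ordering in Scala for illustration (the column names mirror the 
Python snippet in the description below; {{df}} and the {{maxCategories}} value 
are assumptions):

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{ChiSqSelector, VectorIndexer}

// index the categorical features first so they carry Nominal/Binary metadata,
// then select and train on the indexed column
val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
val selector = new ChiSqSelector()
  .setNumTopFeatures(4)
  .setFeaturesCol("indexedFeatures")
  .setOutputCol("selectedFeatures")
  .setLabelCol("clicked")
val dt = new DecisionTreeClassifier()
  .setLabelCol("clicked")
  .setFeaturesCol("selectedFeatures")

val model = new Pipeline().setStages(Array(indexer, selector, dt)).fit(df)
{code}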


was (Author: weichenxu123):
[~cheburakshu] Your post code has some issue:
{code}
df = spark.createDataFrame([
(7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
(8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
(9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
"clicked"])
{code}
Are these features categorical features ?  
If yes, then you cannot directly training them by DecisionTreeClassifier, 
because they lack of metadata(Binary / Nominal) and will be regarded as 
continuous features.
If no, then you cannot use ChiSqSelector to select topFeatures, because 
ChiSqSelector is only available for categorical features.
But your above code compare the two cases, they're contrary, so I am confused.

If these features are categorical, you can process them first by VectorIndexer, 
it will fill metadata for the features first, and then process by 
ChiSqSelector/DecisionTreeClassifier, then I think no error will occur.

> Chi Square selector garbling Vector content.
> 
>
> Key: SPARK-22277
> URL: https://issues.apache.org/jira/browse/SPARK-22277
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> There is a difference in behavior when the chi-square selector is used versus 
> direct feature use in the decision tree classifier.
> In the code below, I have used the chi-square selector as a pass-through, but the 
> decision tree classifier is unable to process its output. However, it is able to 
> process the features when they are used directly.
> The example is pulled directly from the Apache Spark Python documentation.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> import sys
> df = spark.createDataFrame([
> (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
> (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
> (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
> "clicked"])
> # ChiSq selector will just be a pass-through. All four features in the input 
> # will be in the output also.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>  outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected" % 
> selector.getNumTopFeatures())
> from pyspark.ml.classification import DecisionTreeClassifier
> try:
> # Fails
> dt = 
> DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures")
> model = dt.fit(result)
> except:
> print(sys.exc_info())
> #Works
> dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features")
> model = dt.fit(df)
> 
> # Make predictions. Using same dataset, not splitting!!
> predictions = model.transform(result)
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
> # Select (prediction, true label) and compute test error
> evaluator = MulticlassClassificationEvaluator(
> labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (, 
> IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but 
> it does not have the number of values specified.', 
> 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t
>  at 
> 

[jira] [Updated] (SPARK-22118) Should prevent change epoch in success stage while there is some running stage

2017-10-24 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-22118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-22118:
---
Priority: Critical  (was: Major)

> Should prevent change epoch in success stage while there is some running stage
> --
>
> Key: SPARK-22118
> URL: https://issues.apache.org/jira/browse/SPARK-22118
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: SuYan
>Priority: Critical
>
> In 2.1, the epoch is changed when a stage succeeds, and this triggers the 
> MapOutputTracker to clean the cached broadcast.
> But when other stages are still running and want to get the mapstatus broadcast 
> value... isn't there a possibility that fetching the broadcast fails?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22118) Should prevent change epoch in success stage while there is some running stage

2017-10-24 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-22118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-22118:
---
Affects Version/s: 2.2.0

> Should prevent change epoch in success stage while there is some running stage
> --
>
> Key: SPARK-22118
> URL: https://issues.apache.org/jira/browse/SPARK-22118
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: SuYan
>Priority: Critical
>
> In 2.1, the epoch is changed when a stage succeeds, and this triggers the 
> MapOutputTracker to clean the cached broadcast.
> But when other stages are still running and want to get the mapstatus broadcast 
> value... isn't there a possibility that fetching the broadcast fails?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22118) Should prevent change epoch in success stage while there is some running stage

2017-10-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-22118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216768#comment-16216768
 ] 

Maciej Bryński commented on SPARK-22118:


I think I am hitting this problem:

{code}
2017-10-24 11:23:47,482 DEBUG [org.apache.spark.scheduler.DAGScheduler] - 
submitStage(ShuffleMapStage 52)
2017-10-24 11:23:47,484 DEBUG [org.apache.hadoop.hdfs.DFSClient] - DFSClient 
seqno: 2164 status: SUCCESS status: SUCCESS status: SUCCESS 
downstreamAckTimeNanos: 1761269
2017-10-24 11:23:47,484 DEBUG [org.apache.hadoop.hdfs.DFSClient] - DFSClient 
seqno: 2165 status: SUCCESS status: SUCCESS status: SUCCESS 
downstreamAckTimeNanos: 1631471
2017-10-24 11:23:47,485 DEBUG [org.apache.hadoop.hdfs.DFSClient] - DFSClient 
seqno: 2166 status: SUCCESS status: SUCCESS status: SUCCESS 
downstreamAckTimeNanos: 1608203
2017-10-24 11:23:47,544 INFO [org.apache.spark.storage.memory.MemoryStore] - 
Block broadcast_69 stored as values in memory (estimated size 514.2 KB, free 
897.0 MB)
2017-10-24 11:23:47,544 DEBUG [org.apache.spark.storage.BlockManager] - Put 
block broadcast_69 locally took  0 ms
2017-10-24 11:23:47,544 DEBUG [org.apache.spark.storage.BlockManager] - Putting 
block broadcast_69 without replication took  0 ms
2017-10-24 11:23:47,549 INFO [org.apache.spark.storage.memory.MemoryStore] - 
Block broadcast_69_piece0 stored as bytes in memory (estimated size 514.6 KB, 
free 896.5 MB)
2017-10-24 11:23:47,550 INFO [org.apache.spark.storage.BlockManagerInfo] - 
Added broadcast_69_piece0 in memory on 10.32.32.227:36396 (size: 514.6 KB, 
free: 910.7 MB)
2017-10-24 11:23:47,550 DEBUG [org.apache.spark.storage.BlockManagerMaster] - 
Updated info of block broadcast_69_piece0
2017-10-24 11:23:47,550 DEBUG [org.apache.spark.storage.BlockManager] - Told 
master about block broadcast_69_piece0
2017-10-24 11:23:47,550 DEBUG [org.apache.spark.storage.BlockManager] - Put 
block broadcast_69_piece0 locally took  1 ms
2017-10-24 11:23:47,550 DEBUG [org.apache.spark.storage.BlockManager] - Putting 
block broadcast_69_piece0 without replication took  1 ms
2017-10-24 11:23:47,550 INFO [org.apache.spark.MapOutputTracker] - Broadcast 
mapstatuses size = 430, actual size = 526530
2017-10-24 11:23:47,550 INFO [org.apache.spark.MapOutputTrackerMaster] - Size 
of output statuses for shuffle 14 is 430 bytes
2017-10-24 11:23:47,550 INFO [org.apache.spark.MapOutputTrackerMaster] - Epoch 
changed, not caching!
2017-10-24 11:23:47,550 DEBUG [org.apache.spark.broadcast.TorrentBroadcast] - 
Unpersisting TorrentBroadcast 69
2017-10-24 11:23:47,551 DEBUG 
[org.apache.spark.storage.BlockManagerSlaveEndpoint] - removing broadcast 69
2017-10-24 11:23:47,551 DEBUG [org.apache.spark.storage.BlockManager] - 
Removing broadcast 69
2017-10-24 11:23:47,551 DEBUG [org.apache.spark.storage.BlockManager] - 
Removing block broadcast_69
2017-10-24 11:23:47,551 DEBUG [org.apache.spark.storage.memory.MemoryStore] - 
Block broadcast_69 of size 526576 dropped from memory (free 940623361)
2017-10-24 11:23:47,551 DEBUG [org.apache.spark.storage.BlockManager] - 
Removing block broadcast_69_piece0
2017-10-24 11:23:47,551 DEBUG [org.apache.spark.storage.memory.MemoryStore] - 
Block broadcast_69_piece0 of size 526913 dropped from memory (free 941150274)
2017-10-24 11:23:47,551 DEBUG [org.apache.spark.MapOutputTrackerMaster] - 
cached status not found for : 14
2017-10-24 11:23:47,552 INFO [org.apache.spark.storage.BlockManagerInfo] - 
Removed broadcast_69_piece0 on 10.32.32.227:36396 in memory (size: 514.6 KB, 
free: 911.2 MB)
2017-10-24 11:23:47,554 DEBUG [org.apache.spark.storage.BlockManagerMaster] - 
Updated info of block broadcast_69_piece0
2017-10-24 11:23:47,554 DEBUG [org.apache.spark.storage.BlockManager] - Told 
master about block broadcast_69_piece0
2017-10-24 11:23:47,555 DEBUG 
[org.apache.spark.storage.BlockManagerSlaveEndpoint] - Done removing broadcast 
69, response is 0
2017-10-24 11:23:47,555 DEBUG 
[org.apache.spark.storage.BlockManagerSlaveEndpoint] - Sent response: 0 to 
10.32.32.227:53171
2017-10-24 11:23:47,556 INFO [org.apache.spark.MapOutputTrackerMasterEndpoint] 
- Asked to send map output locations for shuffle 14 to 10.32.32.28:39424
2017-10-24 11:23:47,556 DEBUG [org.apache.spark.MapOutputTrackerMaster] - 
Handling request to send map output locations for shuffle 14 to 
10.32.32.28:39424
2017-10-24 11:23:47,556 DEBUG [org.apache.spark.MapOutputTrackerMaster] - 
cached status not found for : 14
2017-10-24 11:23:47,607 DEBUG 
[org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint] - 
Launching task 31490 on executor id: 2 hostname: dwh-hn28.adpilot.co.
2017-10-24 11:23:47,608 DEBUG [org.apache.spark.ExecutorAllocationManager] - 
Clearing idle timer for 2 because it is now running a task
2017-10-24 11:23:47,633 DEBUG [org.apache.hadoop.ipc.Client] - IPC Client 
(766216029) connection to master/10.32.32.3:8032 from bi sending #1418
2017-10-24 

[jira] [Updated] (SPARK-22323) Design doc for different types of pandas_udf

2017-10-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22323:
--
Fix Version/s: (was: 2.3.0)

> Design doc for different types of pandas_udf
> 
>
> Key: SPARK-22323
> URL: https://issues.apache.org/jira/browse/SPARK-22323
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Li Jin
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21936) backward compatibility test framework for HiveExternalCatalog

2017-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216683#comment-16216683
 ] 

Apache Spark commented on SPARK-21936:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/19564

> backward compatibility test framework for HiveExternalCatalog
> -
>
> Key: SPARK-21936
> URL: https://issues.apache.org/jira/browse/SPARK-21936
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.1, 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22249) UnsupportedOperationException: empty.reduceLeft when caching a dataframe

2017-10-24 Thread Andreas Maier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216634#comment-16216634
 ] 

Andreas Maier commented on SPARK-22249:
---

Thank you for solving this issue so quickly. Not every open source project 
reacts this fast.
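
For anyone pinned to an affected release, a hedged workaround is simply to skip 
the empty {{isin()}} call, since an empty list can never match anyway. A minimal 
sketch built on the repro below, assuming an active {{spark}} session (the 
helper name is made up):

{code:python}
from pyspark.sql import Row, functions as F

def isin_safe(col, values):
    # Empty isin() on a cached DataFrame hits the empty.reduceLeft path on
    # affected versions; an empty list can never match, so short-circuit to a
    # literal false predicate instead.
    return col.isin(values) if values else F.lit(False)

df = spark.createDataFrame([Row(KEY="value")])
df.cache()
df.where(isin_safe(df["KEY"], [])).show()  # empty result, no exception
{code}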

> UnsupportedOperationException: empty.reduceLeft when caching a dataframe
> 
>
> Key: SPARK-22249
> URL: https://issues.apache.org/jira/browse/SPARK-22249
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: $ uname -a
> Darwin MAC-UM-024.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 
> 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64
> $ pyspark --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.2.0
>   /_/
> 
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_92
> Branch 
> Compiled by user jenkins on 2017-06-30T22:58:04Z
> Revision 
> Url 
>Reporter: Andreas Maier
>Assignee: Marco Gaido
> Fix For: 2.2.1, 2.3.0
>
>
> It seems that the {{isin()}} method with an empty list as argument only 
> works, if the dataframe is not cached. If it is cached, it results in an 
> exception. To reproduce
> {code:java}
> $ pyspark
> >>> df = spark.createDataFrame([pyspark.Row(KEY="value")])
> >>> df.where(df["KEY"].isin([])).show()
> +---+
> |KEY|
> +---+
> +---+
> >>> df.cache()
> DataFrame[KEY: string]
> >>> df.where(df["KEY"].isin([])).show()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/dataframe.py",
>  line 336, in show
> print(self._jdf.showString(n, 20))
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1133, in __call__
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/utils.py",
>  line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",
>  line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o302.showString.
> : java.lang.UnsupportedOperationException: empty.reduceLeft
>   at 
> scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180)
>   at 
> scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:48)
>   at 
> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74)
>   at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208)
>   at scala.collection.AbstractTraversable.reduce(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:107)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:71)
>   at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223)
>   at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:112)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:111)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.(InMemoryTableScanExec.scala:111)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307)
>   at 
> org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:303)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
>   at 
> 

[jira] [Comment Edited] (SPARK-22277) Chi Square selector garbling Vector content.

2017-10-24 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216623#comment-16216623
 ] 

Weichen Xu edited comment on SPARK-22277 at 10/24/17 9:58 AM:
--

[~cheburakshu] Your posted code has an issue:
{code}
df = spark.createDataFrame([
(7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
(8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
(9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
"clicked"])
{code}
Are these features categorical?
If yes, then you cannot train on them directly with DecisionTreeClassifier, 
because they lack the (Binary / Nominal) metadata and will be regarded as 
continuous features.
If no, then you cannot use ChiSqSelector to select the top features, because 
ChiSqSelector only works on categorical features.
But your code above compares the two cases, which contradict each other, so I am 
confused.

If these features are categorical, you can run them through VectorIndexer 
first; it will fill in the feature metadata, and the output can then be 
processed by ChiSqSelector/DecisionTreeClassifier without error (see the sketch 
below).
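
A minimal sketch of that suggestion (not taken from the comment itself), 
assuming the reporter's toy data and an active {{spark}} session; VectorIndexer 
with maxCategories=4 marks all four columns here as categorical:

{code:python}
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorIndexer, ChiSqSelector
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.linalg import Vectors

# Reporter's toy data.
df = spark.createDataFrame([
    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)], ["id", "features", "clicked"])

# VectorIndexer attaches Nominal/Binary attributes to low-cardinality columns,
# which is exactly the metadata DecisionTreeClassifier complains is missing.
indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                        maxCategories=4)
selector = ChiSqSelector(numTopFeatures=4, featuresCol="indexedFeatures",
                         outputCol="selectedFeatures", labelCol="clicked")
dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="selectedFeatures")

model = Pipeline(stages=[indexer, selector, dt]).fit(df)
model.transform(df).select("prediction", "clicked").show()
{code}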


was (Author: weichenxu123):
[~cheburakshu] Your posted code has an issue:
{code}
df = spark.createDataFrame([
(7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
(8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
(9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
"clicked"])
{code}
Are these features categorical?
If yes, then you cannot train on them directly with DecisionTreeClassifier, 
because they lack the (Binary / Nominal) metadata and will be regarded as 
continuous features.
If no, then you cannot use ChiSqSelector to select the top features, because 
ChiSqSelector only works on categorical features.
But your code above compares the two cases, which contradict each other, so I am 
confused.

> Chi Square selector garbling Vector content.
> 
>
> Key: SPARK-22277
> URL: https://issues.apache.org/jira/browse/SPARK-22277
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> There is a difference in behavior between running the Chi-square selector 
> before a decision tree classifier and using the features directly.
> In the code below, I have used the Chi-square selector as a pass-through, but 
> the decision tree classifier is unable to process its output, although it can 
> process the features when they are used directly.
> The example is pulled directly from the Apache Spark Python documentation.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> import sys
> df = spark.createDataFrame([
> (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
> (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
> (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
> "clicked"])
> # ChiSq selector will just be a pass-through. All four features in the input
> # will be in the output as well.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>  outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected" % 
> selector.getNumTopFeatures())
> from pyspark.ml.classification import DecisionTreeClassifier
> try:
> # Fails
> dt = 
> DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures")
> model = dt.fit(result)
> except:
> print(sys.exc_info())
> #Works
> dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features")
> model = dt.fit(df)
> 
> # Make predictions. Using same dataset, not splitting!!
> predictions = model.transform(result)
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
> # Select (prediction, true label) and compute test error
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> evaluator = MulticlassClassificationEvaluator(
> labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (, 
> IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but 
> it does not have the number of values specified.', 
> 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t
>  at 
> org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> 

[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.

2017-10-24 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216623#comment-16216623
 ] 

Weichen Xu commented on SPARK-22277:


[~cheburakshu] Your posted code has an issue:
{code}
df = spark.createDataFrame([
(7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
(8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
(9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
"clicked"])
{code}
Are these features categorical?
If yes, then you cannot train on them directly with DecisionTreeClassifier, 
because they lack the (Binary / Nominal) metadata and will be regarded as 
continuous features.
If no, then you cannot use ChiSqSelector to select the top features, because 
ChiSqSelector only works on categorical features.
But your code above compares the two cases, which contradict each other, so I am 
confused.

> Chi Square selector garbling Vector content.
> 
>
> Key: SPARK-22277
> URL: https://issues.apache.org/jira/browse/SPARK-22277
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> There is a difference in behavior between running the Chi-square selector 
> before a decision tree classifier and using the features directly.
> In the code below, I have used the Chi-square selector as a pass-through, but 
> the decision tree classifier is unable to process its output, although it can 
> process the features when they are used directly.
> The example is pulled directly from the Apache Spark Python documentation.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> import sys
> df = spark.createDataFrame([
> (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
> (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
> (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
> "clicked"])
> # ChiSq selector will just be a pass-through. All four features in the input
> # will be in the output as well.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>  outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected" % 
> selector.getNumTopFeatures())
> from pyspark.ml.classification import DecisionTreeClassifier
> try:
> # Fails
> dt = 
> DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures")
> model = dt.fit(result)
> except:
> print(sys.exc_info())
> #Works
> dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features")
> model = dt.fit(df)
> 
> # Make predictions. Using same dataset, not splitting!!
> predictions = model.transform(result)
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
> # Select (prediction, true label) and compute test error
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> evaluator = MulticlassClassificationEvaluator(
> labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (, 
> IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but 
> it does not have the number of values specified.', 
> 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t
>  at 
> org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t 
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t 
> at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at 
> org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t
>  at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at 
> org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at 
> sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> 

[jira] [Commented] (SPARK-22284) Code of class \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" grows beyond 64 KB

2017-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216611#comment-16216611
 ] 

Apache Spark commented on SPARK-22284:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19563

> Code of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> --
>
> Key: SPARK-22284
> URL: https://issues.apache.org/jira/browse/SPARK-22284
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Ben
> Attachments: 64KB Error.log
>
>
> I am using pySpark 2.1.0 in a production environment, and trying to join two 
> DataFrames, one of which is very large and has complex nested structures.
> Basically, I load both DataFrames and cache them.
> Then, in the large DataFrame, I extract 3 nested values and save them as 
> direct columns.
> Finally, I join on these three columns with the smaller DataFrame.
> This would be a short code for this:
> {code}
> dataFrame.read..cache()
> dataFrameSmall.read...cache()
> dataFrame = dataFrame.selectExpr(['*','nested.Value1 AS 
> Value1','nested.Value2 AS Value2','nested.Value3 AS Value3'])
> dataFrame = dataFrame.dropDuplicates().join(dataFrameSmall, 
> ['Value1','Value2','Value3'])
> dataFrame.count()
> {code}
> And this is the error I get when it gets to the count():
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in 
> stage 7.0 failed 4 times, most recent failure: Lost task 11.3 in stage 7.0 
> (TID 11234, somehost.com, executor 10): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> \"apply_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V\"
>  of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> {code}
> I have seen many tickets with similar issues here, but no proper solution. 
> Most of the fixes target versions up to Spark 2.1.0, so I don't know whether 
> running on Spark 2.2.0 would fix it. In any case I cannot change the Spark 
> version since it is in production.
> I have also tried setting 
> {code:java}
> spark.sql.codegen.wholeStage=false
> {code}
>  but still the same error.
> The job worked well up to now, also with large datasets, but apparently this 
> batch got larger, and that is the only thing that changed. Is there any 
> workaround for this?
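
For reference, a minimal sketch of how the flag mentioned above is normally 
applied, per session or at submit time; as the reporter notes, it did not 
resolve this particular job, and the file name below is a placeholder:

{code:python}
# Per session, before running the query (PySpark 2.x):
spark.conf.set("spark.sql.codegen.wholeStage", "false")

# Or at submit time:
#   spark-submit --conf spark.sql.codegen.wholeStage=false my_job.py
{code}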



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


