[jira] [Created] (SPARK-15817) Spark client picking hive 1.2.1 by default which failed to alter a table name

2016-06-08 Thread Nataraj Gorantla (JIRA)
Nataraj Gorantla created SPARK-15817:


 Summary: Spark client picking hive 1.2.1 by default which failed 
to alter a table name
 Key: SPARK-15817
 URL: https://issues.apache.org/jira/browse/SPARK-15817
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.6.1
Reporter: Nataraj Gorantla


Some of our Scala scripts are failing with the error below. 

FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. Invalid
method name: 'alter_table_with_cascade'
msg: org.apache.spark.sql.execution.QueryExecutionException: FAILED:

When invoked, Spark tries to initialize Hive 1.2.1 by default, while we have Hive 
0.14 installed. Some background investigation on our side explained this. 

Analysis
The "alter_table_with_cascade" error occurs because of a metastore version mismatch 
in Spark. 
To correct this error, set the proper metastore version in the Spark config.

I tried adding a couple of parameters to the spark-defaults.conf file. 


spark.sql.hive.metastore.version 0.14.0
#spark.sql.hive.metastore.jars maven
spark.sql.hive.metastore.jars =/usr/hdp/current/hive-client/lib

I still see issues. Can you please let me know if you have an alternative way to 
fix this issue? 

Thanks,
Nataraj G
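
For reference, a minimal sketch of the spark-defaults.conf entries this approach 
relies on, assuming an HDP-style layout where the Hive 0.14 client jars live under 
/usr/hdp/current/hive-client/lib (the paths are assumptions and must match the local 
installation; spark.sql.hive.metastore.jars takes a standard JVM classpath, which 
also needs the matching Hadoop jars):

{code}
# spark-defaults.conf -- sketch only, paths are illustrative
# Point Spark SQL at a Hive 0.14 metastore instead of the built-in 1.2.1 client
spark.sql.hive.metastore.version  0.14.0
# Option 1: let Spark resolve the matching Hive jars from Maven
# spark.sql.hive.metastore.jars   maven
# Option 2: a local classpath containing Hive 0.14 and the matching Hadoop jars
spark.sql.hive.metastore.jars     /usr/hdp/current/hive-client/lib/*:/usr/hdp/current/hadoop-client/*
{code}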




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15804) Manually added metadata not saving with parquet

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15804:


Assignee: Apache Spark

> Manually added metadata not saving with parquet
> ---
>
> Key: SPARK-15804
> URL: https://issues.apache.org/jira/browse/SPARK-15804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>Assignee: Apache Spark
>
> Adding metadata with col().as(_, metadata) and then saving the resulting 
> dataframe does not save the metadata. No error is thrown. The schema 
> contains the metadata before saving but not after saving and reloading the 
> dataframe. This was working fine with 1.6.1.
> {code}
> case class TestRow(a: String, b: Int)
> val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil
> val df = spark.createDataFrame(rows)
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.types.MetadataBuilder
> val md = new MetadataBuilder().putString("key", "value").build()
> val dfWithMeta = df.select(col("a"), col("b").as("b", md))
> println(dfWithMeta.schema.json)
> dfWithMeta.write.parquet("dfWithMeta")
> val dfWithMeta2 = spark.read.parquet("dfWithMeta")
> println(dfWithMeta2.schema.json)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15804) Manually added metadata not saving with parquet

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15804:


Assignee: (was: Apache Spark)

> Manually added metadata not saving with parquet
> ---
>
> Key: SPARK-15804
> URL: https://issues.apache.org/jira/browse/SPARK-15804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>
> Adding metadata with col().as(_, metadata) and then saving the resulting 
> dataframe does not save the metadata. No error is thrown. The schema 
> contains the metadata before saving but not after saving and reloading the 
> dataframe. This was working fine with 1.6.1.
> {code}
> case class TestRow(a: String, b: Int)
> val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil
> val df = spark.createDataFrame(rows)
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.types.MetadataBuilder
> val md = new MetadataBuilder().putString("key", "value").build()
> val dfWithMeta = df.select(col("a"), col("b").as("b", md))
> println(dfWithMeta.schema.json)
> dfWithMeta.write.parquet("dfWithMeta")
> val dfWithMeta2 = spark.read.parquet("dfWithMeta")
> println(dfWithMeta2.schema.json)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15804) Manually added metadata not saving with parquet

2016-06-08 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320180#comment-15320180
 ] 

kevin yu commented on SPARK-15804:
--

https://github.com/apache/spark/pull/13555

> Manually added metadata not saving with parquet
> ---
>
> Key: SPARK-15804
> URL: https://issues.apache.org/jira/browse/SPARK-15804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>
> Adding metadata with col().as(_, metadata) and then saving the resulting 
> dataframe does not save the metadata. No error is thrown. The schema 
> contains the metadata before saving but not after saving and reloading the 
> dataframe. This was working fine with 1.6.1.
> {code}
> case class TestRow(a: String, b: Int)
> val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil
> val df = spark.createDataFrame(rows)
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.types.MetadataBuilder
> val md = new MetadataBuilder().putString("key", "value").build()
> val dfWithMeta = df.select(col("a"), col("b").as("b", md))
> println(dfWithMeta.schema.json)
> dfWithMeta.write.parquet("dfWithMeta")
> val dfWithMeta2 = spark.read.parquet("dfWithMeta")
> println(dfWithMeta2.schema.json)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15804) Manually added metadata not saving with parquet

2016-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320182#comment-15320182
 ] 

Apache Spark commented on SPARK-15804:
--

User 'kevinyu98' has created a pull request for this issue:
https://github.com/apache/spark/pull/13555

> Manually added metadata not saving with parquet
> ---
>
> Key: SPARK-15804
> URL: https://issues.apache.org/jira/browse/SPARK-15804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>
> Adding metadata with col().as(_, metadata) and then saving the resulting 
> dataframe does not save the metadata. No error is thrown. The schema 
> contains the metadata before saving but not after saving and reloading the 
> dataframe. This was working fine with 1.6.1.
> {code}
> case class TestRow(a: String, b: Int)
> val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil
> val df = spark.createDataFrame(rows)
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.types.MetadataBuilder
> val md = new MetadataBuilder().putString("key", "value").build()
> val dfWithMeta = df.select(col("a"), col("b").as("b", md))
> println(dfWithMeta.schema.json)
> dfWithMeta.write.parquet("dfWithMeta")
> val dfWithMeta2 = spark.read.parquet("dfWithMeta")
> println(dfWithMeta2.schema.json)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading

2016-06-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320223#comment-15320223
 ] 

Sean Owen commented on SPARK-15813:
---

This is too small to make a JIRA for

> Spark Dyn Allocation Cancel log message misleading
> --
>
> Key: SPARK-15813
> URL: https://issues.apache.org/jira/browse/SPARK-15813
> Project: Spark
>  Issue Type: Bug
>Reporter: Peter Ableda
>Priority: Trivial
>
> The *Driver requested* message is logged before the *Canceling* message but 
> already shows the updated executor count, which makes the messages misleading.
> See log snippet:
> {code}
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
> 619 executor(s).
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor 
> containers
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 
> 0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 
> 0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 
> 0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 
> 0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 
> 0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 
> 0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 
> 0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 
> 0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 
> 0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 
> 0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
> 614 executor(s).
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor 
> containers
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 
> 0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 
> bytes)
> {code}
> An easy solution is to update the message to use the past tense, which is 
> consistent with the other messages there:
> *Canceled requests for 5 executor container(s).*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15817) Spark client picking hive 1.2.1 by default which failed to alter a table name

2016-06-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320225#comment-15320225
 ] 

Sean Owen commented on SPARK-15817:
---

Hm, what version did you build with? 

> Spark client picking hive 1.2.1 by default which failed to alter a table name
> -
>
> Key: SPARK-15817
> URL: https://issues.apache.org/jira/browse/SPARK-15817
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.1
>Reporter: Nataraj Gorantla
>
> Some of our Scala scripts are failing with the error below. 
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. Invalid
> method name: 'alter_table_with_cascade'
> msg: org.apache.spark.sql.execution.QueryExecutionException: FAILED:
> When invoked, Spark tries to initialize Hive 1.2.1 by default, while we have Hive 
> 0.14 installed. Some background investigation on our side explained this. 
> Analysis
> The "alter_table_with_cascade" error occurs because of a metastore version 
> mismatch in Spark. 
> To correct this error, set the proper metastore version in the Spark config.
> I tried adding a couple of parameters to the spark-defaults.conf file. 
> spark.sql.hive.metastore.version 0.14.0
> #spark.sql.hive.metastore.jars maven
> spark.sql.hive.metastore.jars =/usr/hdp/current/hive-client/lib
> I still see issues. Can you please let me know if you have an alternative way to 
> fix this issue? 
> Thanks,
> Nataraj G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10

2016-06-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15811:
--
Target Version/s:   (was: 2.0.0)
Priority: Critical  (was: Blocker)
   Fix Version/s: (was: 2.0.0)

> UDFs do not work in Spark 2.0-preview built with scala 2.10
> ---
>
> Key: SPARK-15811
> URL: https://issues.apache.org/jira/browse/SPARK-15811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Priority: Critical
>
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
> {code}
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
> {code}
> and then ran the following code in a pyspark shell
> {code}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 
> DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()
> {code}
> This never returns with a result. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15793) Word2vec in ML package should have maxSentenceLength method

2016-06-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15793:
--
Assignee: Xusen Yin

> Word2vec in ML package should have maxSentenceLength method
> ---
>
> Key: SPARK-15793
> URL: https://issues.apache.org/jira/browse/SPARK-15793
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>Priority: Minor
> Fix For: 2.0.0
>
>
> Word2vec in ML package should have maxSentenceLength method for feature 
> parity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15793) Word2vec in ML package should have maxSentenceLength method

2016-06-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15793.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13536
[https://github.com/apache/spark/pull/13536]

> Word2vec in ML package should have maxSentenceLength method
> ---
>
> Key: SPARK-15793
> URL: https://issues.apache.org/jira/browse/SPARK-15793
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Xusen Yin
>Priority: Minor
> Fix For: 2.0.0
>
>
> Word2vec in ML package should have maxSentenceLength method for feature 
> parity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15818) Upgrade to Hadoop 2.7.2

2016-06-08 Thread Adam Roberts (JIRA)
Adam Roberts created SPARK-15818:


 Summary: Upgrade to Hadoop 2.7.2
 Key: SPARK-15818
 URL: https://issues.apache.org/jira/browse/SPARK-15818
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Build
Affects Versions: 2.0.0
Reporter: Adam Roberts
Priority: Minor


I'd like us to use Hadoop 2.7.2 owing to the Hadoop release notes stating 
Hadoop 2.7.0 is not ready for production use

https://hadoop.apache.org/docs/r2.7.0/ states

"Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building 
upon the previous stable release 2.6.0.

This release is not yet ready for production use. Production users should use 
2.7.1 release and beyond."

Hadoop 2.7.1 release notes:

"Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building 
upon the previous release 2.7.0. This is the next stable release after Apache 
Hadoop 2.6.x."

And then Hadoop 2.7.2 release notes:

"Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building 
upon the previous stable release 2.7.1."

I've tested that this is OK with Intel hardware and IBM Java 8, so let's test it 
with OpenJDK; ideally this will be pushed to branch-2.0 and master.
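
For anyone who wants to try this locally before it lands, here is a hedged sketch 
of building against Hadoop 2.7.2; the hadoop-2.7 profile and the hadoop.version 
property are the usual build knobs, and the exact profile list below is illustrative:

{code}
./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver -DskipTests clean package
{code}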



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15818) Upgrade to Hadoop 2.7.2

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15818:


Assignee: (was: Apache Spark)

> Upgrade to Hadoop 2.7.2
> ---
>
> Key: SPARK-15818
> URL: https://issues.apache.org/jira/browse/SPARK-15818
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Adam Roberts
>Priority: Minor
>
> I'd like us to use Hadoop 2.7.2 owing to the Hadoop release notes stating 
> Hadoop 2.7.0 is not ready for production use
> https://hadoop.apache.org/docs/r2.7.0/ states
> "Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building 
> upon the previous stable release 2.6.0.
> This release is not yet ready for production use. Production users should use 
> 2.7.1 release and beyond."
> Hadoop 2.7.1 release notes:
> "Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building 
> upon the previous release 2.7.0. This is the next stable release after Apache 
> Hadoop 2.6.x."
> And then Hadoop 2.7.2 release notes:
> "Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building 
> upon the previous stable release 2.7.1."
> I've tested that this is OK with Intel hardware and IBM Java 8, so let's test 
> it with OpenJDK; ideally this will be pushed to branch-2.0 and master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15818) Upgrade to Hadoop 2.7.2

2016-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320253#comment-15320253
 ] 

Apache Spark commented on SPARK-15818:
--

User 'a-roberts' has created a pull request for this issue:
https://github.com/apache/spark/pull/13556

> Upgrade to Hadoop 2.7.2
> ---
>
> Key: SPARK-15818
> URL: https://issues.apache.org/jira/browse/SPARK-15818
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Adam Roberts
>Priority: Minor
>
> I'd like us to use Hadoop 2.7.2 owing to the Hadoop release notes stating 
> Hadoop 2.7.0 is not ready for production use
> https://hadoop.apache.org/docs/r2.7.0/ states
> "Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building 
> upon the previous stable release 2.6.0.
> This release is not yet ready for production use. Production users should use 
> 2.7.1 release and beyond."
> Hadoop 2.7.1 release notes:
> "Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building 
> upon the previous release 2.7.0. This is the next stable release after Apache 
> Hadoop 2.6.x."
> And then Hadoop 2.7.2 release notes:
> "Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building 
> upon the previous stable release 2.7.1."
> I've tested that this is OK with Intel hardware and IBM Java 8, so let's test 
> it with OpenJDK; ideally this will be pushed to branch-2.0 and master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15818) Upgrade to Hadoop 2.7.2

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15818:


Assignee: Apache Spark

> Upgrade to Hadoop 2.7.2
> ---
>
> Key: SPARK-15818
> URL: https://issues.apache.org/jira/browse/SPARK-15818
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Adam Roberts
>Assignee: Apache Spark
>Priority: Minor
>
> I'd like us to use Hadoop 2.7.2 owing to the Hadoop release notes stating 
> Hadoop 2.7.0 is not ready for production use
> https://hadoop.apache.org/docs/r2.7.0/ states
> "Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building 
> upon the previous stable release 2.6.0.
> This release is not yet ready for production use. Production users should use 
> 2.7.1 release and beyond."
> Hadoop 2.7.1 release notes:
> "Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building 
> upon the previous release 2.7.0. This is the next stable release after Apache 
> Hadoop 2.6.x."
> And then Hadoop 2.7.2 release notes:
> "Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building 
> upon the previous stable release 2.7.1."
> I've tested that this is OK with Intel hardware and IBM Java 8, so let's test 
> it with OpenJDK; ideally this will be pushed to branch-2.0 and master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15819) Add KMeanSummary in KMeans of PySpark

2016-06-08 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-15819:
--

 Summary: Add KMeanSummary in KMeans of PySpark
 Key: SPARK-15819
 URL: https://issues.apache.org/jira/browse/SPARK-15819
 Project: Spark
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Jeff Zhang


There's no corresponding Python API for KMeansSummary; it would be nice to have 
it. 
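
A rough sketch of what the PySpark usage could look like once a wrapper exists, 
mirroring the Scala KMeansModel.summary; note that model.summary and the attribute 
read from it are proposed/assumed names here, not an existing PySpark API:

{code}
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kmeans-summary-sketch").getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

model = KMeans(k=2, seed=1).fit(df)
summary = model.summary          # proposed API, mirroring Scala's KMeansModel.summary
print(summary.clusterSizes)      # assumed attribute, as in the Scala KMeansSummary
{code}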



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15819) Add KMeanSummary in KMeans of PySpark

2016-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320263#comment-15320263
 ] 

Apache Spark commented on SPARK-15819:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/13557

> Add KMeanSummary in KMeans of PySpark
> -
>
> Key: SPARK-15819
> URL: https://issues.apache.org/jira/browse/SPARK-15819
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>
> There's no corresponding Python API for KMeansSummary; it would be nice to 
> have it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15819) Add KMeanSummary in KMeans of PySpark

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15819:


Assignee: Apache Spark

> Add KMeanSummary in KMeans of PySpark
> -
>
> Key: SPARK-15819
> URL: https://issues.apache.org/jira/browse/SPARK-15819
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>
> There's no corresponding Python API for KMeansSummary; it would be nice to 
> have it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15819) Add KMeanSummary in KMeans of PySpark

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15819:


Assignee: (was: Apache Spark)

> Add KMeanSummary in KMeans of PySpark
> -
>
> Key: SPARK-15819
> URL: https://issues.apache.org/jira/browse/SPARK-15819
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>
> There's no corresponding Python API for KMeansSummary; it would be nice to 
> have it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15817) Spark client picking hive 1.2.1 by default which failed to alter a table name

2016-06-08 Thread Nataraj Gorantla (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320265#comment-15320265
 ] 

Nataraj Gorantla commented on SPARK-15817:
--

Sean, I didn't build this. I just downloaded the Spark client and extracted it. 
I'm not aware of the build version. 

Can you please let me know what other details you require? 

> Spark client picking hive 1.2.1 by default which failed to alter a table name
> -
>
> Key: SPARK-15817
> URL: https://issues.apache.org/jira/browse/SPARK-15817
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.1
>Reporter: Nataraj Gorantla
>
> Some of our Scala scripts are failing with the error below. 
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. Invalid
> method name: 'alter_table_with_cascade'
> msg: org.apache.spark.sql.execution.QueryExecutionException: FAILED:
> When invoked, Spark tries to initialize Hive 1.2.1 by default, while we have Hive 
> 0.14 installed. Some background investigation on our side explained this. 
> Analysis
> The "alter_table_with_cascade" error occurs because of a metastore version 
> mismatch in Spark. 
> To correct this error, set the proper metastore version in the Spark config.
> I tried adding a couple of parameters to the spark-defaults.conf file. 
> spark.sql.hive.metastore.version 0.14.0
> #spark.sql.hive.metastore.jars maven
> spark.sql.hive.metastore.jars =/usr/hdp/current/hive-client/lib
> I still see issues. Can you please let me know if you have an alternative way to 
> fix this issue? 
> Thanks,
> Nataraj G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15820) Add spark-SQL Catalog.refreshTable into python api

2016-06-08 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-15820:
--

 Summary: Add spark-SQL Catalog.refreshTable into python api
 Key: SPARK-15820
 URL: https://issues.apache.org/jira/browse/SPARK-15820
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Reporter: Weichen Xu


The Catalog.refreshTable API is missing from the Python interface for Spark SQL; 
add it.
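
For context, a sketch of the intended call, mirroring the Catalog.refreshTable that 
already exists on the Scala/Java side; "my_table" is a placeholder table name:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("refresh-table-sketch").getOrCreate()
# If the files backing a table change outside of Spark, invalidate the cached
# metadata so subsequent queries pick up the new data (proposed Python API):
spark.catalog.refreshTable("my_table")
{code}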





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15820) Add spark-SQL Catalog.refreshTable into python api

2016-06-08 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-15820:
---
Description: 
The Catalog.refreshTable API is missing from the Python interface for Spark SQL; 
add it.

see also: https://issues.apache.org/jira/browse/SPARK-15367

  was:
The Catalog.refreshTable API is missing from the Python interface for Spark SQL; 
add it.




> Add spark-SQL Catalog.refreshTable into python api
> --
>
> Key: SPARK-15820
> URL: https://issues.apache.org/jira/browse/SPARK-15820
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The Catalog.refreshTable API is missing from the Python interface for Spark SQL; 
> add it.
> see also: https://issues.apache.org/jira/browse/SPARK-15367



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15820) Add spark-SQL Catalog.refreshTable into python api

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15820:


Assignee: Apache Spark

> Add spark-SQL Catalog.refreshTable into python api
> --
>
> Key: SPARK-15820
> URL: https://issues.apache.org/jira/browse/SPARK-15820
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Weichen Xu
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The Catalog.refreshTable API is missing from the Python interface for Spark SQL; 
> add it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15820) Add spark-SQL Catalog.refreshTable into python api

2016-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320304#comment-15320304
 ] 

Apache Spark commented on SPARK-15820:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/13558

> Add spark-SQL Catalog.refreshTable into python api
> --
>
> Key: SPARK-15820
> URL: https://issues.apache.org/jira/browse/SPARK-15820
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The Catalog.refreshTable API is missing from the Python interface for Spark SQL; 
> add it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15820) Add spark-SQL Catalog.refreshTable into python api

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15820:


Assignee: (was: Apache Spark)

> Add spark-SQL Catalog.refreshTable into python api
> --
>
> Key: SPARK-15820
> URL: https://issues.apache.org/jira/browse/SPARK-15820
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The Catalog.refreshTable API is missing from the Python interface for Spark SQL; 
> add it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15820) Add spark-SQL Catalog.refreshTable into python api

2016-06-08 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-15820:
---
External issue ID:   (was: SPARK-15367)

> Add spark-SQL Catalog.refreshTable into python api
> --
>
> Key: SPARK-15820
> URL: https://issues.apache.org/jira/browse/SPARK-15820
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The Catalog.refreshTable API is missing from the Python interface for Spark SQL; 
> add it.
> see also: https://issues.apache.org/jira/browse/SPARK-15367



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15820) Add Catalog.refreshTable into python API

2016-06-08 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-15820:
---
Summary: Add Catalog.refreshTable into python API  (was: Add spark-SQL 
Catalog.refreshTable into python api)

> Add Catalog.refreshTable into python API
> 
>
> Key: SPARK-15820
> URL: https://issues.apache.org/jira/browse/SPARK-15820
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The Catalog.refreshTable API is missing from the Python interface for Spark SQL; 
> add it.
> see also: https://issues.apache.org/jira/browse/SPARK-15367



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15821) Should we use mvn -T for multithreaded Spark builds?

2016-06-08 Thread Adam Roberts (JIRA)
Adam Roberts created SPARK-15821:


 Summary: Should we use mvn -T for multithreaded Spark builds?
 Key: SPARK-15821
 URL: https://issues.apache.org/jira/browse/SPARK-15821
 Project: Spark
  Issue Type: Question
  Components: Build
Reporter: Adam Roberts
Priority: Minor


With Maven we can build Spark in a multithreaded way and benefit from reduced 
build times as a result.

On a machine with eight cores, I noticed the build time reduced from 20-25 
minutes to five minutes; this is by building with

mvn -T 1C -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests clean 
package

-T 1C says that we'll use one thread for each core available. I've never 
experienced a problem with this option (ranging from a single-core box to one 
with 192 cores available).

Should we use this for building Spark quicker, or is the Jenkins job 
deliberately set up such that each "executor" is needed for each pull request 
and we wouldn't see an improvement anyway? 

This can be discovered by checking core utilization across the farm, and it 
could potentially reduce our build times.

Here's more information on the feature: 
https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3

If this isn't suitable for the current farm then I think we should document it 
for those building Spark from source.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15086) Update Java API once the Scala one is finalized

2016-06-08 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320351#comment-15320351
 ] 

Weichen Xu commented on SPARK-15086:


I think the Java API should be the same as the Scala API where possible, so the 
JavaSparkContext.longAccumulator method should return a LongAccumulator object, 
not the deprecated Accumulator[Long]; other methods can be modified similarly. 
If such a modification is OK, I can do it. Thanks!

> Update Java API once the Scala one is finalized
> ---
>
> Key: SPARK-15086
> URL: https://issues.apache.org/jira/browse/SPARK-15086
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Priority: Blocker
>
> We should make sure we update the Java API once the Scala one is finalized. 
> This includes adding the equivalent API in Java as well as deprecating the 
> old ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14146) Imported implicits can't be found in Spark REPL in some cases

2016-06-08 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320357#comment-15320357
 ] 

Prashant Sharma commented on SPARK-14146:
-

So I tried that option; it looks like it does not help either. 
Here is the branch: 
https://github.com/ScrapCodes/spark/tree/SPARK-14146/import-fix

Unexplored ideas that may fix this issue:
-Yunused-imports is one unexplored territory.
I am not sure if replacing a semicolon in the input with \n would work, because 
sometimes the input can be an XML (or even scala-xml) literal. 


And I will think of more options.

Thanks,

> Imported implicits can't be found in Spark REPL in some cases
> -
>
> Key: SPARK-14146
> URL: https://issues.apache.org/jira/browse/SPARK-14146
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>
> {code}
> class I(i: Int) {
>   def double: Int = i * 2
> }
> class Context {
>   implicit def toI(i: Int): I = new I(i)
> }
> val c = new Context
> import c._
> // OK
> 1.double
> // Fail
> class A; 1.double
> {code}
> The above code snippets can work in Scala REPL however.
> This will affect our Dataset functionality, for example:
> {code}
> class A; Seq(1 -> "a").toDS() // fail
> {code}
> or in paste mode:
> {code}
> :paste
> class A
> Seq(1 -> "a").toDS() // fail
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String with spark.memory.offHeap.enabled=true

2016-06-08 Thread Pete Robbins (JIRA)
Pete Robbins created SPARK-15822:


 Summary: segmentation violation in o.a.s.unsafe.types.UTF8String 
with spark.memory.offHeap.enabled=true
 Key: SPARK-15822
 URL: https://issues.apache.org/jira/browse/SPARK-15822
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
 Environment: linux amd64

openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)

Reporter: Pete Robbins
Priority: Critical


Executors fail with segmentation violation while running application with
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 512m

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
#
# JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# J 4816 C2 
org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
 (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]

We initially saw this with IBM Java on a PowerPC box, but it is recreatable on 
Linux with OpenJDK. On Linux with IBM Java 8 we see a null pointer exception at 
the same code point:

16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
java.lang.NullPointerException
at 
org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
at 
org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
at 
org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:785)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String with spark.memory.offHeap.enabled=true

2016-06-08 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320380#comment-15320380
 ] 

Pete Robbins commented on SPARK-15822:
--

I'm investigating this and will attach the app and config later

> segmentation violation in o.a.s.unsafe.types.UTF8String with 
> spark.memory.offHeap.enabled=true
> --
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Priority: Critical
>
> Executors fail with segmentation violation while running application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> We initially saw this with IBM Java on a PowerPC box, but it is recreatable on 
> Linux with OpenJDK. On Linux with IBM Java 8 we see a null pointer exception at 
> the same code point:
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15821) Should we use mvn -T for multithreaded Spark builds?

2016-06-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320385#comment-15320385
 ] 

Sean Owen commented on SPARK-15821:
---

For building (i.e. compiling and packaging) -- sure. For testing, not sure, 
since I would bet some tests somewhere actually conflict (i.e. holding the same 
lock on a local DB or something), but, wouldn't hurt to try.

> Should we use mvn -T for multithreaded Spark builds?
> 
>
> Key: SPARK-15821
> URL: https://issues.apache.org/jira/browse/SPARK-15821
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Reporter: Adam Roberts
>Priority: Minor
>
> With Maven we can build Spark in a multithreaded way and benefit from 
> reduced build times as a result.
> On a machine with eight cores, I noticed the build time reduced from 20-25 
> minutes to five minutes; this is by building with
> mvn -T 1C -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests clean 
> package
> -T 1C says that we'll use one thread for each core available. I've never 
> experienced a problem with this option (ranging from a single-core box to 
> one with 192 cores available).
> Should we use this for building Spark quicker, or is the Jenkins job 
> deliberately set up such that each "executor" is needed for each pull request 
> and we wouldn't see an improvement anyway? 
> This can be discovered by checking core utilization across the farm, and it 
> could potentially reduce our build times.
> Here's more information on the feature: 
> https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3
> If this isn't suitable for the current farm then I think we should document 
> it for those building Spark from source.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15086) Update Java API once the Scala one is finalized

2016-06-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320386#comment-15320386
 ] 

Sean Owen commented on SPARK-15086:
---

Yes, that's clear, but this doesn't address the 3 questions above, which are 
the stickier questions about what to do here.

> Update Java API once the Scala one is finalized
> ---
>
> Key: SPARK-15086
> URL: https://issues.apache.org/jira/browse/SPARK-15086
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Priority: Blocker
>
> We should make sure we update the Java API once the Scala one is finalized. 
> This includes adding the equivalent API in Java as well as deprecating the 
> old ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15823) Add @property for 'property' in MulticlassMetrics

2016-06-08 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-15823:


 Summary: Add @property for 'property' in MulticlassMetrics
 Key: SPARK-15823
 URL: https://issues.apache.org/jira/browse/SPARK-15823
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: zhengruifeng
Priority: Minor


'accuracy' should be decorated with `@property` to keep in step with the other 
methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, 
`weightedRecall`, etc.
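
A minimal, self-contained sketch of the pattern being requested (not the actual 
pyspark.mllib.evaluation source); ExampleMetrics is a made-up class used only to 
illustrate attribute-style access via @property:

{code}
class ExampleMetrics(object):
    """Illustration only: expose a computed metric as an attribute, not a method."""

    def __init__(self, correct, total):
        self._correct = correct
        self._total = total

    @property
    def accuracy(self):
        # Accessed as metrics.accuracy (no parentheses), matching how
        # weightedPrecision and weightedRecall are already exposed.
        return float(self._correct) / self._total


metrics = ExampleMetrics(correct=8, total=10)
print(metrics.accuracy)  # 0.8
{code}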



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15823) Add @property for 'property' in MulticlassMetrics

2016-06-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320392#comment-15320392
 ] 

Sean Owen commented on SPARK-15823:
---

OK. Can you review these changes for similar issues in one pass?

> Add @property for 'property' in MulticlassMetrics
> -
>
> Key: SPARK-15823
> URL: https://issues.apache.org/jira/browse/SPARK-15823
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Minor
>
> 'accuracy' should be decorated with `@property` to keep in step with the other 
> methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, 
> `weightedRecall`, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15823) Add @property for 'property' in MulticlassMetrics

2016-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320393#comment-15320393
 ] 

Apache Spark commented on SPARK-15823:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/13560

> Add @property for 'property' in MulticlassMetrics
> -
>
> Key: SPARK-15823
> URL: https://issues.apache.org/jira/browse/SPARK-15823
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Minor
>
> 'accuracy' should be decorated with `@property` to keep in step with the other 
> methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, 
> `weightedRecall`, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15823) Add @property for 'property' in MulticlassMetrics

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15823:


Assignee: Apache Spark

> Add @property for 'property' in MulticlassMetrics
> -
>
> Key: SPARK-15823
> URL: https://issues.apache.org/jira/browse/SPARK-15823
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> 'accuracy' should be decorated with `@property` to keep in step with the other 
> methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, 
> `weightedRecall`, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15823) Add @property for 'property' in MulticlassMetrics

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15823:


Assignee: (was: Apache Spark)

> Add @property for 'property' in MulticlassMetrics
> -
>
> Key: SPARK-15823
> URL: https://issues.apache.org/jira/browse/SPARK-15823
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Minor
>
> 'accuracy' should be decorated with `@property` to keep in step with the other 
> methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, 
> `weightedRecall`, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15824) Run 'with ... insert ... select' failed when use spark thriftserver

2016-06-08 Thread Weizhong (JIRA)
Weizhong created SPARK-15824:


 Summary: Run 'with ... insert ... select' failed when use spark 
thriftserver
 Key: SPARK-15824
 URL: https://issues.apache.org/jira/browse/SPARK-15824
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Weizhong
Priority: Minor


{code:sql}
create table src(k int, v int);
create table src_parquet(k int, v int);
with v as (select 1, 2) insert into table src_parquet from src;
{code}
This will throw the exception: spark.sql.execution.id is already set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15816) SQL server based on Postgres protocol

2016-06-08 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320416#comment-15320416
 ] 

Takeshi Yamamuro commented on SPARK-15816:
--

Oh, this is a good starting point for that.

About Q1:
We can see the specification of the PostgreSQL frontend/backend wire protocol 
(what's called the `v3 protocol`) here:
https://www.postgresql.org/docs/9.5/static/protocol.html

About Q2:
If we limit the scope of the implementation, it seems feasible to do that.
Actually, `prestogres`, which is a stand-alone gateway server for Presto, takes 
the same approach as this.
It needs some workarounds, though it has succeeded in implementing the gateway 
via the v3 protocol.
https://github.com/treasure-data/prestogres

About Q4:
Since the `postgresql-jdbc` driver implicitly queries PostgreSQL system catalogs 
to answer system commands and the like,
we need to handle these queries for things to work well. This seems to be one of 
the hacky points of the implementation.


I looked for other related implementations and found the H2 database;
it seems we can use the `postgresql-jdbc` driver to connect to it, so this is 
another reference for this discussion.
http://www.h2database.com/html/advanced.html#odbc_driver


> SQL server based on Postgres protocol
> -
>
> Key: SPARK-15816
> URL: https://issues.apache.org/jira/browse/SPARK-15816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> At Spark Summit today this idea came up from a discussion: it would be great 
> to investigate the possibility of implementing a new SQL server using 
> Postgres' protocol, in lieu of Hive ThriftServer 2. I'm creating this ticket 
> to track this idea, in case others have feedback.
> This server can have a simpler architecture, and allows users to leverage a 
> wide range of tools that are already available for Postgres (and many 
> commercial database systems based on Postgres).
> Some of the problems we'd need to figure out are:
> 1. What is the Postgres protocol? Is there an official documentation for it?
> 2. How difficult would it be to implement that protocol in Spark (JVM in 
> particular).
> 3. How does data type mapping work?
> 4. How do system commands work? Would Spark need to support all of 
> Postgres' commands?
> 5. Any restrictions in supporting nested data?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15086) Update Java API once the Scala one is finalized

2016-06-08 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320428#comment-15320428
 ] 

Weichen Xu commented on SPARK-15086:


So, if we have to keep Java API compatibility with the old version, things become 
difficult.
Could we instead create a new class such as JavaSparkContextV2 with a new API?
That way each Java API could be designed to match the Scala one.

> Update Java API once the Scala one is finalized
> ---
>
> Key: SPARK-15086
> URL: https://issues.apache.org/jira/browse/SPARK-15086
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Priority: Blocker
>
> We should make sure we update the Java API once the Scala one is finalized. 
> This includes adding the equivalent API in Java as well as deprecating the 
> old ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13566) Deadlock between MemoryStore and BlockManager

2016-06-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-13566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320431#comment-15320431
 ] 

Josef Lindman Hörnlund commented on SPARK-13566:


Do we know if this affects 1.5 as well? 

> Deadlock between MemoryStore and BlockManager
> -
>
> Key: SPARK-13566
> URL: https://issues.apache.org/jira/browse/SPARK-13566
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.0
> Environment: Spark 1.6.0 hadoop2.2.0 jdk1.8.0_65 centOs 6.2
>Reporter: cen yuhai
>Assignee: cen yuhai
> Fix For: 1.6.2
>
>
> ===
> "block-manager-slave-async-thread-pool-1":
> at org.apache.spark.storage.MemoryStore.remove(MemoryStore.scala:216)
> - waiting to lock <0x0005895b09b0> (a 
> org.apache.spark.memory.UnifiedMemoryManager)
> at 
> org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1114)
> - locked <0x00058ed6aae0> (a org.apache.spark.storage.BlockInfo)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1101)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1101)
> at scala.collection.immutable.Set$Set2.foreach(Set.scala:94)
> at 
> org.apache.spark.storage.BlockManager.removeBroadcast(BlockManager.scala:1101)
> at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply$mcI$sp(BlockManagerSlaveEndpoint.scala:65)
> at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65)
> at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65)
> at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:84)
> at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> "Executor task launch worker-10":
> at 
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1032)
> - waiting to lock <0x00059a0988b8> (a 
> org.apache.spark.storage.BlockInfo)
> at 
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1009)
> at 
> org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:460)
> at 
> org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:449)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15824) Run 'with ... insert ... select' failed when using spark thriftserver

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15824:


Assignee: (was: Apache Spark)

> Run 'with ... insert ... select' failed when using spark thriftserver
> ---
>
> Key: SPARK-15824
> URL: https://issues.apache.org/jira/browse/SPARK-15824
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Weizhong
>Priority: Minor
>
> {code:sql}
> create table src(k int, v int);
> create table src_parquet(k int, v int);
> with v as (select 1, 2) insert into table src_parquet from src;
> {code}
> Will throw exception: spark.sql.execution.id is already set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15824) Run 'with ... insert ... select' failed when using spark thriftserver

2016-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320456#comment-15320456
 ] 

Apache Spark commented on SPARK-15824:
--

User 'Sephiroth-Lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13561

> Run 'with ... insert ... select' failed when using spark thriftserver
> ---
>
> Key: SPARK-15824
> URL: https://issues.apache.org/jira/browse/SPARK-15824
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Weizhong
>Priority: Minor
>
> {code:sql}
> create table src(k int, v int);
> create table src_parquet(k int, v int);
> with v as (select 1, 2) insert into table src_parquet from src;
> {code}
> Will throw exception: spark.sql.execution.id is already set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15824) Run 'with ... insert ... select' failed when using spark thriftserver

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15824:


Assignee: Apache Spark

> Run 'with ... insert ... select' failed when using spark thriftserver
> ---
>
> Key: SPARK-15824
> URL: https://issues.apache.org/jira/browse/SPARK-15824
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Weizhong
>Assignee: Apache Spark
>Priority: Minor
>
> {code:sql}
> create table src(k int, v int);
> create table src_parquet(k int, v int);
> with v as (select 1, 2) insert into table src_parquet from src;
> {code}
> Will throw exception: spark.sql.execution.id is already set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10

2016-06-08 Thread Gabor Ratky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320466#comment-15320466
 ] 

Gabor Ratky commented on SPARK-15811:
-

I was able to reproduce the issue on a Databricks cluster using Spark 2.0 
(apache/branch-2.0 preview). I also tested whether using 
{{sqlContext.registerFunction}} solves the problem, but the issue persisted.
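
For reference, a sketch of the {{sqlContext.registerFunction}} variant mentioned 
above (my own reconstruction of the test, not the exact code that was run; per 
the comment it showed the same hang on the scala-2.10 build).

{code}
# Sketch of the alternative registration path that was tried: registering the
# Python UDF through the SQLContext instead of pyspark.sql.functions.udf.
from pyspark.sql import SparkSession, SQLContext, Row
from pyspark.sql.types import IntegerType, StructField, StructType

spark = SparkSession.builder.master("local[4]").appName("udf-repro").getOrCreate()
sqlContext = SQLContext(spark.sparkContext)

schema = StructType([StructField("a", IntegerType(), False)])
df = spark.createDataFrame([Row(a=1), Row(a=2)], schema)
df.createOrReplaceTempView("t")  # 2.0 API (registerTempTable in 1.x)

sqlContext.registerFunction("add_one", lambda x: x + 1, IntegerType())
print(spark.sql("SELECT add_one(a) AS incremented FROM t").collect())
{code}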

> UDFs do not work in Spark 2.0-preview built with scala 2.10
> ---
>
> Key: SPARK-15811
> URL: https://issues.apache.org/jira/browse/SPARK-15811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Priority: Critical
>
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
> {code}
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
> {code}
> and then ran the following code in a pyspark shell
> {code}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 
> DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()
> {code}
> This never returns a result. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15086) Update Java API once the Scala one is finalized

2016-06-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320495#comment-15320495
 ] 

Sean Owen commented on SPARK-15086:
---

I think we can alter the API if that makes sense. Those really aren't the 
tricky questions here. See above.

> Update Java API once the Scala one is finalized
> ---
>
> Key: SPARK-15086
> URL: https://issues.apache.org/jira/browse/SPARK-15086
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Priority: Blocker
>
> We should make sure we update the Java API once the Scala one is finalized. 
> This includes adding the equivalent API in Java as well as deprecating the 
> old ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15821) Should we use mvn -T for multithreaded Spark builds?

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15821:


Assignee: Apache Spark

> Should we use mvn -T for multithreaded Spark builds?
> 
>
> Key: SPARK-15821
> URL: https://issues.apache.org/jira/browse/SPARK-15821
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Reporter: Adam Roberts
>Assignee: Apache Spark
>Priority: Minor
>
> With Maven we can build Spark in a multithreaded way and benefit from 
> shorter build times as a result.
> On a machine with eight cores, I saw the build time drop from 20-25 
> minutes to five minutes by building with
> mvn -T 1C -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests clean 
> package
> -T 1C tells Maven to use one thread per available core; I've never 
> experienced a problem with this option (on anything from a single-core 
> box to one with 192 cores).
> Should we use this to build Spark more quickly, or is the Jenkins job 
> deliberately set up such that each "executor" is needed for each pull request 
> and we wouldn't see an improvement anyway? 
> This can be discovered by checking core utilization across the farm, and it 
> can potentially reduce our build times.
> Here's more information on the feature: 
> https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3
> If this isn't suitable for the current farm, then I think we should document 
> it for those building Spark from source.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15821) Should we use mvn -T for multithreaded Spark builds?

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15821:


Assignee: (was: Apache Spark)

> Should we use mvn -T for multithreaded Spark builds?
> 
>
> Key: SPARK-15821
> URL: https://issues.apache.org/jira/browse/SPARK-15821
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Reporter: Adam Roberts
>Priority: Minor
>
> With Maven we can build Spark in a multithreaded way and benefit from 
> shorter build times as a result.
> On a machine with eight cores, I saw the build time drop from 20-25 
> minutes to five minutes by building with
> mvn -T 1C -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests clean 
> package
> -T 1C tells Maven to use one thread per available core; I've never 
> experienced a problem with this option (on anything from a single-core 
> box to one with 192 cores).
> Should we use this to build Spark more quickly, or is the Jenkins job 
> deliberately set up such that each "executor" is needed for each pull request 
> and we wouldn't see an improvement anyway? 
> This can be discovered by checking core utilization across the farm, and it 
> can potentially reduce our build times.
> Here's more information on the feature: 
> https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3
> If this isn't suitable for the current farm, then I think we should document 
> it for those building Spark from source.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15821) Should we use mvn -T for multithreaded Spark builds?

2016-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320604#comment-15320604
 ] 

Apache Spark commented on SPARK-15821:
--

User 'a-roberts' has created a pull request for this issue:
https://github.com/apache/spark/pull/13562

> Should we use mvn -T for multithreaded Spark builds?
> 
>
> Key: SPARK-15821
> URL: https://issues.apache.org/jira/browse/SPARK-15821
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Reporter: Adam Roberts
>Priority: Minor
>
> With Maven we can build Spark in a multithreaded way and benefit from 
> shorter build times as a result.
> On a machine with eight cores, I saw the build time drop from 20-25 
> minutes to five minutes by building with
> mvn -T 1C -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests clean 
> package
> -T 1C tells Maven to use one thread per available core; I've never 
> experienced a problem with this option (on anything from a single-core 
> box to one with 192 cores).
> Should we use this to build Spark more quickly, or is the Jenkins job 
> deliberately set up such that each "executor" is needed for each pull request 
> and we wouldn't see an improvement anyway? 
> This can be discovered by checking core utilization across the farm, and it 
> can potentially reduce our build times.
> Here's more information on the feature: 
> https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3
> If this isn't suitable for the current farm, then I think we should document 
> it for those building Spark from source.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15821) Should we use mvn -T for multithreaded Spark builds?

2016-06-08 Thread Adam Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320614#comment-15320614
 ] 

Adam Roberts commented on SPARK-15821:
--

Agreed with your comment on tests; the above pull request is for README.md and 
building Spark (perhaps better placed in the Building Spark section).

> Should we use mvn -T for multithreaded Spark builds?
> 
>
> Key: SPARK-15821
> URL: https://issues.apache.org/jira/browse/SPARK-15821
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Reporter: Adam Roberts
>Priority: Minor
>
> With Maven we can build Spark in a multithreaded way and benefit from 
> shorter build times as a result.
> On a machine with eight cores, I saw the build time drop from 20-25 
> minutes to five minutes by building with
> mvn -T 1C -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests clean 
> package
> -T 1C tells Maven to use one thread per available core; I've never 
> experienced a problem with this option (on anything from a single-core 
> box to one with 192 cores).
> Should we use this to build Spark more quickly, or is the Jenkins job 
> deliberately set up such that each "executor" is needed for each pull request 
> and we wouldn't see an improvement anyway? 
> This can be discovered by checking core utilization across the farm, and it 
> can potentially reduce our build times.
> Here's more information on the feature: 
> https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3
> If this isn't suitable for the current farm, then I think we should document 
> it for those building Spark from source.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11183) enable support for mesos 0.24+

2016-06-08 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320840#comment-15320840
 ] 

Charles Allen commented on SPARK-11183:
---

Being able to enable the fetch cache 
(http://mesos.apache.org/documentation/latest/fetcher/) would also be nice.

> enable support for mesos 0.24+
> --
>
> Key: SPARK-11183
> URL: https://issues.apache.org/jira/browse/SPARK-11183
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Reporter: Ioannis Polyzos
>
> In Mesos 0.24, the Mesos leader info in ZK changed to JSON; this results in 
> Spark failing to run on 0.24+.
> References : 
>   https://issues.apache.org/jira/browse/MESOS-2340 
>   
> http://mail-archives.apache.org/mod_mbox/mesos-commits/201506.mbox/%3ced4698dc56444bcdac3bdf19134db...@git.apache.org%3E
>   https://github.com/mesos/elasticsearch/issues/338
>   https://github.com/spark-jobserver/spark-jobserver/issues/267



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-06-08 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320879#comment-15320879
 ] 

Miao Wang commented on SPARK-15784:
---

I can work on this. Thanks!

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15812) Allow sorting on aggregated streaming dataframe when the output mode is Complete

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15812:


Assignee: Apache Spark  (was: Tathagata Das)

> Allow sorting on aggregated streaming dataframe when the output mode is 
> Complete
> 
>
> Key: SPARK-15812
> URL: https://issues.apache.org/jira/browse/SPARK-15812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> When the output mode is complete, then the output of a streaming aggregation 
> essentially will contain the complete aggregates every time. So this is not 
> different from a batch dataset within an incremental execution. Other 
> non-streaming operations should be supported on this dataset. In this JIRA, 
> we are just adding support for sorting, as it is a common useful 
> functionality. Support for other operations will come later.
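
A hedged sketch of the kind of query this should enable once it lands (PySpark; 
the socket source, host, and port are placeholders of my choosing).

{code}
# Sketch of the query shape this change is meant to allow (assumes the patch
# described above). Sorting the aggregated result is only meaningful in
# Complete mode, where every trigger emits the full aggregate table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("complete-mode-sort").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")   # placeholder source
         .option("port", 9999)
         .load())

counts = lines.groupBy("value").count()

# The orderBy below is the operation this ticket adds support for; per the
# ticket it remains unsupported in other output modes.
query = (counts.orderBy(col("count").desc())
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
{code}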



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15812) Allow sorting on aggregated streaming dataframe when the output mode is Complete

2016-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320908#comment-15320908
 ] 

Apache Spark commented on SPARK-15812:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/13549

> Allow sorting on aggregated streaming dataframe when the output mode is 
> Complete
> 
>
> Key: SPARK-15812
> URL: https://issues.apache.org/jira/browse/SPARK-15812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> When the output mode is complete, then the output of a streaming aggregation 
> essentially will contain the complete aggregates every time. So this is not 
> different from a batch dataset within an incremental execution. Other 
> non-streaming operations should be supported on this dataset. In this JIRA, 
> we are just adding support for sorting, as it is a common useful 
> functionality. Support for other operations will come later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15812) Allow sorting on aggregated streaming dataframe when the output mode is Complete

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15812:


Assignee: Tathagata Das  (was: Apache Spark)

> Allow sorting on aggregated streaming dataframe when the output mode is 
> Complete
> 
>
> Key: SPARK-15812
> URL: https://issues.apache.org/jira/browse/SPARK-15812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> When the output mode is complete, then the output of a streaming aggregation 
> essentially will contain the complete aggregates every time. So this is not 
> different from a batch dataset within an incremental execution. Other 
> non-streaming operations should be supported on this dataset. In this JIRA, 
> we are just adding support for sorting, as it is a common useful 
> functionality. Support for other operations will come later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11183) enable support for mesos 0.24+

2016-06-08 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320914#comment-15320914
 ] 

Charles Allen commented on SPARK-11183:
---

Eventually it could be worth adopting something like 
https://github.com/mesosphere/mesos-rxjava to plug into the Mesos cluster.

> enable support for mesos 0.24+
> --
>
> Key: SPARK-11183
> URL: https://issues.apache.org/jira/browse/SPARK-11183
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Reporter: Ioannis Polyzos
>
> In Mesos 0.24, the Mesos leader info in ZK changed to JSON; this results in 
> Spark failing to run on 0.24+.
> References : 
>   https://issues.apache.org/jira/browse/MESOS-2340 
>   
> http://mail-archives.apache.org/mod_mbox/mesos-commits/201506.mbox/%3ced4698dc56444bcdac3bdf19134db...@git.apache.org%3E
>   https://github.com/mesos/elasticsearch/issues/338
>   https://github.com/spark-jobserver/spark-jobserver/issues/267



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15046) When running hive-thriftserver with yarn on a secure cluster the workers fail with java.lang.NumberFormatException

2016-06-08 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-15046:
---
Target Version/s: 2.0.0
Priority: Blocker  (was: Major)
 Component/s: YARN

Marking as blocker since this is a regression.

> When running hive-thriftserver with yarn on a secure cluster the workers fail 
> with java.lang.NumberFormatException
> --
>
> Key: SPARK-15046
> URL: https://issues.apache.org/jira/browse/SPARK-15046
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Trystan Leftwich
>Priority: Blocker
>
> When running hive-thriftserver with yarn on a secure cluster 
> (spark.yarn.principal and spark.yarn.keytab are set) the workers fail with 
> the following error.
> {code}
> 16/04/30 22:40:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
> java.lang.NumberFormatException: For input string: "86400079ms"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:441)
>   at java.lang.Long.parseLong(Long.java:483)
>   at 
> scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
>   at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.SparkConf.getLong(SparkConf.scala:380)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.getTimeFromNowToRenewal(SparkHadoopUtil.scala:289)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.org$apache$spark$deploy$yarn$AMDelegationTokenRenewer$$scheduleRenewal$1(AMDelegationTokenRenewer.scala:89)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.scheduleLoginFromKeytab(AMDelegationTokenRenewer.scala:121)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:723)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:721)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:748)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}
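
The immediate trigger in the trace above is a duration value carrying a time-unit 
suffix ({{"86400079ms"}}) being handed to a plain long parser. Purely as an 
illustration of the difference (not Spark code; the unit table is my own), a 
small sketch:

{code}
# Illustration only: why the value in the stack trace blows up when parsed as
# a bare long, and what a unit-aware parse of the same string looks like.
import re

UNITS_MS = {"ms": 1, "s": 1000, "m": 60000, "h": 3600000, "d": 86400000}


def parse_as_long(s):
    return int(s)  # int("86400079ms") raises, analogous to the
                   # NumberFormatException in the trace above


def parse_duration_ms(s):
    m = re.fullmatch(r"(\d+)\s*([a-z]*)", s.strip())
    if not m:
        raise ValueError("bad duration: %r" % s)
    value, unit = m.groups()
    # Unknown or missing suffixes fall back to milliseconds in this toy parser.
    return int(value) * UNITS_MS.get(unit or "ms", 1)


if __name__ == "__main__":
    print(parse_duration_ms("86400079ms"))  # 86400079
    try:
        parse_as_long("86400079ms")
    except ValueError as e:
        print("plain parse fails:", e)
{code}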



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15825) sort-merge-join gives invalid results when joining on a tupled key

2016-06-08 Thread Andres Perez (JIRA)
Andres Perez created SPARK-15825:


 Summary: sort-merge-join gives invalid results when joining on a 
tupled key
 Key: SPARK-15825
 URL: https://issues.apache.org/jira/browse/SPARK-15825
 Project: Spark
  Issue Type: Bug
  Components: SQL
 Environment: spark 2.0.0-SNAPSHOT
Reporter: Andres Perez


{noformat}
  import org.apache.spark.sql.functions
  val left = List("0", "1", "2").toDS()
.map{ k => ((k, 0), "l") }

  val right = List("0", "1", "2").toDS()
.map{ k => ((k, 0), "r") }

  val result = left.toDF("k", "v").as[((String, Int), String)].alias("left")
.joinWith(right.toDF("k", "v").as[((String, Int), String)].alias("right"), 
functions.col("left.k") === functions.col("right.k"), "inner")
.as[(((String, Int), String), ((String, Int), String))]
{noformat}

When broadcast joins are enabled, we get the expected output:

{noformat}
(((0,0),l),((0,0),r))
(((1,0),l),((1,0),r))
(((2,0),l),((2,0),r))
{noformat}

However, when broadcast joins are disabled (i.e. setting 
spark.sql.autoBroadcastJoinThreshold to -1), the result is incorrect:

{noformat}
(((2,0),l),((2,-1),))
(((0,0),l),((0,-313907893),))
(((1,0),l),((null,-313907893),))
{noformat}
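
For anyone chasing this from PySpark, here is a sketch (not the original Scala 
repro above; simple string keys and illustrative column names) of toggling the 
config mentioned in the description so the broadcast and sort-merge paths can be 
compared via their plans:

{code}
# Sketch: compare the broadcast join path (default threshold) against the
# sort-merge path (threshold disabled, as in the report) using explain().
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("smj-repro").getOrCreate()

left = spark.createDataFrame([("0", "l"), ("1", "l"), ("2", "l")], ["k", "v"])
right = spark.createDataFrame([("0", "r"), ("1", "r"), ("2", "r")], ["k", "w"])

left.join(right, "k", "inner").explain()   # small inputs: typically a broadcast join

# Disable broadcast joins to force the sort-merge path.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
left.join(right, "k", "inner").explain()   # now a sort-merge join
{code}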



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2016-06-08 Thread Sandeep (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321116#comment-15321116
 ] 

Sandeep commented on SPARK-2984:


Can this bug be reopened, please? I am seeing the issue with Spark 1.6.1 on 
AWS as well.
Caused by: java.io.FileNotFoundException: File 
s3n://xxx/_temporary/0/task_201606080516_0004_m_79 does not exist.
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:506)
  at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
  at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
  at 
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
  at 
org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
  at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
  ... 42 more


> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on an executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist!  exception
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> -- Che

[jira] [Commented] (SPARK-14485) Task finished cause fetch failure when its executor has already been removed by driver

2016-06-08 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321177#comment-15321177
 ] 

Kay Ousterhout commented on SPARK-14485:


I commented on the pull request, but want to continue the discussion here for 
archiving purposes.

My understanding is that this pull request fixes the following sequence of 
events:
(1) A task completes on an executor
(2) The executor fails
(3) The scheduler is notified about the task completing.
(4) A future stage that depends on the task runs, and fails, because the 
executor where the data was stored has failed.

With the proposed pull request, in step (3), the scheduler ignores the update, 
because it came from a failed executor.

I don't think we should do this for a few reasons:

(a) If the task didn't have a result stored on the executor (e.g., it computed 
some result on the RDD that it sent directly back to the master, like counting 
the elements in the RDD), it doesn't need to be failed, and can complete 
successfully.  With this change, we'd unnecessarily re-run the task.
(b) If the task did have an IndirectTaskResult (where it was too big to be sent 
directly to the master), the TaskResultGetter will fail to get the task result, 
and the task will be marked as failed.  This already worked correctly with the 
old code (AFAIK).
(c) This change is attempting to fix a third case, where the task had shuffle 
data that's now inaccessible because the machine had died.  I don't think it 
makes sense to fix this, because you can imagine a slight change in timing that 
causes the order of (2) and (3) above to be swapped.  In this case, even with 
the proposed code change, we're still stuck with the fetch failure and 
re-running the map stage.  Furthermore, it's possible (and likely!) that there 
were other map tasks that ran on the failed executor, and those tasks won't be 
failed and re-run with this change, so the reduce stage will still fail.  In 
general, the reason we have the fetch failure mechanism is because it can 
happen that shuffle data gets lost, and rather than detecting every kind of 
map-side failure, it's simpler to fail on the reduce side and then re-run the 
necessary tasks in the map stage.

Given all of the above, I'd advocate for reverting this change and marking the 
JIRA as won't fix.  [~vanzin] [~iward] let me know what your thoughts are. 

> Task finished cause fetch failure when its executor has already been removed 
> by driver 
> ---
>
> Key: SPARK-14485
> URL: https://issues.apache.org/jira/browse/SPARK-14485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.5.2
>Reporter: iward
>Assignee: iward
> Fix For: 2.0.0
>
>
> Currently, when an executor is removed by the driver because of a heartbeat 
> timeout, the driver re-queues the tasks on that executor and sends a kill 
> command to the cluster to kill the executor.
> However, there is a situation where a running task on that executor finishes 
> and returns its result to the driver before the executor is killed by the 
> driver's kill command. In that case the driver accepts the task-finished event 
> and ignores the speculative and re-queued copies of the task. But since the 
> executor has already been removed by the driver, the result of the finished 
> task cannot be saved on the driver, because its *BlockManagerId* has also been 
> removed from *BlockManagerMaster*. So the result data of the stage is 
> incomplete, which then causes a fetch failure.
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
> executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
> 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
> 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
> tasks for 322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 
> 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
> ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
> 322 (epoch 11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
> Trying to remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 
> 322 successfully in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
> 229.0 in stage 107.0 (TID 10384) in 272315 ms on 
> BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/1

[jira] [Created] (SPARK-15826) PipedRDD relies on JVM default encoding

2016-06-08 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-15826:
---

 Summary: PipedRDD relies on JVM default encoding
 Key: SPARK-15826
 URL: https://issues.apache.org/jira/browse/SPARK-15826
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Tejas Patil
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15826) PipedRDD relies on JVM default encoding

2016-06-08 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-15826:

Description: 
Encountered an issue wherein the code works in some cluster but fails on 
another one for the same input. After debugging realised that PipedRDD is 
picking default char encoding from the JVM which may be different across 
different platforms. 

Making it use UTF-8 encoding just like `ScriptTransformation` does
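
The actual fix belongs in Spark's Scala PipedRDD, but as a standalone illustration 
of the principle (explicit UTF-8 instead of the platform default; assumes a 
Unix-like environment with {{cat}} on the PATH), a small Python sketch:

{code}
# Standalone illustration of the principle behind this ticket (not Spark code):
# read a child process's output with an explicit UTF-8 decoder instead of
# whatever the platform default happens to be.
import subprocess


def pipe_lines(cmd, lines):
    """Feed `lines` to `cmd` and return its stdout lines, decoded as UTF-8."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = proc.communicate(("\n".join(lines) + "\n").encode("utf-8"))
    # Decoding explicitly: the same bytes give the same strings on every
    # platform, which is what pinning PipedRDD to UTF-8 buys.
    return out.decode("utf-8").splitlines()


if __name__ == "__main__":
    print(pipe_lines(["cat"], ["caf\u00e9", "\u00fcber"]))
{code}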

> PipedRDD relies on JVM default encoding
> ---
>
> Key: SPARK-15826
> URL: https://issues.apache.org/jira/browse/SPARK-15826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Tejas Patil
>Priority: Trivial
>
> Encountered an issue wherein the code works in some cluster but fails on 
> another one for the same input. After debugging realised that PipedRDD is 
> picking default char encoding from the JVM which may be different across 
> different platforms. 
> Making it use UTF-8 encoding just like `ScriptTransformation` does



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15826) PipedRDD to strictly use UTF-8 and not rely on default encoding

2016-06-08 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-15826:

Summary: PipedRDD to strictly use UTF-8 and not rely on default encoding  
(was: PipedRDD relies on JVM default encoding)

> PipedRDD to strictly use UTF-8 and not rely on default encoding
> ---
>
> Key: SPARK-15826
> URL: https://issues.apache.org/jira/browse/SPARK-15826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Tejas Patil
>Priority: Trivial
>
> Encountered an issue wherein the code works in some cluster but fails on 
> another one for the same input. After debugging realised that PipedRDD is 
> picking default char encoding from the JVM which may be different across 
> different platforms. 
> Making it use UTF-8 encoding just like `ScriptTransformation` does



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15826) PipedRDD to strictly use UTF-8 and not rely on default encoding

2016-06-08 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-15826:

Description: 
Encountered an issue wherein the code works in some cluster but fails on 
another one for the same input. After debugging realised that PipedRDD is 
picking default char encoding from the JVM which may be different across 
different platforms. Making it use UTF-8 encoding just like 
`ScriptTransformation` does.

Stack trace:
{noformat}
Caused by: java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at 
scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:185)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1612)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
at 
org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868)
at 
org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

  was:
Encountered an issue wherein the code works in some cluster but fails on 
another one for the same input. After debugging realised that PipedRDD is 
picking default char encoding from the JVM which may be different across 
different platforms. 

Making it use UTF-8 encoding just like `ScriptTransformation` does


> PipedRDD to strictly use UTF-8 and not rely on default encoding
> ---
>
> Key: SPARK-15826
> URL: https://issues.apache.org/jira/browse/SPARK-15826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Tejas Patil
>Priority: Trivial
>
> Encountered an issue wherein the code works in some cluster but fails on 
> another one for the same input. After debugging realised that PipedRDD is 
> picking default char encoding from the JVM which may be different across 
> different platforms. Making it use UTF-8 encoding just like 
> `ScriptTransformation` does.
> Stack trace:
> {noformat}
> Caused by: java.nio.charset.MalformedInputException: Input length = 1
>   at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.readLine(BufferedReader.java:324)
>   at java.io.BufferedReader.readLine(BufferedReader.java:389)
>   at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
>   at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:185)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1612)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.ap

[jira] [Assigned] (SPARK-15826) PipedRDD to strictly use UTF-8 and not rely on default encoding

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15826:


Assignee: (was: Apache Spark)

> PipedRDD to strictly use UTF-8 and not rely on default encoding
> ---
>
> Key: SPARK-15826
> URL: https://issues.apache.org/jira/browse/SPARK-15826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Tejas Patil
>Priority: Trivial
>
> Encountered an issue wherein the code works in some cluster but fails on 
> another one for the same input. After debugging realised that PipedRDD is 
> picking default char encoding from the JVM which may be different across 
> different platforms. Making it use UTF-8 encoding just like 
> `ScriptTransformation` does.
> Stack trace:
> {noformat}
> Caused by: java.nio.charset.MalformedInputException: Input length = 1
>   at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.readLine(BufferedReader.java:324)
>   at java.io.BufferedReader.readLine(BufferedReader.java:389)
>   at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
>   at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:185)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1612)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15826) PipedRDD to strictly use UTF-8 and not rely on default encoding

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15826:


Assignee: Apache Spark

> PipedRDD to strictly use UTF-8 and not rely on default encoding
> ---
>
> Key: SPARK-15826
> URL: https://issues.apache.org/jira/browse/SPARK-15826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Tejas Patil
>Assignee: Apache Spark
>Priority: Trivial
>
> Encountered an issue wherein the code works in some cluster but fails on 
> another one for the same input. After debugging realised that PipedRDD is 
> picking default char encoding from the JVM which may be different across 
> different platforms. Making it use UTF-8 encoding just like 
> `ScriptTransformation` does.
> Stack trace:
> {noformat}
> Caused by: java.nio.charset.MalformedInputException: Input length = 1
>   at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.readLine(BufferedReader.java:324)
>   at java.io.BufferedReader.readLine(BufferedReader.java:389)
>   at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
>   at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:185)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1612)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14485) Task finished cause fetch failure when its executor has already been removed by driver

2016-06-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321202#comment-15321202
 ] 

Marcelo Vanzin commented on SPARK-14485:


I commented on the PR, but will mostly repeat it here.

I think your point (a) is valid. It should be a rare case, though, so I don't 
feel strongly one way or another about it.

The change helps (b) because it can avoid unnecessary logs to the output. It's 
a minor issue, I grant you that.

Similarly, the change helps (c) by avoiding log noise in certain situations. 
The race you mention does exist, but users get really antsy when they see 
exceptions in logs, so if we can help avoid those it's always good.

So to me it boils down to case (a): we can revert the change and live with the 
extra noise in the logs, we can add code to handle that case and still clean 
the logs, or we can live with the added inefficiency which, in my view, should 
be hit pretty rarely.


> Task finished cause fetch failure when its executor has already been removed 
> by driver 
> ---
>
> Key: SPARK-14485
> URL: https://issues.apache.org/jira/browse/SPARK-14485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.5.2
>Reporter: iward
>Assignee: iward
> Fix For: 2.0.0
>
>
> Currently, when an executor is removed by the driver because of a heartbeat 
> timeout, the driver re-queues the tasks on that executor and sends a kill 
> command to the cluster to kill the executor.
> However, there is a situation where a running task on that executor finishes 
> and returns its result to the driver before the executor is killed by the 
> driver's kill command. In that case the driver accepts the task-finished event 
> and ignores the speculative and re-queued copies of the task. But since the 
> executor has already been removed by the driver, the result of the finished 
> task cannot be saved on the driver, because its *BlockManagerId* has also been 
> removed from *BlockManagerMaster*. So the result data of the stage is 
> incomplete, which then causes a fetch failure.
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
> executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
> 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
> 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
> tasks for 322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 
> 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
> ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
> 322 (epoch 11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
> Trying to remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 
> 322 successfully in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
> 229.0 in stage 107.0 (TID 10384) in 272315 ms on 
> BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring 
> task-finished event for 229.1 in stage 107.0 because task 229 has already 
> completed successfully
> {noformat}
> {noformat}
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 
> missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at 
> mapPartitions at Exchange.scala:137)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task 
> set 107.1 with 3 tasks
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 
> 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> {noformat}
> The driver detects that the stage's output is incomplete and submits the 
> missing tasks, but by this time the next stage has already started, because 
> the previous stage was marked finished once all of its tasks completed, even 
> though its output is incomplete.
> {noformat}
> 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 
> 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): 
> FetchFailed(null, shuffleId=11, mapId=-1,

[jira] [Commented] (SPARK-15826) PipedRDD to strictly use UTF-8 and not rely on default encoding

2016-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321203#comment-15321203
 ] 

Apache Spark commented on SPARK-15826:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/13563

> PipedRDD to strictly use UTF-8 and not rely on default encoding
> ---
>
> Key: SPARK-15826
> URL: https://issues.apache.org/jira/browse/SPARK-15826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Tejas Patil
>Priority: Trivial
>
> Encountered an issue wherein the same code works on one cluster but fails on 
> another for the same input. After debugging, I realised that PipedRDD picks up 
> the default character encoding from the JVM, which may differ across 
> platforms. This change makes it use UTF-8 encoding, just like 
> `ScriptTransformation` does.
> Stack trace:
> {noformat}
> Caused by: java.nio.charset.MalformedInputException: Input length = 1
>   at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.readLine(BufferedReader.java:324)
>   at java.io.BufferedReader.readLine(BufferedReader.java:389)
>   at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
>   at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:185)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1612)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
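A minimal Scala sketch of the general technique described above (decoding a 
child process's output with an explicit charset instead of the JVM default); 
the `cat` command and the surrounding setup are placeholders, not the actual 
PipedRDD change:

{code}
import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets

// Placeholder child process; PipedRDD would launch the user-supplied command.
val proc = new ProcessBuilder("cat", "/etc/hosts").start()

// Decode the child's stdout explicitly as UTF-8 rather than relying on the
// platform default encoding, which can differ between clusters.
val reader = new BufferedReader(
  new InputStreamReader(proc.getInputStream, StandardCharsets.UTF_8))
Iterator.continually(reader.readLine()).takeWhile(_ != null).foreach(println)
reader.close()
{code}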



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15827) Publish Spark's forked sbt-pom-reader to Maven Central

2016-06-08 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-15827:
--

 Summary: Publish Spark's forked sbt-pom-reader to Maven Central
 Key: SPARK-15827
 URL: https://issues.apache.org/jira/browse/SPARK-15827
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Josh Rosen
Assignee: Josh Rosen


Spark's SBT build currently uses a fork of the sbt-pom-reader plugin but 
depends on that fork via an SBT subproject which is cloned from 
https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This 
unnecessarily slows down the initial build on fresh machines and is also 
fragile: the build would break if that GitHub repository ever changed or were 
deleted.

In order to address these issues, I propose to publish a pre-built binary of 
our forked sbt-pom-reader plugin to Maven Central under the org.spark-project 
namespace.
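For illustration, consuming the published plugin from project/plugins.sbt would 
then look roughly like the sketch below; the group ID, artifact ID and version 
are placeholders, not the final published coordinates:

{code}
// Hypothetical coordinates: the real artifact ID and version would be decided
// when the fork is actually released to Maven Central.
addSbtPlugin("org.spark-project" % "sbt-pom-reader" % "1.0.0-spark")
{code}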



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14485) Task finished cause fetch failure when its executor has already been removed by driver

2016-06-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321238#comment-15321238
 ] 

Marcelo Vanzin commented on SPARK-14485:


Just to distill my comment a little further: without writing extra code there 
will be recomputation either way. The old code would cause the downstream task 
to fail, while the new code causes the original task to be recomputed. I prefer 
the new approach because it avoids noise in the logs, but either way works.

> Task finished cause fetch failure when its executor has already been removed 
> by driver 
> ---
>
> Key: SPARK-14485
> URL: https://issues.apache.org/jira/browse/SPARK-14485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.5.2
>Reporter: iward
>Assignee: iward
> Fix For: 2.0.0
>
>
> Currently, when an executor is removed by the driver because of a heartbeat 
> timeout, the driver re-queues the tasks that were running on that executor and 
> sends a kill command to the cluster to kill the executor.
> However, a running task on that executor may finish and return its result to 
> the driver before the executor is actually killed. In that case, the driver 
> accepts the task-finished event and ignores the speculative and re-queued 
> copies of the task. But since the executor has already been removed by the 
> driver, the result of the finished task cannot be saved on the driver, because 
> its *BlockManagerId* has also been removed from *BlockManagerMaster*. As a 
> result, the output of this stage is incomplete, which later causes a fetch 
> failure.
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
> executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
> 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
> 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
> tasks for 322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 
> 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
> ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
> 322 (epoch 11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
> Trying to remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 
> 322 successfully in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
> 229.0 in stage 107.0 (TID 10384) in 272315 ms on 
> BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring 
> task-finished event for 229.1 in stage 107.0 because task 229 has already 
> completed successfully
> {noformat}
> {noformat}
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 
> missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at 
> mapPartitions at Exchange.scala:137)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task 
> set 107.1 with 3 tasks
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 
> 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> {noformat}
> The driver detects that the stage's output is incomplete and submits the 
> missing tasks, but by this time the next stage has already started, because 
> the previous stage was marked finished once all of its tasks completed, even 
> though its output is incomplete.
> {noformat}
> 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 
> 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): 
> FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message=
> 2015-12-31 04:40:13 INFO 
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle 11
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(M

[jira] [Commented] (SPARK-15783) Fix more flakiness: o.a.s.scheduler.BlacklistIntegrationSuite

2016-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321247#comment-15321247
 ] 

Apache Spark commented on SPARK-15783:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/13565

> Fix more flakiness: o.a.s.scheduler.BlacklistIntegrationSuite
> -
>
> Key: SPARK-15783
> URL: https://issues.apache.org/jira/browse/SPARK-15783
> Project: Spark
>  Issue Type: Test
>Reporter: Imran Rashid
>Priority: Minor
>
> Looks like SPARK-15714 didn't address all the sources of flakiness. First 
> I'll turn the test off to stop it breaking builds, then I'll try to fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15827) Publish Spark's forked sbt-pom-reader to Maven Central

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15827:


Assignee: Josh Rosen  (was: Apache Spark)

> Publish Spark's forked sbt-pom-reader to Maven Central
> --
>
> Key: SPARK-15827
> URL: https://issues.apache.org/jira/browse/SPARK-15827
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark's SBT build currently uses a fork of the sbt-pom-reader plugin but 
> depends on that fork via an SBT subproject which is cloned from 
> https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This 
> unnecessarily slows down the initial build on fresh machines and is also 
> fragile: the build would break if that GitHub repository ever changed or were 
> deleted.
> In order to address these issues, I propose to publish a pre-built binary of 
> our forked sbt-pom-reader plugin to Maven Central under the org.spark-project 
> namespace.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15678) Not use cache on appends and overwrites

2016-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321249#comment-15321249
 ] 

Apache Spark commented on SPARK-15678:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/13566

> Not use cache on appends and overwrites
> ---
>
> Key: SPARK-15678
> URL: https://issues.apache.org/jira/browse/SPARK-15678
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Sameer Agarwal
>
> SparkSQL currently doesn't drop caches if the underlying data is overwritten.
> {code}
> val dir = "/tmp/test"
> sqlContext.range(1000).write.mode("overwrite").parquet(dir)
> val df = sqlContext.read.parquet(dir).cache()
> df.count() // outputs 1000
> sqlContext.range(10).write.mode("overwrite").parquet(dir)
> sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 <-- we 
> are still using the cached dataset
> {code}
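A possible stopgap, assuming the stale entry is matched by plan as described 
above (this is not the approach taken in the linked pull request), is to drop 
the cached data explicitly before re-reading the overwritten path:

{code}
// Drop the stale cache entry first; sqlContext.clearCache() would drop every
// cached dataset, while df.unpersist() drops only the one cached above.
df.unpersist()
sqlContext.read.parquet(dir).count() // should now re-scan the files on disk
{code}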



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15827) Publish Spark's forked sbt-pom-reader to Maven Central

2016-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321248#comment-15321248
 ] 

Apache Spark commented on SPARK-15827:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/13564

> Publish Spark's forked sbt-pom-reader to Maven Central
> --
>
> Key: SPARK-15827
> URL: https://issues.apache.org/jira/browse/SPARK-15827
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark's SBT build currently uses a fork of the sbt-pom-reader plugin but 
> depends on that fork via an SBT subproject which is cloned from 
> https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This 
> unnecessarily slows down the initial build on fresh machines and is also 
> fragile: the build would break if that GitHub repository ever changed or were 
> deleted.
> In order to address these issues, I propose to publish a pre-built binary of 
> our forked sbt-pom-reader plugin to Maven Central under the org.spark-project 
> namespace.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15827) Publish Spark's forked sbt-pom-reader to Maven Central

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15827:


Assignee: Apache Spark  (was: Josh Rosen)

> Publish Spark's forked sbt-pom-reader to Maven Central
> --
>
> Key: SPARK-15827
> URL: https://issues.apache.org/jira/browse/SPARK-15827
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Spark's SBT build currently uses a fork of the sbt-pom-reader plugin but 
> depends on that fork via an SBT subproject which is cloned from 
> https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This 
> unnecessarily slows down the initial build on fresh machines and is also 
> fragile: the build would break if that GitHub repository ever changed or were 
> deleted.
> In order to address these issues, I propose to publish a pre-built binary of 
> our forked sbt-pom-reader plugin to Maven Central under the org.spark-project 
> namespace.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11765) Avoid assign UI port between browser unsafe ports (or just 4045: lockd)

2016-06-08 Thread Willy Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321311#comment-15321311
 ] 

Willy Lee commented on SPARK-11765:
---

So users should pick their own ports and hope they aren't already in use? It 
seems easier to just start from 4140 and increment; the next port blocked by 
WebKit/Chrome/etc. isn't until 6000.

> Avoid assign UI port between browser unsafe ports (or just 4045: lockd)
> ---
>
> Key: SPARK-11765
> URL: https://issues.apache.org/jira/browse/SPARK-11765
> Project: Spark
>  Issue Type: Improvement
>Reporter: Jungtaek Lim
>Priority: Minor
>
> The Spark UI port starts at 4040 and is incremented by 1 on every conflict.
> In our use case, we have several drivers running at the same time, so the UI 
> port can end up assigned to 4045, which Chrome and Mozilla treat as an unsafe 
> port.
> http://src.chromium.org/viewvc/chrome/trunk/src/net/base/net_util.cc?view=markup
> http://www-archive.mozilla.org/projects/netlib/PortBanning.html#portlist
> We would like to avoid assigning the UI to these ports, or at least avoid 
> assigning it to 4045, which is too close to the default port.
> If this idea is acceptable, I'm happy to work on it.
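For reference, a per-application workaround (separate from the proposal above) 
is to start the port search above the browser-unsafe range via 
spark-defaults.conf:

{code}
# Start probing from 4140 instead of 4040; Spark still increments on conflict.
spark.ui.port  4140
{code}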



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12712) test-dependencies.sh fails with difference in manifests

2016-06-08 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reopened SPARK-12712:

  Assignee: Josh Rosen

I've managed to reproduce this problem in one of my own CI environments. This 
problem is triggered when Spark's test-dependencies script runs with an 
initially-empty .m2 cache: extra log output from downloading dependencies 
breaks the regex used in this script.

I have a fix for this and will open a PR shortly.
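As a generic illustration only (an assumption about the kind of filtering 
needed, not the actual patch), Maven's artifact-download messages can be 
stripped from captured output before it is parsed:

{code}
# Run Maven in batch mode and drop download noise so that an empty .m2 cache
# does not leak extra lines into the output being parsed. The goal shown here
# is only an example, not necessarily the one the script invokes.
mvn -B dependency:build-classpath 2>&1 | grep -v -E 'Download(ing|ed)' > deps.txt
{code}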

> test-dependencies.sh fails with difference in manifests
> ---
>
> Key: SPARK-12712
> URL: https://issues.apache.org/jira/browse/SPARK-12712
> Project: Spark
>  Issue Type: Bug
>Reporter: Stavros Kontopoulos
>Assignee: Josh Rosen
>
> The test-dependencies.sh script fails.
> This relates to https://github.com/apache/spark/pull/10461
> Check the failure here:
> https://ci.typesafe.com/job/ghprb-spark-multi-conf/label=Spark-Ora-JDK7-PV,scala_version=2.10/84/console
> My PR does not change dependencies, so shouldn't the PR manifest be generated 
> with the full dependency list? It seems empty. Should I use --replace-manifest?
> Reproducing it locally on that Jenkins instance I get this:
> Spark's published dependencies DO NOT MATCH the manifest file 
> (dev/spark-deps).
> To update the manifest file, run './dev/test-dependencies.sh 
> --replace-manifest'.
> diff --git a/dev/deps/spark-deps-hadoop-2.6 
> b/dev/pr-deps/spark-deps-hadoop-2.6
> index e703c7a..3aa2c38 100644
> --- a/dev/deps/spark-deps-hadoop-2.6
> +++ b/dev/pr-deps/spark-deps-hadoop-2.6
> @@ -1,190 +1,2 @@
> -JavaEWAH-0.3.2.jar
> -RoaringBitmap-0.5.11.jar
> -ST4-4.0.4.jar
> -activation-1.1.1.jar
> -akka-actor_2.10-2.3.11.jar
> -akka-remote_2.10-2.3.11.jar
> -akka-slf4j_2.10-2.3.11.jar
> -antlr-runtime-3.5.2.jar
> -aopalliance-1.0.jar
> -apache-log4j-extras-1.2.17.jar
> -apacheds-i18n-2.0.0-M15.jar
> -apacheds-kerberos-codec-2.0.0-M15.jar
> -api-asn1-api-1.0.0-M20.jar
> -api-util-1.0.0-M20.jar
> -arpack_combined_all-0.1.jar
> -asm-3.1.jar
> -asm-commons-3.1.jar
> -asm-tree-3.1.jar
> -avro-1.7.7.jar
> -avro-ipc-1.7.7-tests.jar
> -avro-ipc-1.7.7.jar
> -avro-mapred-1.7.7-hadoop2.jar
> -base64-2.3.8.jar
> -bcprov-jdk15on-1.51.jar
> -bonecp-0.8.0.RELEASE.jar
> -breeze-macros_2.10-0.11.2.jar
> -breeze_2.10-0.11.2.jar
> -calcite-avatica-1.2.0-incubating.jar
> -calcite-core-1.2.0-incubating.jar
> -calcite-linq4j-1.2.0-incubating.jar
> -chill-java-0.5.0.jar
> -chill_2.10-0.5.0.jar
> -commons-beanutils-1.7.0.jar
> -commons-beanutils-core-1.8.0.jar
> -commons-cli-1.2.jar
> -commons-codec-1.10.jar
> -commons-collections-3.2.2.jar
> -commons-compiler-2.7.6.jar
> -commons-compress-1.4.1.jar
> -commons-configuration-1.6.jar
> -commons-dbcp-1.4.jar
> -commons-digester-1.8.jar
> -commons-httpclient-3.1.jar
> -commons-io-2.4.jar
> -commons-lang-2.6.jar
> -commons-lang3-3.3.2.jar
> -commons-logging-1.1.3.jar
> -commons-math3-3.4.1.jar
> -commons-net-2.2.jar
> -commons-pool-1.5.4.jar
> -compress-lzf-1.0.3.jar
> -config-1.2.1.jar
> -core-1.1.2.jar
> -curator-client-2.6.0.jar
> -curator-framework-2.6.0.jar
> -curator-recipes-2.6.0.jar
> -datanucleus-api-jdo-3.2.6.jar
> -datanucleus-core-3.2.10.jar
> -datanucleus-rdbms-3.2.9.jar
> -derby-10.10.1.1.jar
> -eigenbase-properties-1.1.5.jar
> -geronimo-annotation_1.0_spec-1.1.1.jar
> -geronimo-jaspic_1.0_spec-1.0.jar
> -geronimo-jta_1.1_spec-1.1.1.jar
> -groovy-all-2.1.6.jar
> -gson-2.2.4.jar
> -guice-3.0.jar
> -guice-servlet-3.0.jar
> -hadoop-annotations-2.6.0.jar
> -hadoop-auth-2.6.0.jar
> -hadoop-client-2.6.0.jar
> -hadoop-common-2.6.0.jar
> -hadoop-hdfs-2.6.0.jar
> -hadoop-mapreduce-client-app-2.6.0.jar
> -hadoop-mapreduce-client-common-2.6.0.jar
> -hadoop-mapreduce-client-core-2.6.0.jar
> -hadoop-mapreduce-client-jobclient-2.6.0.jar
> -hadoop-mapreduce-client-shuffle-2.6.0.jar
> -hadoop-yarn-api-2.6.0.jar
> -hadoop-yarn-client-2.6.0.jar
> -hadoop-yarn-common-2.6.0.jar
> -hadoop-yarn-server-common-2.6.0.jar
> -hadoop-yarn-server-web-proxy-2.6.0.jar
> -htrace-core-3.0.4.jar
> -httpclient-4.3.2.jar
> -httpcore-4.3.2.jar
> -ivy-2.4.0.jar
> -jackson-annotations-2.4.4.jar
> -jackson-core-2.4.4.jar
> -jackson-core-asl-1.9.13.jar
> -jackson-databind-2.4.4.jar
> -jackson-jaxrs-1.9.13.jar
> -jackson-mapper-asl-1.9.13.jar
> -jackson-module-scala_2.10-2.4.4.jar
> -jackson-xc-1.9.13.jar
> -janino-2.7.8.jar
> -jansi-1.4.jar
> -java-xmlbuilder-1.0.jar
> -javax.inject-1.jar
> -javax.servlet-3.0.0.v201112011016.jar
> -javolution-5.5.1.jar
> -jaxb-api-2.2.2.jar
> -jaxb-impl-2.2.3-1.jar
> -jcl-over-slf4j-1.7.10.jar
> -jdo-api-3.0.1.jar
> -jersey-client-1.9.jar
> -jersey-core-1.9.jar
> -jersey-guice-1.9.jar
> -jersey-json-1.9.jar
> -jersey-server-1.9.jar
> -jets3t-0.9.3.jar
> -jettison-1.1.jar
> -jetty-6.1.26.jar
> -jetty-all-7.6.0.v20120127.jar
> -jetty-util-6.1.26.jar
> -jline-2.10.5.jar
> -jline-

[jira] [Updated] (SPARK-12712) test-dependencies.sh script fails when run against empty .m2 cache

2016-06-08 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12712:
---
Target Version/s: 1.6.2, 2.0.0
 Component/s: Project Infra

> test-dependencies.sh script fails when run against empty .m2 cache
> --
>
> Key: SPARK-12712
> URL: https://issues.apache.org/jira/browse/SPARK-12712
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Stavros Kontopoulos
>Assignee: Josh Rosen
>
> The test-dependencies.sh script fails.
> This relates to https://github.com/apache/spark/pull/10461
> Check the failure here:
> https://ci.typesafe.com/job/ghprb-spark-multi-conf/label=Spark-Ora-JDK7-PV,scala_version=2.10/84/console
> My PR does not change dependencies, so shouldn't the PR manifest be generated 
> with the full dependency list? It seems empty. Should I use --replace-manifest?
> Reproducing it locally on that Jenkins instance I get this:
> Spark's published dependencies DO NOT MATCH the manifest file 
> (dev/spark-deps).
> To update the manifest file, run './dev/test-dependencies.sh 
> --replace-manifest'.
> diff --git a/dev/deps/spark-deps-hadoop-2.6 
> b/dev/pr-deps/spark-deps-hadoop-2.6
> index e703c7a..3aa2c38 100644
> --- a/dev/deps/spark-deps-hadoop-2.6
> +++ b/dev/pr-deps/spark-deps-hadoop-2.6
> @@ -1,190 +1,2 @@
> -JavaEWAH-0.3.2.jar
> -RoaringBitmap-0.5.11.jar
> -ST4-4.0.4.jar
> -activation-1.1.1.jar
> -akka-actor_2.10-2.3.11.jar
> -akka-remote_2.10-2.3.11.jar
> -akka-slf4j_2.10-2.3.11.jar
> -antlr-runtime-3.5.2.jar
> -aopalliance-1.0.jar
> -apache-log4j-extras-1.2.17.jar
> -apacheds-i18n-2.0.0-M15.jar
> -apacheds-kerberos-codec-2.0.0-M15.jar
> -api-asn1-api-1.0.0-M20.jar
> -api-util-1.0.0-M20.jar
> -arpack_combined_all-0.1.jar
> -asm-3.1.jar
> -asm-commons-3.1.jar
> -asm-tree-3.1.jar
> -avro-1.7.7.jar
> -avro-ipc-1.7.7-tests.jar
> -avro-ipc-1.7.7.jar
> -avro-mapred-1.7.7-hadoop2.jar
> -base64-2.3.8.jar
> -bcprov-jdk15on-1.51.jar
> -bonecp-0.8.0.RELEASE.jar
> -breeze-macros_2.10-0.11.2.jar
> -breeze_2.10-0.11.2.jar
> -calcite-avatica-1.2.0-incubating.jar
> -calcite-core-1.2.0-incubating.jar
> -calcite-linq4j-1.2.0-incubating.jar
> -chill-java-0.5.0.jar
> -chill_2.10-0.5.0.jar
> -commons-beanutils-1.7.0.jar
> -commons-beanutils-core-1.8.0.jar
> -commons-cli-1.2.jar
> -commons-codec-1.10.jar
> -commons-collections-3.2.2.jar
> -commons-compiler-2.7.6.jar
> -commons-compress-1.4.1.jar
> -commons-configuration-1.6.jar
> -commons-dbcp-1.4.jar
> -commons-digester-1.8.jar
> -commons-httpclient-3.1.jar
> -commons-io-2.4.jar
> -commons-lang-2.6.jar
> -commons-lang3-3.3.2.jar
> -commons-logging-1.1.3.jar
> -commons-math3-3.4.1.jar
> -commons-net-2.2.jar
> -commons-pool-1.5.4.jar
> -compress-lzf-1.0.3.jar
> -config-1.2.1.jar
> -core-1.1.2.jar
> -curator-client-2.6.0.jar
> -curator-framework-2.6.0.jar
> -curator-recipes-2.6.0.jar
> -datanucleus-api-jdo-3.2.6.jar
> -datanucleus-core-3.2.10.jar
> -datanucleus-rdbms-3.2.9.jar
> -derby-10.10.1.1.jar
> -eigenbase-properties-1.1.5.jar
> -geronimo-annotation_1.0_spec-1.1.1.jar
> -geronimo-jaspic_1.0_spec-1.0.jar
> -geronimo-jta_1.1_spec-1.1.1.jar
> -groovy-all-2.1.6.jar
> -gson-2.2.4.jar
> -guice-3.0.jar
> -guice-servlet-3.0.jar
> -hadoop-annotations-2.6.0.jar
> -hadoop-auth-2.6.0.jar
> -hadoop-client-2.6.0.jar
> -hadoop-common-2.6.0.jar
> -hadoop-hdfs-2.6.0.jar
> -hadoop-mapreduce-client-app-2.6.0.jar
> -hadoop-mapreduce-client-common-2.6.0.jar
> -hadoop-mapreduce-client-core-2.6.0.jar
> -hadoop-mapreduce-client-jobclient-2.6.0.jar
> -hadoop-mapreduce-client-shuffle-2.6.0.jar
> -hadoop-yarn-api-2.6.0.jar
> -hadoop-yarn-client-2.6.0.jar
> -hadoop-yarn-common-2.6.0.jar
> -hadoop-yarn-server-common-2.6.0.jar
> -hadoop-yarn-server-web-proxy-2.6.0.jar
> -htrace-core-3.0.4.jar
> -httpclient-4.3.2.jar
> -httpcore-4.3.2.jar
> -ivy-2.4.0.jar
> -jackson-annotations-2.4.4.jar
> -jackson-core-2.4.4.jar
> -jackson-core-asl-1.9.13.jar
> -jackson-databind-2.4.4.jar
> -jackson-jaxrs-1.9.13.jar
> -jackson-mapper-asl-1.9.13.jar
> -jackson-module-scala_2.10-2.4.4.jar
> -jackson-xc-1.9.13.jar
> -janino-2.7.8.jar
> -jansi-1.4.jar
> -java-xmlbuilder-1.0.jar
> -javax.inject-1.jar
> -javax.servlet-3.0.0.v201112011016.jar
> -javolution-5.5.1.jar
> -jaxb-api-2.2.2.jar
> -jaxb-impl-2.2.3-1.jar
> -jcl-over-slf4j-1.7.10.jar
> -jdo-api-3.0.1.jar
> -jersey-client-1.9.jar
> -jersey-core-1.9.jar
> -jersey-guice-1.9.jar
> -jersey-json-1.9.jar
> -jersey-server-1.9.jar
> -jets3t-0.9.3.jar
> -jettison-1.1.jar
> -jetty-6.1.26.jar
> -jetty-all-7.6.0.v20120127.jar
> -jetty-util-6.1.26.jar
> -jline-2.10.5.jar
> -jline-2.12.jar
> -joda-time-2.9.jar
> -jodd-core-3.5.2.jar
> -jpam-1.1.jar
> -json-20090211.jar
> -json4s-ast_2.10-3.2.10.jar
> -json4s-core_2.10-3.2.10.jar
> -json4s-jackson_2.10-3.2.10.jar
> -jsr305-1.3.9.jar
> -jta-1.1.jar

[jira] [Updated] (SPARK-12712) test-dependencies.sh script fails when run against empty .m2 cache

2016-06-08 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12712:
---
Summary: test-dependencies.sh script fails when run against empty .m2 cache 
 (was: test-dependencies.sh fails with difference in manifests)

> test-dependencies.sh script fails when run against empty .m2 cache
> --
>
> Key: SPARK-12712
> URL: https://issues.apache.org/jira/browse/SPARK-12712
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Stavros Kontopoulos
>Assignee: Josh Rosen
>
> The test-dependencies.sh script fails.
> This relates to https://github.com/apache/spark/pull/10461
> Check the failure here:
> https://ci.typesafe.com/job/ghprb-spark-multi-conf/label=Spark-Ora-JDK7-PV,scala_version=2.10/84/console
> My PR does not change dependencies, so shouldn't the PR manifest be generated 
> with the full dependency list? It seems empty. Should I use --replace-manifest?
> Reproducing it locally on that Jenkins instance I get this:
> Spark's published dependencies DO NOT MATCH the manifest file 
> (dev/spark-deps).
> To update the manifest file, run './dev/test-dependencies.sh 
> --replace-manifest'.
> diff --git a/dev/deps/spark-deps-hadoop-2.6 
> b/dev/pr-deps/spark-deps-hadoop-2.6
> index e703c7a..3aa2c38 100644
> --- a/dev/deps/spark-deps-hadoop-2.6
> +++ b/dev/pr-deps/spark-deps-hadoop-2.6
> @@ -1,190 +1,2 @@
> -JavaEWAH-0.3.2.jar
> -RoaringBitmap-0.5.11.jar
> -ST4-4.0.4.jar
> -activation-1.1.1.jar
> -akka-actor_2.10-2.3.11.jar
> -akka-remote_2.10-2.3.11.jar
> -akka-slf4j_2.10-2.3.11.jar
> -antlr-runtime-3.5.2.jar
> -aopalliance-1.0.jar
> -apache-log4j-extras-1.2.17.jar
> -apacheds-i18n-2.0.0-M15.jar
> -apacheds-kerberos-codec-2.0.0-M15.jar
> -api-asn1-api-1.0.0-M20.jar
> -api-util-1.0.0-M20.jar
> -arpack_combined_all-0.1.jar
> -asm-3.1.jar
> -asm-commons-3.1.jar
> -asm-tree-3.1.jar
> -avro-1.7.7.jar
> -avro-ipc-1.7.7-tests.jar
> -avro-ipc-1.7.7.jar
> -avro-mapred-1.7.7-hadoop2.jar
> -base64-2.3.8.jar
> -bcprov-jdk15on-1.51.jar
> -bonecp-0.8.0.RELEASE.jar
> -breeze-macros_2.10-0.11.2.jar
> -breeze_2.10-0.11.2.jar
> -calcite-avatica-1.2.0-incubating.jar
> -calcite-core-1.2.0-incubating.jar
> -calcite-linq4j-1.2.0-incubating.jar
> -chill-java-0.5.0.jar
> -chill_2.10-0.5.0.jar
> -commons-beanutils-1.7.0.jar
> -commons-beanutils-core-1.8.0.jar
> -commons-cli-1.2.jar
> -commons-codec-1.10.jar
> -commons-collections-3.2.2.jar
> -commons-compiler-2.7.6.jar
> -commons-compress-1.4.1.jar
> -commons-configuration-1.6.jar
> -commons-dbcp-1.4.jar
> -commons-digester-1.8.jar
> -commons-httpclient-3.1.jar
> -commons-io-2.4.jar
> -commons-lang-2.6.jar
> -commons-lang3-3.3.2.jar
> -commons-logging-1.1.3.jar
> -commons-math3-3.4.1.jar
> -commons-net-2.2.jar
> -commons-pool-1.5.4.jar
> -compress-lzf-1.0.3.jar
> -config-1.2.1.jar
> -core-1.1.2.jar
> -curator-client-2.6.0.jar
> -curator-framework-2.6.0.jar
> -curator-recipes-2.6.0.jar
> -datanucleus-api-jdo-3.2.6.jar
> -datanucleus-core-3.2.10.jar
> -datanucleus-rdbms-3.2.9.jar
> -derby-10.10.1.1.jar
> -eigenbase-properties-1.1.5.jar
> -geronimo-annotation_1.0_spec-1.1.1.jar
> -geronimo-jaspic_1.0_spec-1.0.jar
> -geronimo-jta_1.1_spec-1.1.1.jar
> -groovy-all-2.1.6.jar
> -gson-2.2.4.jar
> -guice-3.0.jar
> -guice-servlet-3.0.jar
> -hadoop-annotations-2.6.0.jar
> -hadoop-auth-2.6.0.jar
> -hadoop-client-2.6.0.jar
> -hadoop-common-2.6.0.jar
> -hadoop-hdfs-2.6.0.jar
> -hadoop-mapreduce-client-app-2.6.0.jar
> -hadoop-mapreduce-client-common-2.6.0.jar
> -hadoop-mapreduce-client-core-2.6.0.jar
> -hadoop-mapreduce-client-jobclient-2.6.0.jar
> -hadoop-mapreduce-client-shuffle-2.6.0.jar
> -hadoop-yarn-api-2.6.0.jar
> -hadoop-yarn-client-2.6.0.jar
> -hadoop-yarn-common-2.6.0.jar
> -hadoop-yarn-server-common-2.6.0.jar
> -hadoop-yarn-server-web-proxy-2.6.0.jar
> -htrace-core-3.0.4.jar
> -httpclient-4.3.2.jar
> -httpcore-4.3.2.jar
> -ivy-2.4.0.jar
> -jackson-annotations-2.4.4.jar
> -jackson-core-2.4.4.jar
> -jackson-core-asl-1.9.13.jar
> -jackson-databind-2.4.4.jar
> -jackson-jaxrs-1.9.13.jar
> -jackson-mapper-asl-1.9.13.jar
> -jackson-module-scala_2.10-2.4.4.jar
> -jackson-xc-1.9.13.jar
> -janino-2.7.8.jar
> -jansi-1.4.jar
> -java-xmlbuilder-1.0.jar
> -javax.inject-1.jar
> -javax.servlet-3.0.0.v201112011016.jar
> -javolution-5.5.1.jar
> -jaxb-api-2.2.2.jar
> -jaxb-impl-2.2.3-1.jar
> -jcl-over-slf4j-1.7.10.jar
> -jdo-api-3.0.1.jar
> -jersey-client-1.9.jar
> -jersey-core-1.9.jar
> -jersey-guice-1.9.jar
> -jersey-json-1.9.jar
> -jersey-server-1.9.jar
> -jets3t-0.9.3.jar
> -jettison-1.1.jar
> -jetty-6.1.26.jar
> -jetty-all-7.6.0.v20120127.jar
> -jetty-util-6.1.26.jar
> -jline-2.10.5.jar
> -jline-2.12.jar
> -joda-time-2.9.jar
> -jodd-core-3.5.2.jar
> -jpam-1.1.jar
> -json-20090211.jar
> -json4s-ast_2.10-3.2.10.jar
> -json4s-core_2.10-3.2.1

[jira] [Commented] (SPARK-12712) test-dependencies.sh script fails when run against empty .m2 cache

2016-06-08 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321402#comment-15321402
 ] 

Josh Rosen commented on SPARK-12712:


Minimal local reproduction:

{code}
rm -rf ~/.m2/repository/org/apache/commons/
./dev/test-dependencies.sh
{code}

> test-dependencies.sh script fails when run against empty .m2 cache
> --
>
> Key: SPARK-12712
> URL: https://issues.apache.org/jira/browse/SPARK-12712
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Stavros Kontopoulos
>Assignee: Josh Rosen
>
> The test-dependencies.sh script fails.
> This relates to https://github.com/apache/spark/pull/10461
> Check the failure here:
> https://ci.typesafe.com/job/ghprb-spark-multi-conf/label=Spark-Ora-JDK7-PV,scala_version=2.10/84/console
> My PR does not change dependencies, so shouldn't the PR manifest be generated 
> with the full dependency list? It seems empty. Should I use --replace-manifest?
> Reproducing it locally on that Jenkins instance I get this:
> Spark's published dependencies DO NOT MATCH the manifest file 
> (dev/spark-deps).
> To update the manifest file, run './dev/test-dependencies.sh 
> --replace-manifest'.
> diff --git a/dev/deps/spark-deps-hadoop-2.6 
> b/dev/pr-deps/spark-deps-hadoop-2.6
> index e703c7a..3aa2c38 100644
> --- a/dev/deps/spark-deps-hadoop-2.6
> +++ b/dev/pr-deps/spark-deps-hadoop-2.6
> @@ -1,190 +1,2 @@
> -JavaEWAH-0.3.2.jar
> -RoaringBitmap-0.5.11.jar
> -ST4-4.0.4.jar
> -activation-1.1.1.jar
> -akka-actor_2.10-2.3.11.jar
> -akka-remote_2.10-2.3.11.jar
> -akka-slf4j_2.10-2.3.11.jar
> -antlr-runtime-3.5.2.jar
> -aopalliance-1.0.jar
> -apache-log4j-extras-1.2.17.jar
> -apacheds-i18n-2.0.0-M15.jar
> -apacheds-kerberos-codec-2.0.0-M15.jar
> -api-asn1-api-1.0.0-M20.jar
> -api-util-1.0.0-M20.jar
> -arpack_combined_all-0.1.jar
> -asm-3.1.jar
> -asm-commons-3.1.jar
> -asm-tree-3.1.jar
> -avro-1.7.7.jar
> -avro-ipc-1.7.7-tests.jar
> -avro-ipc-1.7.7.jar
> -avro-mapred-1.7.7-hadoop2.jar
> -base64-2.3.8.jar
> -bcprov-jdk15on-1.51.jar
> -bonecp-0.8.0.RELEASE.jar
> -breeze-macros_2.10-0.11.2.jar
> -breeze_2.10-0.11.2.jar
> -calcite-avatica-1.2.0-incubating.jar
> -calcite-core-1.2.0-incubating.jar
> -calcite-linq4j-1.2.0-incubating.jar
> -chill-java-0.5.0.jar
> -chill_2.10-0.5.0.jar
> -commons-beanutils-1.7.0.jar
> -commons-beanutils-core-1.8.0.jar
> -commons-cli-1.2.jar
> -commons-codec-1.10.jar
> -commons-collections-3.2.2.jar
> -commons-compiler-2.7.6.jar
> -commons-compress-1.4.1.jar
> -commons-configuration-1.6.jar
> -commons-dbcp-1.4.jar
> -commons-digester-1.8.jar
> -commons-httpclient-3.1.jar
> -commons-io-2.4.jar
> -commons-lang-2.6.jar
> -commons-lang3-3.3.2.jar
> -commons-logging-1.1.3.jar
> -commons-math3-3.4.1.jar
> -commons-net-2.2.jar
> -commons-pool-1.5.4.jar
> -compress-lzf-1.0.3.jar
> -config-1.2.1.jar
> -core-1.1.2.jar
> -curator-client-2.6.0.jar
> -curator-framework-2.6.0.jar
> -curator-recipes-2.6.0.jar
> -datanucleus-api-jdo-3.2.6.jar
> -datanucleus-core-3.2.10.jar
> -datanucleus-rdbms-3.2.9.jar
> -derby-10.10.1.1.jar
> -eigenbase-properties-1.1.5.jar
> -geronimo-annotation_1.0_spec-1.1.1.jar
> -geronimo-jaspic_1.0_spec-1.0.jar
> -geronimo-jta_1.1_spec-1.1.1.jar
> -groovy-all-2.1.6.jar
> -gson-2.2.4.jar
> -guice-3.0.jar
> -guice-servlet-3.0.jar
> -hadoop-annotations-2.6.0.jar
> -hadoop-auth-2.6.0.jar
> -hadoop-client-2.6.0.jar
> -hadoop-common-2.6.0.jar
> -hadoop-hdfs-2.6.0.jar
> -hadoop-mapreduce-client-app-2.6.0.jar
> -hadoop-mapreduce-client-common-2.6.0.jar
> -hadoop-mapreduce-client-core-2.6.0.jar
> -hadoop-mapreduce-client-jobclient-2.6.0.jar
> -hadoop-mapreduce-client-shuffle-2.6.0.jar
> -hadoop-yarn-api-2.6.0.jar
> -hadoop-yarn-client-2.6.0.jar
> -hadoop-yarn-common-2.6.0.jar
> -hadoop-yarn-server-common-2.6.0.jar
> -hadoop-yarn-server-web-proxy-2.6.0.jar
> -htrace-core-3.0.4.jar
> -httpclient-4.3.2.jar
> -httpcore-4.3.2.jar
> -ivy-2.4.0.jar
> -jackson-annotations-2.4.4.jar
> -jackson-core-2.4.4.jar
> -jackson-core-asl-1.9.13.jar
> -jackson-databind-2.4.4.jar
> -jackson-jaxrs-1.9.13.jar
> -jackson-mapper-asl-1.9.13.jar
> -jackson-module-scala_2.10-2.4.4.jar
> -jackson-xc-1.9.13.jar
> -janino-2.7.8.jar
> -jansi-1.4.jar
> -java-xmlbuilder-1.0.jar
> -javax.inject-1.jar
> -javax.servlet-3.0.0.v201112011016.jar
> -javolution-5.5.1.jar
> -jaxb-api-2.2.2.jar
> -jaxb-impl-2.2.3-1.jar
> -jcl-over-slf4j-1.7.10.jar
> -jdo-api-3.0.1.jar
> -jersey-client-1.9.jar
> -jersey-core-1.9.jar
> -jersey-guice-1.9.jar
> -jersey-json-1.9.jar
> -jersey-server-1.9.jar
> -jets3t-0.9.3.jar
> -jettison-1.1.jar
> -jetty-6.1.26.jar
> -jetty-all-7.6.0.v20120127.jar
> -jetty-util-6.1.26.jar
> -jline-2.10.5.jar
> -jline-2.12.jar
> -joda-time-2.9.jar
> -jodd-core-3.5.2.jar
> -jpam-1.1.jar
> -json-20090211.jar
> -json4s-ast_2.10-3.2.10.ja

[jira] [Updated] (SPARK-15807) Support varargs for dropDuplicates in Dataset/DataFrame

2016-06-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15807:
--
Description: 
This issue adds varargs-typed `dropDuplicates` functions to 
`Dataset`/`DataFrame`. Currently, `dropDuplicates` accepts only a `Seq` or an 
`Array`.

{code}
scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2)))
ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int]

scala> ds.dropDuplicates(Seq("_1", "_2"))
res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, _2: 
int]

scala> ds.dropDuplicates("_1", "_2")
:26: error: overloaded method value dropDuplicates with alternatives:
  (colNames: 
Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] 
  (colNames: Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] 

  ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 cannot be applied to (String, String)
   ds.dropDuplicates("_1", "_2")
  ^
{code}

  was:
This issue adds `varargs`-types `distinct/dropDuplicates` functions in 
`Dataset/DataFrame`. Currently, `distinct` does not get arguments, and 
`dropDuplicates` supports only `Seq` or `Array`.

{code}
scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2)))
ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int]

scala> ds.dropDuplicates(Seq("_1", "_2"))
res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, _2: 
int]

scala> ds.dropDuplicates("_1", "_2")
:26: error: overloaded method value dropDuplicates with alternatives:
  (colNames: 
Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] 
  (colNames: Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] 

  ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 cannot be applied to (String, String)
   ds.dropDuplicates("_1", "_2")
  ^

scala> ds.distinct("_1", "_2")
:26: error: too many arguments for method distinct: 
()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
   ds.distinct("_1", "_2")
{code}

Summary: Support varargs for dropDuplicates in Dataset/DataFrame  (was: 
Support varargs for distinct/dropDuplicates in Dataset/DataFrame)

> Support varargs for dropDuplicates in Dataset/DataFrame
> ---
>
> Key: SPARK-15807
> URL: https://issues.apache.org/jira/browse/SPARK-15807
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> This issue adds varargs-typed `dropDuplicates` functions to 
> `Dataset`/`DataFrame`. Currently, `dropDuplicates` accepts only a `Seq` or an 
> `Array`.
> {code}
> scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2)))
> ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
> scala> ds.dropDuplicates(Seq("_1", "_2"))
> res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, 
> _2: int]
> scala> ds.dropDuplicates("_1", "_2")
> :26: error: overloaded method value dropDuplicates with alternatives:
>   (colNames: 
> Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] 
>   (colNames: 
> Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] 
>   ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
>  cannot be applied to (String, String)
>ds.dropDuplicates("_1", "_2")
>   ^
> {code}
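For illustration, the kind of varargs overload proposed here could look roughly 
like the sketch below (the signature that actually gets merged may differ):

{code}
// Forward the varargs form to the existing Seq-based overload.
def dropDuplicates(col1: String, cols: String*): Dataset[T] =
  dropDuplicates(col1 +: cols)
{code}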



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12712) test-dependencies.sh script fails when run against empty .m2 cache

2016-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321415#comment-15321415
 ] 

Apache Spark commented on SPARK-12712:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/13568

> test-dependencies.sh script fails when run against empty .m2 cache
> --
>
> Key: SPARK-12712
> URL: https://issues.apache.org/jira/browse/SPARK-12712
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Stavros Kontopoulos
>Assignee: Josh Rosen
>
> The test-dependencies.sh script fails.
> This relates to https://github.com/apache/spark/pull/10461
> Check the failure here:
> https://ci.typesafe.com/job/ghprb-spark-multi-conf/label=Spark-Ora-JDK7-PV,scala_version=2.10/84/console
> My PR does not change dependencies, so shouldn't the PR manifest be generated 
> with the full dependency list? It seems empty. Should I use --replace-manifest?
> Reproducing it locally on that Jenkins instance I get this:
> Spark's published dependencies DO NOT MATCH the manifest file 
> (dev/spark-deps).
> To update the manifest file, run './dev/test-dependencies.sh 
> --replace-manifest'.
> diff --git a/dev/deps/spark-deps-hadoop-2.6 
> b/dev/pr-deps/spark-deps-hadoop-2.6
> index e703c7a..3aa2c38 100644
> --- a/dev/deps/spark-deps-hadoop-2.6
> +++ b/dev/pr-deps/spark-deps-hadoop-2.6
> @@ -1,190 +1,2 @@
> -JavaEWAH-0.3.2.jar
> -RoaringBitmap-0.5.11.jar
> -ST4-4.0.4.jar
> -activation-1.1.1.jar
> -akka-actor_2.10-2.3.11.jar
> -akka-remote_2.10-2.3.11.jar
> -akka-slf4j_2.10-2.3.11.jar
> -antlr-runtime-3.5.2.jar
> -aopalliance-1.0.jar
> -apache-log4j-extras-1.2.17.jar
> -apacheds-i18n-2.0.0-M15.jar
> -apacheds-kerberos-codec-2.0.0-M15.jar
> -api-asn1-api-1.0.0-M20.jar
> -api-util-1.0.0-M20.jar
> -arpack_combined_all-0.1.jar
> -asm-3.1.jar
> -asm-commons-3.1.jar
> -asm-tree-3.1.jar
> -avro-1.7.7.jar
> -avro-ipc-1.7.7-tests.jar
> -avro-ipc-1.7.7.jar
> -avro-mapred-1.7.7-hadoop2.jar
> -base64-2.3.8.jar
> -bcprov-jdk15on-1.51.jar
> -bonecp-0.8.0.RELEASE.jar
> -breeze-macros_2.10-0.11.2.jar
> -breeze_2.10-0.11.2.jar
> -calcite-avatica-1.2.0-incubating.jar
> -calcite-core-1.2.0-incubating.jar
> -calcite-linq4j-1.2.0-incubating.jar
> -chill-java-0.5.0.jar
> -chill_2.10-0.5.0.jar
> -commons-beanutils-1.7.0.jar
> -commons-beanutils-core-1.8.0.jar
> -commons-cli-1.2.jar
> -commons-codec-1.10.jar
> -commons-collections-3.2.2.jar
> -commons-compiler-2.7.6.jar
> -commons-compress-1.4.1.jar
> -commons-configuration-1.6.jar
> -commons-dbcp-1.4.jar
> -commons-digester-1.8.jar
> -commons-httpclient-3.1.jar
> -commons-io-2.4.jar
> -commons-lang-2.6.jar
> -commons-lang3-3.3.2.jar
> -commons-logging-1.1.3.jar
> -commons-math3-3.4.1.jar
> -commons-net-2.2.jar
> -commons-pool-1.5.4.jar
> -compress-lzf-1.0.3.jar
> -config-1.2.1.jar
> -core-1.1.2.jar
> -curator-client-2.6.0.jar
> -curator-framework-2.6.0.jar
> -curator-recipes-2.6.0.jar
> -datanucleus-api-jdo-3.2.6.jar
> -datanucleus-core-3.2.10.jar
> -datanucleus-rdbms-3.2.9.jar
> -derby-10.10.1.1.jar
> -eigenbase-properties-1.1.5.jar
> -geronimo-annotation_1.0_spec-1.1.1.jar
> -geronimo-jaspic_1.0_spec-1.0.jar
> -geronimo-jta_1.1_spec-1.1.1.jar
> -groovy-all-2.1.6.jar
> -gson-2.2.4.jar
> -guice-3.0.jar
> -guice-servlet-3.0.jar
> -hadoop-annotations-2.6.0.jar
> -hadoop-auth-2.6.0.jar
> -hadoop-client-2.6.0.jar
> -hadoop-common-2.6.0.jar
> -hadoop-hdfs-2.6.0.jar
> -hadoop-mapreduce-client-app-2.6.0.jar
> -hadoop-mapreduce-client-common-2.6.0.jar
> -hadoop-mapreduce-client-core-2.6.0.jar
> -hadoop-mapreduce-client-jobclient-2.6.0.jar
> -hadoop-mapreduce-client-shuffle-2.6.0.jar
> -hadoop-yarn-api-2.6.0.jar
> -hadoop-yarn-client-2.6.0.jar
> -hadoop-yarn-common-2.6.0.jar
> -hadoop-yarn-server-common-2.6.0.jar
> -hadoop-yarn-server-web-proxy-2.6.0.jar
> -htrace-core-3.0.4.jar
> -httpclient-4.3.2.jar
> -httpcore-4.3.2.jar
> -ivy-2.4.0.jar
> -jackson-annotations-2.4.4.jar
> -jackson-core-2.4.4.jar
> -jackson-core-asl-1.9.13.jar
> -jackson-databind-2.4.4.jar
> -jackson-jaxrs-1.9.13.jar
> -jackson-mapper-asl-1.9.13.jar
> -jackson-module-scala_2.10-2.4.4.jar
> -jackson-xc-1.9.13.jar
> -janino-2.7.8.jar
> -jansi-1.4.jar
> -java-xmlbuilder-1.0.jar
> -javax.inject-1.jar
> -javax.servlet-3.0.0.v201112011016.jar
> -javolution-5.5.1.jar
> -jaxb-api-2.2.2.jar
> -jaxb-impl-2.2.3-1.jar
> -jcl-over-slf4j-1.7.10.jar
> -jdo-api-3.0.1.jar
> -jersey-client-1.9.jar
> -jersey-core-1.9.jar
> -jersey-guice-1.9.jar
> -jersey-json-1.9.jar
> -jersey-server-1.9.jar
> -jets3t-0.9.3.jar
> -jettison-1.1.jar
> -jetty-6.1.26.jar
> -jetty-all-7.6.0.v20120127.jar
> -jetty-util-6.1.26.jar
> -jline-2.10.5.jar
> -jline-2.12.jar
> -joda-time-2.9.jar
> -jodd-core-3.5.2.jar
> -jpam-1.1.jar
> -json-20090211.jar
> -json4s-ast_2.10-3.2.10.jar
> -js

[jira] [Updated] (SPARK-15828) YARN is not aware of Spark's External Shuffle Service

2016-06-08 Thread Miles Crawford (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miles Crawford updated SPARK-15828:
---
Description: 
When using Spark with dynamic allocation, it is common for all containers on a
particular YARN node to be released.  This is generally okay because of the
external shuffle service.

The problem arises when YARN attempts to downsize the cluster: once all
containers on the node are gone, YARN will decommission the node, regardless of
whether the external shuffle service is still required!

Once the node is shut down, jobs begin failing with messages such as:
{code}
2016-06-07 18:56:40,016 ERROR o.a.s.n.shuffle.RetryingBlockFetcher: Exception 
while beginning fetch of 13 outstanding blocks
java.io.IOException: Failed to connect to 
ip-10-12-32-67.us-west-2.compute.internal/10.12.32.67:7337
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
 
~[d58092b50d2880a1c259cb51c6ed83955f97e34a4b75cedaa8ab00f89a09df50-spark-network-common_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
 
~[d58092b50d2880a1c259cb51c6ed83955f97e34a4b75cedaa8ab00f89a09df50-spark-network-common_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
 
~[2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
 
[2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
 
[2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
 
[2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:152)
 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:316)
 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:263)
 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.(ShuffleBlockFetcherIterator.scala:112)
 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:43)
 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e8

[jira] [Created] (SPARK-15828) YARN is not aware of Spark's External Shuffle Service

2016-06-08 Thread Miles Crawford (JIRA)
Miles Crawford created SPARK-15828:
--

 Summary: YARN is not aware of Spark's External Shuffle Service
 Key: SPARK-15828
 URL: https://issues.apache.org/jira/browse/SPARK-15828
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
 Environment: EMR
Reporter: Miles Crawford


When using Spark with dynamic allocation, it is common for all containers on a
particular YARN node to be released.  This is generally okay because of the
external shuffle service.

The problem arises when YARN attempts to downsize the cluster: once all
containers on the node are gone, YARN will decommission the node, regardless of
whether the external shuffle service is still required!

Once the node is shut down, jobs begin failing with messages such as:
```
2016-06-07 18:56:40,016 ERROR o.a.s.n.shuffle.RetryingBlockFetcher: Exception 
while beginning fetch of 13 outstanding blocks
java.io.IOException: Failed to connect to 
ip-10-12-32-67.us-west-2.compute.internal/10.12.32.67:7337
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
 
~[d58092b50d2880a1c259cb51c6ed83955f97e34a4b75cedaa8ab00f89a09df50-spark-network-common_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
 
~[d58092b50d2880a1c259cb51c6ed83955f97e34a4b75cedaa8ab00f89a09df50-spark-network-common_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
 
~[2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
 
[2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
 
[2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
 
[2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:152)
 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:316)
 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:263)
 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.(ShuffleBlockFetcherIterator.scala:112)
 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:43)
 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
[d56f3336b4a0fcc71fe8beb90052dba
```

[jira] [Created] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode

2016-06-08 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-15829:
---

 Summary: spark master webpage links to application UI broke when 
running in cluster mode
 Key: SPARK-15829
 URL: https://issues.apache.org/jira/browse/SPARK-15829
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.6.1
 Environment: AWS ec2 cluster
Reporter: Andrew Davidson
Priority: Critical


Hi 
I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2

I use the standalone cluster manager and have a streaming app running in 
cluster mode. I noticed that the master web page's links to the application UI 
page are incorrect.

It does not look like JIRA will let me upload images, so I'll try to describe 
the web pages and the bug.

My master is running on
http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:8080/

It has a section marked "applications". If I click on one of the running 
application IDs, I am taken to a page showing "Executor Summary". This page has 
a link to the 'application detail UI'; the URL is 
http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:4041/

Notice that it thinks the application UI is running on the cluster master.

It is actually running on the same machine as the driver, on port 4041. I was 
able to reverse-engineer the URL by noticing that the private IP address is 
part of the worker ID, for example worker-20160322041632-172.31.23.201-34909.

Next I went to the AWS EC2 console to find the public DNS name for that 
machine; the working URL is 
http://ec2-54-193-104-169.us-west-1.compute.amazonaws.com:4041/streaming/
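
A hedged sketch of the reverse-engineering step described above (the helper 
name is mine, not part of any Spark API):
{code}
// Standalone worker ids look like worker-<timestamp>-<ip>-<port>, so the
// private IP sits in the third dash-separated field.
def privateIpFromWorkerId(workerId: String): String =
  workerId.split("-")(2)

privateIpFromWorkerId("worker-20160322041632-172.31.23.201-34909")
// => "172.31.23.201"
{code}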

Kind regards

Andy




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15830) Spark application should get hive tokens only when needed

2016-06-08 Thread Yesha Vora (JIRA)
Yesha Vora created SPARK-15830:
--

 Summary: Spark application should get hive tokens only when needed
 Key: SPARK-15830
 URL: https://issues.apache.org/jira/browse/SPARK-15830
 Project: Spark
  Issue Type: Improvement
Reporter: Yesha Vora


Currently, all Spark applications try to get Hive tokens (even if the 
application does not use them) whenever Hive is installed on the cluster.

Because of this, a Spark application that does not require Hive fails when the 
Hive service (metastore) is down for some reason.

Spark should therefore only try to get Hive tokens when they are required, and 
should not fetch them if the application does not need them.

Example: the SparkPi application does not perform any Hive-related actions, but 
it still fails if the Hive metastore service is down.
{code}
16/06/08 01:18:42 INFO YarnSparkHadoopUtil: getting token for namenode: 
hdfs://xxx:8020/user/xx/.sparkStaging/application_1465347287950_0001
16/06/08 01:18:42 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 7 for xx 
on xx.xx.xx.xxx:8020
16/06/08 01:18:43 INFO metastore: Trying to connect to metastore with URI 
thrift://xx.xx.xx.xxx:9090
16/06/08 01:18:43 WARN metastore: Failed to connect to the MetaStore Server...
16/06/08 01:18:43 INFO metastore: Waiting 5 seconds before next connection 
attempt.
16/06/08 01:18:48 INFO metastore: Trying to connect to metastore with URI 
thrift://xx.xx.xx.xxx:9090
16/06/08 01:18:48 WARN metastore: Failed to connect to the MetaStore Server...
16/06/08 01:18:48 INFO metastore: Waiting 5 seconds before next connection 
attempt.
16/06/08 01:18:53 INFO metastore: Trying to connect to metastore with URI 
thrift://xx.xx.xx.xxx:9090
16/06/08 01:18:53 WARN metastore: Failed to connect to the MetaStore Server...
16/06/08 01:18:53 INFO metastore: Waiting 5 seconds before next connection 
attempt.
16/06/08 01:18:59 WARN Hive: Failed to access metastore. This class should not 
accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:166)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498){code}
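
For context, the SparkPi computation is essentially the following (a sketch 
assuming a spark-shell session where `sc` is already defined); none of it 
touches Hive, yet the application still fails at launch because the 
token-fetching path contacts the metastore:
{code}
import scala.math.random

// A SparkPi-style job: pure RDD arithmetic, no Hive or metastore access.
val n = 100000
val hits = sc.parallelize(1 to n).map { _ =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0
}.reduce(_ + _)
println(s"Pi is roughly ${4.0 * hits / n}")
{code}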



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15831) Kryo 2.21 TreeMap serialization bug causes random job failures with RDDs of HBase puts

2016-06-08 Thread JIRA
Charles Gariépy-Ikeson created SPARK-15831:
--

 Summary: Kryo 2.21 TreeMap serialization bug causes random job 
failures with RDDs of HBase puts
 Key: SPARK-15831
 URL: https://issues.apache.org/jira/browse/SPARK-15831
 Project: Spark
  Issue Type: Bug
Reporter: Charles Gariépy-Ikeson


This was found on Spark 1.5, but it looks like every Spark 1.x release brings 
in the problematic dependency in question.

Kryo 2.21 has a bug when serializing TreeMap that causes intermittent failures 
in Spark. The problem shows up especially when sinking data to HBase using an 
RDD of HBase Puts (which internally contain a TreeMap).

Kryo fixed the issue in 2.21.1. The current workaround is to set 
"spark.kryo.referenceTracking" to false (see the sketch after the references 
below).

For reference see:
Kryo commit: 
https://github.com/EsotericSoftware/kryo/commit/00ffc7ed443e022a8438d1e4c4f5b86fe4f9912b
TreeMap Kryo Issue: https://github.com/EsotericSoftware/kryo/issues/112
HBase Put Kryo Issue: https://github.com/EsotericSoftware/kryo/issues/428
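
A minimal sketch of the workaround mentioned above, assuming Kryo serialization 
is enabled through the standard properties:
{code}
import org.apache.spark.SparkConf

// Disable Kryo reference tracking so the TreeMap code path affected by the
// Kryo 2.21 bug is not exercised.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.referenceTracking", "false")
{code}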



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15830) Spark application should get hive tokens only when it is required

2016-06-08 Thread Yesha Vora (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesha Vora updated SPARK-15830:
---
Summary: Spark application should get hive tokens only when it is required  
(was: Spark application should get hive tokens only when needed)

> Spark application should get hive tokens only when it is required
> -
>
> Key: SPARK-15830
> URL: https://issues.apache.org/jira/browse/SPARK-15830
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yesha Vora
>
> Currently, all Spark applications try to get Hive tokens (even if the 
> application does not use them) whenever Hive is installed on the cluster.
> Because of this, a Spark application that does not require Hive fails when 
> the Hive service (metastore) is down for some reason.
> Spark should therefore only try to get Hive tokens when they are required, 
> and should not fetch them if the application does not need them.
> Example: the SparkPi application does not perform any Hive-related actions, 
> but it still fails if the Hive metastore service is down.
> {code}
> 16/06/08 01:18:42 INFO YarnSparkHadoopUtil: getting token for namenode: 
> hdfs://xxx:8020/user/xx/.sparkStaging/application_1465347287950_0001
> 16/06/08 01:18:42 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 7 for 
> xx on xx.xx.xx.xxx:8020
> 16/06/08 01:18:43 INFO metastore: Trying to connect to metastore with URI 
> thrift://xx.xx.xx.xxx:9090
> 16/06/08 01:18:43 WARN metastore: Failed to connect to the MetaStore Server...
> 16/06/08 01:18:43 INFO metastore: Waiting 5 seconds before next connection 
> attempt.
> 16/06/08 01:18:48 INFO metastore: Trying to connect to metastore with URI 
> thrift://xx.xx.xx.xxx:9090
> 16/06/08 01:18:48 WARN metastore: Failed to connect to the MetaStore Server...
> 16/06/08 01:18:48 INFO metastore: Waiting 5 seconds before next connection 
> attempt.
> 16/06/08 01:18:53 INFO metastore: Trying to connect to metastore with URI 
> thrift://xx.xx.xx.xxx:9090
> 16/06/08 01:18:53 WARN metastore: Failed to connect to the MetaStore Server...
> 16/06/08 01:18:53 INFO metastore: Waiting 5 seconds before next connection 
> attempt.
> 16/06/08 01:18:59 WARN Hive: Failed to access metastore. This class should 
> not accessed in runtime.
> org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
> Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
> at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
> at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
> at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:166)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498){code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15743) Prevent saving with all-column partitioning

2016-06-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15743:
-
Labels: releasenotes  (was: )

> Prevent saving with all-column partitioning
> ---
>
> Key: SPARK-15743
> URL: https://issues.apache.org/jira/browse/SPARK-15743
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>  Labels: releasenotes
>
> When saving datasets to storage, `partitionBy` provides an easy way to 
> construct the directory structure. However, if a user chooses all columns as 
> partition columns, exceptions occur.
> - ORC: `AnalysisException` on **future read** due to schema inference failure.
> - Parquet: `InvalidSchemaException` on **write execution** due to a Parquet 
> limitation.
> The following are examples.
> **ORC with all column partitioning**
> {code}
> scala> 
> spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save("/tmp/data")
>   
>   
> scala> spark.read.format("orc").load("/tmp/data").collect()
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at 
> /tmp/data. It must be specified manually;
> {code}
> **Parquet with all-column partitioning**
> {code}
> scala> 
> spark.range(100).write.format("parquet").mode("overwrite").partitionBy("id").save("/tmp/data")
> [Stage 0:>  (0 + 8) / 
> 8]16/06/02 16:51:17 ERROR Utils: Aborting task
> org.apache.parquet.schema.InvalidSchemaException: A group type can not be 
> empty. Parquet does not support empty group without leaves. Empty group: 
> spark_schema
> ... (lots of error messages)
> {code}
> Although some formats like JSON support all-column partitioning without any 
> problem, creating lots of empty directories does not seem like a good idea. 
> This issue prevents that by consistently raising `AnalysisException` before 
> saving. 
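> A partition column set that is a strict subset of the schema works as 
> expected; a minimal sketch of that case (illustrative, not part of the 
> original report, and the path is a placeholder):
> {code}
> scala> spark.range(10).selectExpr("id", "id % 2 as part").write.format("parquet").partitionBy("part").save("/tmp/data_ok")
> scala> spark.read.format("parquet").load("/tmp/data_ok").show()
> {code}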



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11695) Set s3a credentials by default similarly to s3 and s3n

2016-06-08 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321539#comment-15321539
 ] 

Steve Loughran commented on SPARK-11695:


There are some interesting ramifications to this code: it means that if the env 
vars are set, they overwrite any value in core-default.xml. It's also going to 
slightly complicate the workings of HADOOP-12807; now that the AWS env vars are 
being picked up, there's a whole set of config options which ought to be 
handled together. The session token is the big one: if that var is set, then 
fixing up the fs.s3a access and secret keys alone will stop operations from 
working.

http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html#cli-environment
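
For illustration, mirroring the AWS env vars into the s3a properties by hand 
looks roughly like this (a sketch; fs.s3a.session.token is only honoured by 
newer hadoop-aws versions, so treat that part as an assumption):
{code}
// Copy the AWS env vars into the corresponding s3a Hadoop properties.
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
// If a session token is in play, it has to travel with the keys as well:
sys.env.get("AWS_SESSION_TOKEN").foreach { t =>
  sc.hadoopConfiguration.set("fs.s3a.session.token", t)
}
{code}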




> Set s3a credentials by default similarly to s3 and s3n
> --
>
> Key: SPARK-11695
> URL: https://issues.apache.org/jira/browse/SPARK-11695
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Chris Bannister
>Assignee: Chris Bannister
>Priority: Trivial
> Fix For: 1.6.0
>
>
> When creating a new hadoop configuration Spark sets s3 and s3n credentials if 
> the environment variables are set, it should also add s3a.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15791) NPE in ScalarSubquery

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15791:


Assignee: Apache Spark  (was: Eric Liang)

> NPE in ScalarSubquery
> -
>
> Key: SPARK-15791
> URL: https://issues.apache.org/jira/browse/SPARK-15791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> {code}
> Job aborted due to stage failure: Task 0 in stage 146.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 146.0 (TID 48828, 10.0.206.208): 
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.dataType(subquery.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.CaseWhenBase.dataType(conditionalExpressions.scala:103)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:165)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.execution.ProjectExec.output(basicPhysicalOperators.scala:33)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.output(WholeStageCodegenExec.scala:291)
>   at 
> org.apache.spark.sql.execution.DeserializeToObjectExec$$anonfun$2.apply(objects.scala:85)
>   at 
> org.apache.spark.sql.execution.DeserializeToObjectExec$$anonfun$2.apply(objects.scala:84)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15791) NPE in ScalarSubquery

2016-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321574#comment-15321574
 ] 

Apache Spark commented on SPARK-15791:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/13569

> NPE in ScalarSubquery
> -
>
> Key: SPARK-15791
> URL: https://issues.apache.org/jira/browse/SPARK-15791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Eric Liang
>
> {code}
> Job aborted due to stage failure: Task 0 in stage 146.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 146.0 (TID 48828, 10.0.206.208): 
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.dataType(subquery.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.CaseWhenBase.dataType(conditionalExpressions.scala:103)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:165)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.execution.ProjectExec.output(basicPhysicalOperators.scala:33)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.output(WholeStageCodegenExec.scala:291)
>   at 
> org.apache.spark.sql.execution.DeserializeToObjectExec$$anonfun$2.apply(objects.scala:85)
>   at 
> org.apache.spark.sql.execution.DeserializeToObjectExec$$anonfun$2.apply(objects.scala:84)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15791) NPE in ScalarSubquery

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15791:


Assignee: Eric Liang  (was: Apache Spark)

> NPE in ScalarSubquery
> -
>
> Key: SPARK-15791
> URL: https://issues.apache.org/jira/browse/SPARK-15791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Eric Liang
>
> {code}
> Job aborted due to stage failure: Task 0 in stage 146.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 146.0 (TID 48828, 10.0.206.208): 
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.dataType(subquery.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.CaseWhenBase.dataType(conditionalExpressions.scala:103)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:165)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.execution.ProjectExec.output(basicPhysicalOperators.scala:33)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.output(WholeStageCodegenExec.scala:291)
>   at 
> org.apache.spark.sql.execution.DeserializeToObjectExec$$anonfun$2.apply(objects.scala:85)
>   at 
> org.apache.spark.sql.execution.DeserializeToObjectExec$$anonfun$2.apply(objects.scala:84)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15832) Embedded IN/EXISTS predicate subquery throws TreeNodeException

2016-06-08 Thread Ioana Delaney (JIRA)
Ioana Delaney created SPARK-15832:
-

 Summary: Embedded IN/EXISTS predicate subquery throws 
TreeNodeException
 Key: SPARK-15832
 URL: https://issues.apache.org/jira/browse/SPARK-15832
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Ioana Delaney
Priority: Minor


Queries with embedded existential sub-query predicates throw an exception when 
building the physical plan.

Example failing query:

{code}
scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
scala> sql("select c1 from t1 where (case when c2 in (select c2 from t2) then 2 
else 3 end) IN (select c2 from t1)").show()

Binding attribute, tree: c2#239
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: c2#239
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
  at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)

  ...
  at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
  at 
org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
  at 
org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.execution.joins.HashJoin$class.org$apache$spark$sql$execution$joins$HashJoin$$x$8(HashJoin.scala:66)
  at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8$lzycompute(BroadcastHashJoinExec.scala:38)
  at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8(BroadcastHashJoinExec.scala:38)
  at 
org.apache.spark.sql.execution.joins.HashJoin$class.buildKeys(HashJoin.scala:63)
  at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys$lzycompute(BroadcastHashJoinExec.scala:38)
  at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys(BroadcastHashJoinExec.scala:38)
  at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.requiredChildDistribution(BroadcastHashJoinExec.scala:52)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15832) Embedded IN/EXISTS predicate subquery throws TreeNodeException

2016-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321617#comment-15321617
 ] 

Apache Spark commented on SPARK-15832:
--

User 'ioana-delaney' has created a pull request for this issue:
https://github.com/apache/spark/pull/13570

> Embedded IN/EXISTS predicate subquery throws TreeNodeException
> --
>
> Key: SPARK-15832
> URL: https://issues.apache.org/jira/browse/SPARK-15832
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ioana Delaney
>Priority: Minor
>
> Queries with embedded existential sub-query predicates throw an exception 
> when building the physical plan.
> Example failing query:
> {code}
> scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
> scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
> scala> sql("select c1 from t1 where (case when c2 in (select c2 from t2) then 
> 2 else 3 end) IN (select c2 from t1)").show()
> Binding attribute, tree: c2#239
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: c2#239
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   ...
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$class.org$apache$spark$sql$execution$joins$HashJoin$$x$8(HashJoin.scala:66)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8$lzycompute(BroadcastHashJoinExec.scala:38)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8(BroadcastHashJoinExec.scala:38)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$class.buildKeys(HashJoin.scala:63)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys$lzycompute(BroadcastHashJoinExec.scala:38)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys(BroadcastHashJoinExec.scala:38)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.requiredChildDistribution(BroadcastHashJoinExec.scala:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15832) Embedded IN/EXISTS predicate subquery throws TreeNodeException

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15832:


Assignee: (was: Apache Spark)

> Embedded IN/EXISTS predicate subquery throws TreeNodeException
> --
>
> Key: SPARK-15832
> URL: https://issues.apache.org/jira/browse/SPARK-15832
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ioana Delaney
>Priority: Minor
>
> Queries with embedded existential sub-query predicates throw an exception 
> when building the physical plan.
> Example failing query:
> {code}
> scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
> scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
> scala> sql("select c1 from t1 where (case when c2 in (select c2 from t2) then 
> 2 else 3 end) IN (select c2 from t1)").show()
> Binding attribute, tree: c2#239
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: c2#239
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   ...
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$class.org$apache$spark$sql$execution$joins$HashJoin$$x$8(HashJoin.scala:66)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8$lzycompute(BroadcastHashJoinExec.scala:38)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8(BroadcastHashJoinExec.scala:38)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$class.buildKeys(HashJoin.scala:63)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys$lzycompute(BroadcastHashJoinExec.scala:38)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys(BroadcastHashJoinExec.scala:38)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.requiredChildDistribution(BroadcastHashJoinExec.scala:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15832) Embedded IN/EXISTS predicate subquery throws TreeNodeException

2016-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15832:


Assignee: Apache Spark

> Embedded IN/EXISTS predicate subquery throws TreeNodeException
> --
>
> Key: SPARK-15832
> URL: https://issues.apache.org/jira/browse/SPARK-15832
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ioana Delaney
>Assignee: Apache Spark
>Priority: Minor
>
> Queries with embedded existential sub-query predicates throw an exception 
> when building the physical plan.
> Example failing query:
> {code}
> scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
> scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
> scala> sql("select c1 from t1 where (case when c2 in (select c2 from t2) then 
> 2 else 3 end) IN (select c2 from t1)").show()
> Binding attribute, tree: c2#239
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: c2#239
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   ...
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$class.org$apache$spark$sql$execution$joins$HashJoin$$x$8(HashJoin.scala:66)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8$lzycompute(BroadcastHashJoinExec.scala:38)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8(BroadcastHashJoinExec.scala:38)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$class.buildKeys(HashJoin.scala:63)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys$lzycompute(BroadcastHashJoinExec.scala:38)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys(BroadcastHashJoinExec.scala:38)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.requiredChildDistribution(BroadcastHashJoinExec.scala:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15832) Embedded IN/EXISTS predicate subquery throws TreeNodeException

2016-06-08 Thread Ioana Delaney (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ioana Delaney updated SPARK-15832:
--
Component/s: SQL

> Embedded IN/EXISTS predicate subquery throws TreeNodeException
> --
>
> Key: SPARK-15832
> URL: https://issues.apache.org/jira/browse/SPARK-15832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ioana Delaney
>Priority: Minor
>
> Queries with embedded existential sub-query predicates throw an exception 
> when building the physical plan.
> Example failing query:
> {code}
> scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
> scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
> scala> sql("select c1 from t1 where (case when c2 in (select c2 from t2) then 
> 2 else 3 end) IN (select c2 from t1)").show()
> Binding attribute, tree: c2#239
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: c2#239
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   ...
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$class.org$apache$spark$sql$execution$joins$HashJoin$$x$8(HashJoin.scala:66)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8$lzycompute(BroadcastHashJoinExec.scala:38)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8(BroadcastHashJoinExec.scala:38)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$class.buildKeys(HashJoin.scala:63)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys$lzycompute(BroadcastHashJoinExec.scala:38)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys(BroadcastHashJoinExec.scala:38)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.requiredChildDistribution(BroadcastHashJoinExec.scala:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8426) Add blacklist mechanism for YARN container allocation

2016-06-08 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321626#comment-15321626
 ] 

Kay Ousterhout commented on SPARK-8426:
---

Imran, can you enable commenting on the design doc?

> Add blacklist mechanism for YARN container allocation
> -
>
> Key: SPARK-8426
> URL: https://issues.apache.org/jira/browse/SPARK-8426
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Reporter: Saisai Shao
>Priority: Minor
> Attachments: DesignDocforBlacklistMechanism.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


