[jira] [Updated] (SPARK-2963) There no documentation for building about SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kousuke Saruta updated SPARK-2963:
----------------------------------
    Description: 
Currently, if we'd like to use ThriftServer or CLI for SparkSQL, we need to use -Phive-thriftserver option on building but it's implicit. I think we need to describe how to build.

  was:
Currently, if we'd like to use SparkSQL, we need to use -Phive-thriftserver option on building but it's implicit. I think we need to describe how to build.


There no documentation for building about SparkSQL
--------------------------------------------------
                Key: SPARK-2963
                URL: https://issues.apache.org/jira/browse/SPARK-2963
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.1.0
           Reporter: Kousuke Saruta

Currently, if we'd like to use ThriftServer or CLI for SparkSQL, we need to use -Phive-thriftserver option on building but it's implicit. I think we need to describe how to build.

--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2963) There no documentation about building ThriftServer and CLI for SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kousuke Saruta updated SPARK-2963:
----------------------------------
    Summary: There no documentation about building ThriftServer and CLI for SparkSQL  (was: There no documentation for building about SparkSQL)


There no documentation about building ThriftServer and CLI for SparkSQL
-----------------------------------------------------------------------
                Key: SPARK-2963
                URL: https://issues.apache.org/jira/browse/SPARK-2963
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.1.0
           Reporter: Kousuke Saruta

Currently, if we'd like to use ThriftServer or CLI for SparkSQL, we need to use -Phive-thriftserver option on building but it's implicit. I think we need to describe how to build.
[jira] [Updated] (SPARK-2963) There no documentation about building to use HiveServer and CLI for SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kousuke Saruta updated SPARK-2963:
----------------------------------
    Summary: There no documentation about building to use HiveServer and CLI for SparkSQL  (was: There no documentation about building ThriftServer and CLI for SparkSQL)


There no documentation about building to use HiveServer and CLI for SparkSQL
----------------------------------------------------------------------------
                Key: SPARK-2963
                URL: https://issues.apache.org/jira/browse/SPARK-2963
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.1.0
           Reporter: Kousuke Saruta

Currently, if we'd like to use ThriftServer or CLI for SparkSQL, we need to use -Phive-thriftserver option on building but it's implicit. I think we need to describe how to build.
[jira] [Updated] (SPARK-2963) There no documentation about building to use HiveServer and CLI for SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kousuke Saruta updated SPARK-2963:
----------------------------------
    Description: 
Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use -Phive-thriftserver option when building but it's implicit. I think we need to describe how to build.

  was:
Currently, if we'd like to use ThriftServer or CLI for SparkSQL, we need to use -Phive-thriftserver option on building but it's implicit. I think we need to describe how to build.


There no documentation about building to use HiveServer and CLI for SparkSQL
----------------------------------------------------------------------------
                Key: SPARK-2963
                URL: https://issues.apache.org/jira/browse/SPARK-2963
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.1.0
           Reporter: Kousuke Saruta

Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use -Phive-thriftserver option when building but it's implicit. I think we need to describe how to build.
[jira] [Comment Edited] (SPARK-2204) Scheduler for Mesos in fine-grained mode launches tasks on wrong executors
[ https://issues.apache.org/jira/browse/SPARK-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092376#comment-14092376 ]

Xu Zhongxing edited comment on SPARK-2204 at 8/11/14 6:49 AM:
--------------------------------------------------------------
I encountered this issue again when I use Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector master branch. Maybe this is not fixed on some failure/exception paths.

I run spark in coarse-grained mode. There are some exceptions thrown at the executors. But the spark driver is waiting and printing repeatedly:

TRACE [spark-akka.actor.default-dispatcher-17] 2014-08-11 10:57:32,998 Logging.scala (line 66) Checking for hosts with no recent heart beats in BlockManagerMaster.

The mesos master WARNING log:

W0811 10:32:58.172175 1646 master.cpp:2103] Ignoring unknown exited executor 20140808-113811-858302656-5050-1645-2 on slave 20140808-113811-858302656-5050-1645-2 (ndb9)
W0811 10:32:58.181217 1649 master.cpp:2103] Ignoring unknown exited executor 20140808-113811-858302656-5050-1645-5 on slave 20140808-113811-858302656-5050-1645-5 (ndb5)
W0811 10:32:58.277014 1650 master.cpp:2103] Ignoring unknown exited executor 20140808-113811-858302656-5050-1645-3 on slave 20140808-113811-858302656-5050-1645-3 (ndb6)
W0811 10:32:58.344130 1648 master.cpp:2103] Ignoring unknown exited executor 20140808-113811-858302656-5050-1645-0 on slave 20140808-113811-858302656-5050-1645-0 (ndb0)
W0811 10:32:58.354117 1651 master.cpp:2103] Ignoring unknown exited executor 20140804-095254-505981120-5050-20258-11 on slave 20140804-095254-505981120-5050-20258-11 (ndb2)
W0811 10:32:58.550233 1647 master.cpp:2103] Ignoring unknown exited executor 20140804-172212-505981120-5050-26571-2 on slave 20140804-172212-505981120-5050-26571-2 (ndb3)
W0811 10:32:58.793258 1653 master.cpp:2103] Ignoring unknown exited executor 20140804-095254-505981120-5050-20258-19 on slave 20140804-095254-505981120-5050-20258-19 (ndb1)
W0811 10:32:58.904842 1652 master.cpp:2103] Ignoring unknown exited executor 20140804-172212-505981120-5050-26571-0 on slave 20140804-172212-505981120-5050-26571-0 (ndb4)

Some other logs are at: https://github.com/datastax/spark-cassandra-connector/issues/134

was (Author: xuzhongxing):
I encountered this issue again when I use Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector master branch.

I run spark in coarse-grained mode. There are some exceptions thrown at the executors. But the spark driver is waiting and printing repeatedly:

TRACE [spark-akka.actor.default-dispatcher-17] 2014-08-11 10:57:32,998 Logging.scala (line 66) Checking for hosts with no recent heart beats in BlockManagerMaster.

The mesos master WARNING log:

W0811 10:32:58.172175 1646 master.cpp:2103] Ignoring unknown exited executor 20140808-113811-858302656-5050-1645-2 on slave 20140808-113811-858302656-5050-1645-2 (ndb9)
W0811 10:32:58.181217 1649 master.cpp:2103] Ignoring unknown exited executor 20140808-113811-858302656-5050-1645-5 on slave 20140808-113811-858302656-5050-1645-5 (ndb5)
W0811 10:32:58.277014 1650 master.cpp:2103] Ignoring unknown exited executor 20140808-113811-858302656-5050-1645-3 on slave 20140808-113811-858302656-5050-1645-3 (ndb6)
W0811 10:32:58.344130 1648 master.cpp:2103] Ignoring unknown exited executor 20140808-113811-858302656-5050-1645-0 on slave 20140808-113811-858302656-5050-1645-0 (ndb0)
W0811 10:32:58.354117 1651 master.cpp:2103] Ignoring unknown exited executor 20140804-095254-505981120-5050-20258-11 on slave 20140804-095254-505981120-5050-20258-11 (ndb2)
W0811 10:32:58.550233 1647 master.cpp:2103] Ignoring unknown exited executor 20140804-172212-505981120-5050-26571-2 on slave 20140804-172212-505981120-5050-26571-2 (ndb3)
W0811 10:32:58.793258 1653 master.cpp:2103] Ignoring unknown exited executor 20140804-095254-505981120-5050-20258-19 on slave 20140804-095254-505981120-5050-20258-19 (ndb1)
W0811 10:32:58.904842 1652 master.cpp:2103] Ignoring unknown exited executor 20140804-172212-505981120-5050-26571-0 on slave 20140804-172212-505981120-5050-26571-0 (ndb4)

Some other logs are at: https://github.com/datastax/spark-cassandra-connector/issues/134


Scheduler for Mesos in fine-grained mode launches tasks on wrong executors
--------------------------------------------------------------------------
                Key: SPARK-2204
                URL: https://issues.apache.org/jira/browse/SPARK-2204
            Project: Spark
         Issue Type: Bug
         Components: Mesos
   Affects Versions: 1.0.0
           Reporter: Sebastien Rainville
           Assignee: Sebastien Rainville
           Priority: Blocker
            Fix For: 1.0.1, 1.1.0

MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning task lists in the same order as the offers it was passed, but in
[jira] [Created] (SPARK-2964) Wrong silent option in spark-sql script
Kousuke Saruta created SPARK-2964:
----------------------------------
            Summary: Wrong silent option in spark-sql script
                Key: SPARK-2964
                URL: https://issues.apache.org/jira/browse/SPARK-2964
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.1.0
           Reporter: Kousuke Saruta
           Priority: Minor

In the spark-sql script, the -s option is handled as the silent option, but org.apache.hadoop.hive.cli.OptionProcessor interprets -S (uppercase) as the silent mode option.
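The mismatch is purely one of case sensitivity. The Java sketch below is hypothetical (a toy isSilent helper, not the actual OptionProcessor code); it only illustrates why the lowercase flag is silently dropped:

```java
// Hypothetical stand-in for Hive's option handling: flags are matched
// case-sensitively, so "-S" enables silent mode while "-s" does not.
public class SilentFlag {
    public static boolean isSilent(String[] args) {
        for (String a : args) {
            if (a.equals("-S")) return true; // only the uppercase form is recognized
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isSilent(new String[] {"-s"})); // lowercase: not recognized
        System.out.println(isSilent(new String[] {"-S"})); // uppercase: silent mode
    }
}
```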
[jira] [Created] (SPARK-2965) Fix HashOuterJoin output nullabilities.
Takuya Ueshin created SPARK-2965:
---------------------------------
            Summary: Fix HashOuterJoin output nullabilities.
                Key: SPARK-2965
                URL: https://issues.apache.org/jira/browse/SPARK-2965
            Project: Spark
         Issue Type: Bug
         Components: SQL
           Reporter: Takuya Ueshin

Output attributes of the opposite side of an {{OuterJoin}} should be nullable.
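The intended behavior can be pictured with a toy schema model. The Attr class below is hypothetical (not Catalyst's Attribute): in an outer join, every attribute from a side that may produce no matching row must be marked nullable in the output.

```java
import java.util.ArrayList;
import java.util.List;

// Toy schema model: in a left outer join, the right side may yield nulls,
// so its attributes must become nullable in the join output.
public class OuterJoinNullability {
    public static final class Attr {
        public final String name;
        public final boolean nullable;
        public Attr(String name, boolean nullable) { this.name = name; this.nullable = nullable; }
    }

    public static List<Attr> leftOuterOutput(List<Attr> left, List<Attr> right) {
        List<Attr> out = new ArrayList<>(left);          // left side keeps its nullability
        for (Attr a : right) out.add(new Attr(a.name, true)); // right side forced nullable
        return out;
    }
}
```

A right outer join is the mirror image, and a full outer join marks both sides nullable.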
[jira] [Updated] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Ishikawa updated SPARK-2966:
-------------------------------
    Summary: Add an approximation algorithm for hierarchical clustering to MLlib  (was: Add an approximation algorithm for hierarchical clustering algorithm to MLlib)


Add an approximation algorithm for hierarchical clustering to MLlib
-------------------------------------------------------------------
                Key: SPARK-2966
                URL: https://issues.apache.org/jira/browse/SPARK-2966
            Project: Spark
         Issue Type: New Feature
         Components: MLlib
           Reporter: Yu Ishikawa
           Priority: Minor

A hierarchical clustering algorithm is a useful unsupervised learning method. Koga et al. proposed a highly scalable hierarchical clustering algorithm in (1). I would like to implement this method. I suggest adding an approximate hierarchical clustering algorithm to MLlib. I'd like this to be assigned to me.

h3. Reference
# Fast agglomerative hierarchical clustering algorithm using Locality-Sensitive Hashing http://dl.acm.org/citation.cfm?id=1266811
[jira] [Created] (SPARK-2967) Several SQL unit test failed when sort-based shuffle is enabled
Saisai Shao created SPARK-2967:
-------------------------------
            Summary: Several SQL unit test failed when sort-based shuffle is enabled
                Key: SPARK-2967
                URL: https://issues.apache.org/jira/browse/SPARK-2967
            Project: Spark
         Issue Type: Bug
   Affects Versions: 1.1.0
           Reporter: Saisai Shao

Several SQLQuerySuite unit tests fail when sort-based shuffle is enabled. It seems the SQL tests use GenericMutableRow, which finally makes all entries in ExternalSorter's internal buffer refer to the same object because of the object's mutability. It seems rows should be copied when fed into ExternalSorter. The error is shown below; though there are many failures, I only pasted part of them:

{noformat}
SQLQuerySuite:
- SPARK-2041 column name equals tablename
- SPARK-2407 Added Parser of SQL SUBSTR()
- index into array
- left semi greater than predicate
- index into array of arrays
- agg *** FAILED ***
  Results do not match for query:
  Aggregate ['a], ['a,SUM('b) AS c1#38]
   UnresolvedRelation None, testData2, None
  == Analyzed Plan ==
  Aggregate [a#4], [a#4,SUM(CAST(b#5, LongType)) AS c1#38L]
   SparkLogicalPlan (ExistingRdd [a#4,b#5], MapPartitionsRDD[7] at mapPartitions at basicOperators.scala:215)
  == Physical Plan ==
  Aggregate false, [a#4], [a#4,SUM(PartialSum#40L) AS c1#38L]
   Exchange (HashPartitioning [a#4], 200)
    Aggregate true, [a#4], [a#4,SUM(CAST(b#5, LongType)) AS PartialSum#40L]
     ExistingRdd [a#4,b#5], MapPartitionsRDD[7] at mapPartitions at basicOperators.scala:215
  == Results ==
  !== Correct Answer - 3 ==   == Spark Answer - 3 ==
  !Vector(1, 3)               [1,3]
  !Vector(2, 3)               [1,3]
  !Vector(3, 3)               [1,3] (QueryTest.scala:53)
- aggregates with nulls
- select *
- simple select
- sorting *** FAILED ***
  Results do not match for query:
  Sort ['a ASC,'b ASC]
   Project [*]
    UnresolvedRelation None, testData2, None
  == Analyzed Plan ==
  Sort [a#4 ASC,b#5 ASC]
   Project [a#4,b#5]
    SparkLogicalPlan (ExistingRdd [a#4,b#5], MapPartitionsRDD[7] at mapPartitions at basicOperators.scala:215)
  == Physical Plan ==
  Sort [a#4 ASC,b#5 ASC], true
   Exchange (RangePartitioning [a#4 ASC,b#5 ASC], 200)
    ExistingRdd [a#4,b#5], MapPartitionsRDD[7] at mapPartitions at basicOperators.scala:215
  == Results ==
  !== Correct Answer - 6 ==   == Spark Answer - 6 ==
  !Vector(1, 1)               [3,2]
  !Vector(1, 2)               [3,2]
  !Vector(2, 1)               [3,2]
  !Vector(2, 2)               [3,2]
  !Vector(3, 1)               [3,2]
  !Vector(3, 2)               [3,2] (QueryTest.scala:53)
- limit
- average
- average overflow *** FAILED ***
  Results do not match for query:
  Aggregate ['b], [AVG('a) AS c0#90,'b]
   UnresolvedRelation None, largeAndSmallInts, None
  == Analyzed Plan ==
  Aggregate [b#3], [AVG(CAST(a#2, LongType)) AS c0#90,b#3]
   SparkLogicalPlan (ExistingRdd [a#2,b#3], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:215)
  == Physical Plan ==
  Aggregate false, [b#3], [(CAST(SUM(PartialSum#93L), DoubleType) / CAST(SUM(PartialCount#94L), DoubleType)) AS c0#90,b#3]
   Exchange (HashPartitioning [b#3], 200)
    Aggregate true, [b#3], [b#3,COUNT(CAST(a#2, LongType)) AS PartialCount#94L,SUM(CAST(a#2, LongType)) AS PartialSum#93L]
     ExistingRdd [a#2,b#3], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:215
  == Results ==
  !== Correct Answer - 2 ==   == Spark Answer - 2 ==
  !Vector(2.0, 2)             [2.147483645E9,1]
  !Vector(2.147483645E9, 1)   [2.147483645E9,1] (QueryTest.scala:53)
{noformat}
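The row-reuse hazard described above can be reproduced in miniature. The Java sketch below is hypothetical (a toy MutableRow standing in for GenericMutableRow, an ArrayList standing in for ExternalSorter's buffer): buffering a reused mutable row without copying leaves every slot aliasing the same final state, while copying preserves each row.

```java
import java.util.ArrayList;
import java.util.List;

// Miniature of the aliasing bug: one mutable row is reused for every value.
public class RowReuse {
    public static final class MutableRow { public int value; }

    // Buffer the values 1, 2, 3 produced through ONE reused mutable row.
    public static List<Integer> buffer(boolean copy) {
        MutableRow reused = new MutableRow();
        List<MutableRow> buf = new ArrayList<>();
        for (int v = 1; v <= 3; v++) {
            reused.value = v;
            if (copy) {
                MutableRow fresh = new MutableRow(); // defensive copy, as the fix suggests
                fresh.value = reused.value;
                buf.add(fresh);
            } else {
                buf.add(reused); // every slot points at the same object
            }
        }
        List<Integer> values = new ArrayList<>();
        for (MutableRow r : buf) values.add(r.value);
        return values;
    }
}
```

Without copying, all buffered slots report the last value written (3, 3, 3); with copying, the original values (1, 2, 3) survive, which is exactly why rows must be copied before a buffering consumer like ExternalSorter sees them.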
[jira] [Updated] (SPARK-2969) Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull.
[ https://issues.apache.org/jira/browse/SPARK-2969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin updated SPARK-2969:
---------------------------------
    Description: 
Make {{ScalaReflection}} be able to handle like:
- Seq\[Int] as ArrayType(IntegerType, containsNull = false)
- Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true)
- Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false)
- Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, valueContainsNull = true)

  was:
Make {{ScalaReflection}} be able to handle:
- Seq\[Int] as ArrayType(IntegerType, containsNull = false)
- Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true)
- Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false)
- Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, valueContainsNull = true)


Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull.
------------------------------------------------------------------------------------------
                Key: SPARK-2969
                URL: https://issues.apache.org/jira/browse/SPARK-2969
            Project: Spark
         Issue Type: Improvement
         Components: SQL
           Reporter: Takuya Ueshin

Make {{ScalaReflection}} be able to handle like:
- Seq\[Int] as ArrayType(IntegerType, containsNull = false)
- Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true)
- Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false)
- Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, valueContainsNull = true)
[jira] [Created] (SPARK-2969) Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull.
Takuya Ueshin created SPARK-2969:
---------------------------------
            Summary: Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull.
                Key: SPARK-2969
                URL: https://issues.apache.org/jira/browse/SPARK-2969
            Project: Spark
         Issue Type: Improvement
         Components: SQL
           Reporter: Takuya Ueshin

Make {{ScalaReflection}} be able to handle:
- Seq\[Int] as ArrayType(IntegerType, containsNull = false)
- Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true)
- Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false)
- Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, valueContainsNull = true)
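The rule behind the four mappings above can be sketched in a few lines. The classes below are hypothetical, heavily simplified stand-ins for ScalaReflection's output (the Map cases are analogous with valueContainsNull): a primitive element type such as Int can never hold null, while a boxed java.lang.Integer can.

```java
// Hypothetical stand-in for the nullability rule: primitive element types
// yield containsNull = false, boxed element types yield containsNull = true.
public class NullabilityMapping {
    public static final class ArrayType {
        public final String elementType;
        public final boolean containsNull;
        public ArrayType(String elementType, boolean containsNull) {
            this.elementType = elementType;
            this.containsNull = containsNull;
        }
    }

    public static ArrayType arrayTypeFor(Class<?> elementClass) {
        // int.class models Scala's Int; Integer.class models java.lang.Integer.
        return new ArrayType("IntegerType", !elementClass.isPrimitive());
    }
}
```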
[jira] [Commented] (SPARK-2969) Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull.
[ https://issues.apache.org/jira/browse/SPARK-2969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092671#comment-14092671 ]

Apache Spark commented on SPARK-2969:
-------------------------------------
User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1889


Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull.
------------------------------------------------------------------------------------------
                Key: SPARK-2969
                URL: https://issues.apache.org/jira/browse/SPARK-2969
            Project: Spark
         Issue Type: Improvement
         Components: SQL
           Reporter: Takuya Ueshin

Make {{ScalaReflection}} be able to handle like:
- Seq\[Int] as ArrayType(IntegerType, containsNull = false)
- Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true)
- Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false)
- Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, valueContainsNull = true)
[jira] [Commented] (SPARK-2878) Inconsistent Kryo serialisation with custom Kryo Registrator
[ https://issues.apache.org/jira/browse/SPARK-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092677#comment-14092677 ]

Apache Spark commented on SPARK-2878:
-------------------------------------
User 'GrahamDennis' has created a pull request for this issue:
https://github.com/apache/spark/pull/1890


Inconsistent Kryo serialisation with custom Kryo Registrator
------------------------------------------------------------
                Key: SPARK-2878
                URL: https://issues.apache.org/jira/browse/SPARK-2878
            Project: Spark
         Issue Type: Bug
         Components: Spark Core
   Affects Versions: 1.0.0, 1.0.2
        Environment: Linux RedHat EL 6, 4-node Spark cluster.
           Reporter: Graham Dennis

The custom Kryo Registrator (a class with the org.apache.spark.serializer.KryoRegistrator trait) is not used with every Kryo instance created, and this causes inconsistent serialisation and deserialisation. The Kryo Registrator is sometimes not used because of a ClassNotFound exception that only occurs if it *isn't* the Worker thread (of an Executor) that tries to create the KryoRegistrator.

A complete description of the problem and a project reproducing the problem can be found at https://github.com/GrahamDennis/spark-kryo-serialisation

I have currently only tested this with Spark 1.0.0, but will try to test against 1.0.2.
[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled
[ https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092705#comment-14092705 ]

Kousuke Saruta commented on SPARK-2970:
---------------------------------------
I noticed it's not caused by the reason above. It's caused by the shutdown hook of FileSystem. I have already resolved it by executing a shutdown hook for stopping SparkSQLContext before the shutdown hook for FileSystem.


spark-sql script ends with IOException when EventLogging is enabled
-------------------------------------------------------------------
                Key: SPARK-2970
                URL: https://issues.apache.org/jira/browse/SPARK-2970
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.1.0
        Environment: CDH5.1.0 (Hadoop 2.3.0)
           Reporter: Kousuke Saruta

When spark-sql script run with spark.eventLog.enabled set true, it ends with IOException because FileLogger can not create APPLICATION_COMPLETE file in HDFS. I think it's because FileSystem is closed by HiveSessionImplWithUGI. It has code as follows.

{code}
public void close() throws HiveSQLException {
  try {
    acquire();
    ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi);
    cancelDelegationToken();
  } finally {
    release();
    super.close();
  }
}
{code}

When using Hadoop 2.0+, ShimLoader.getHadoopShims above returns Hadoop23Shim which extends HadoopShimSecure. HadoopShimSecure#closeAllForUGI is implemented as follows.

{code}
@Override
public void closeAllForUGI(UserGroupInformation ugi) {
  try {
    FileSystem.closeAllForUGI(ugi);
  } catch (IOException e) {
    LOG.error("Could not clean up file-system handles for UGI: " + ugi, e);
  }
}
{code}
[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled
[ https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092710#comment-14092710 ]

Apache Spark commented on SPARK-2970:
-------------------------------------
User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/1891


spark-sql script ends with IOException when EventLogging is enabled
-------------------------------------------------------------------
                Key: SPARK-2970
                URL: https://issues.apache.org/jira/browse/SPARK-2970
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.1.0
        Environment: CDH5.1.0 (Hadoop 2.3.0)
           Reporter: Kousuke Saruta

When spark-sql script run with spark.eventLog.enabled set true, it ends with IOException because FileLogger can not create APPLICATION_COMPLETE file in HDFS. I think it's because FileSystem is closed by HiveSessionImplWithUGI. It has code as follows.

{code}
public void close() throws HiveSQLException {
  try {
    acquire();
    ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi);
    cancelDelegationToken();
  } finally {
    release();
    super.close();
  }
}
{code}

When using Hadoop 2.0+, ShimLoader.getHadoopShims above returns Hadoop23Shim which extends HadoopShimSecure. HadoopShimSecure#closeAllForUGI is implemented as follows.

{code}
@Override
public void closeAllForUGI(UserGroupInformation ugi) {
  try {
    FileSystem.closeAllForUGI(ugi);
  } catch (IOException e) {
    LOG.error("Could not clean up file-system handles for UGI: " + ugi, e);
  }
}
{code}
[jira] [Created] (SPARK-2971) Orphaned YARN ApplicationMaster lingers forever
Shay Rojansky created SPARK-2971:
---------------------------------
            Summary: Orphaned YARN ApplicationMaster lingers forever
                Key: SPARK-2971
                URL: https://issues.apache.org/jira/browse/SPARK-2971
            Project: Spark
         Issue Type: Bug
   Affects Versions: 1.0.2
        Environment: Python yarn client mode, Cloudera 5.1.0 on Ubuntu precise
           Reporter: Shay Rojansky

We have cases where if CTRL-C is hit during a Spark job startup, a YARN ApplicationMaster is created but cannot connect to the driver (presumably because the driver has terminated). Once an AM enters this state it never exits it, and has to be manually killed in YARN. Here's an excerpt from the AM logs:

{noformat}
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/yarn/nm/usercache/roji/filecache/40/spark-assembly-1.0.2-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/08/11 16:29:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/08/11 16:29:39 INFO SecurityManager: Changing view acls to: roji
14/08/11 16:29:39 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(roji)
14/08/11 16:29:40 INFO Slf4jLogger: Slf4jLogger started
14/08/11 16:29:40 INFO Remoting: Starting remoting
14/08/11 16:29:40 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075]
14/08/11 16:29:40 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075]
14/08/11 16:29:40 INFO RMProxy: Connecting to ResourceManager at master.grid.eaglerd.local/192.168.41.100:8030
14/08/11 16:29:40 INFO ExecutorLauncher: ApplicationAttemptId: appattempt_1407759736957_0014_01
14/08/11 16:29:40 INFO ExecutorLauncher: Registering the ApplicationMaster
14/08/11 16:29:40 INFO ExecutorLauncher: Waiting for Spark driver to be reachable.
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
{noformat}
[jira] [Commented] (SPARK-2962) Suboptimal scheduling in spark
[ https://issues.apache.org/jira/browse/SPARK-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092807#comment-14092807 ]

Mridul Muralidharan commented on SPARK-2962:
--------------------------------------------
On further investigation:

a) The primary issue is a combination of SPARK-2089 and current scheduler behavior for pendingTasksWithNoPrefs.
SPARK-2089 leads to very bad allocation of nodes - it particularly has an impact on bigger clusters. It leaves a lot of blocks with no data-local or rack-local executors, causing them to end up in pendingTasksWithNoPrefs.
While loading data off dfs, when an executor is being scheduled, even though there might be rack-local schedules available for it (or, on waiting a while, data-local too - see (b) below), because of current scheduler behavior, tasks from pendingTasksWithNoPrefs get scheduled, causing a large number of ANY tasks to be scheduled at the very onset. The combination of these, with the lack of marginal alleviation via (b), is what caused the performance impact.

b) spark.scheduler.minRegisteredExecutorsRatio was not yet being used in the workload - so that might alleviate some of the non-deterministic waiting and ensure adequate executors are allocated!

Thanks [~lirui]


Suboptimal scheduling in spark
------------------------------
                Key: SPARK-2962
                URL: https://issues.apache.org/jira/browse/SPARK-2962
            Project: Spark
         Issue Type: Bug
         Components: Spark Core
   Affects Versions: 1.1.0
        Environment: All
           Reporter: Mridul Muralidharan

In findTask, irrespective of the 'locality' specified, pendingTasksWithNoPrefs are always scheduled with PROCESS_LOCAL. pendingTasksWithNoPrefs contains tasks which currently do not have any alive locations - but which could come in 'later': particularly relevant when the spark app is just coming up and containers are still being added. This causes a large number of non-node-local tasks to be scheduled, incurring significant network transfers in the cluster when running with non-trivial datasets.

The comment "// Look for no-pref tasks after rack-local tasks since they can run anywhere." in the method code is misleading: locality levels start from process_local down to any, so no-pref tasks get scheduled much before rack-local ones.

Also note that currentLocalityIndex is reset to the taskLocality returned by this method - so returning PROCESS_LOCAL as the level will trigger wait times again. (Was relevant before a recent change to the scheduler, and might be again based on the resolution of this issue.)

Found as part of writing a test for SPARK-2931
[jira] [Commented] (SPARK-1777) Pass cached blocks directly to disk if memory is not large enough
[ https://issues.apache.org/jira/browse/SPARK-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092826#comment-14092826 ]

Apache Spark commented on SPARK-1777:
-------------------------------------
User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/1892


Pass cached blocks directly to disk if memory is not large enough
-----------------------------------------------------------------
                Key: SPARK-1777
                URL: https://issues.apache.org/jira/browse/SPARK-1777
            Project: Spark
         Issue Type: Improvement
         Components: Spark Core
           Reporter: Patrick Wendell
           Assignee: Andrew Or
           Priority: Critical
            Fix For: 1.1.0
        Attachments: spark-1777-design-doc.pdf

Currently in Spark we entirely unroll a partition and then check whether it will cause us to exceed the storage limit. This has an obvious problem - if the partition itself is enough to push us over the storage limit (and eventually over the JVM heap), it will cause an OOM. This can happen in cases where a single partition is very large or when someone is running examples locally with a small heap.

https://github.com/apache/spark/blob/f6ff2a61d00d12481bfb211ae13d6992daacdcc2/core/src/main/scala/org/apache/spark/CacheManager.scala#L148

We should think a bit about the most elegant way to fix this - it shares some similarities with the external aggregation code. A simple idea is to periodically check the size of the buffer as we are unrolling and see if we are over the memory limit. If we are we could prepend the existing buffer to the iterator and write that entire thing out to disk.
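The "periodically check the size of the buffer as we are unrolling" idea in the description can be sketched as follows. All types and the fixed per-element size are hypothetical; this is not Spark's actual CacheManager code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch: unroll an iterator into memory, checking the estimated size after
// every element; once over budget, stop and hand off to a disk-backed path.
public class Unroll {
    public static final long BYTES_PER_VALUE = 4; // fake fixed element size

    public static final class Result {
        public final List<Integer> buffered;  // what fit in memory
        public final Iterator<Integer> rest;  // null if fully unrolled, else remainder for the disk path
        Result(List<Integer> buffered, Iterator<Integer> rest) { this.buffered = buffered; this.rest = rest; }
        public boolean spilled() { return rest != null; }
    }

    public static Result unroll(Iterator<Integer> it, long limitBytes) {
        List<Integer> buf = new ArrayList<>();
        while (it.hasNext()) {
            buf.add(it.next());
            // Periodic check: if over the memory budget, stop unrolling; the
            // caller can prepend `buffered` to `rest` and write both to disk
            // instead of risking an OOM.
            if (buf.size() * BYTES_PER_VALUE > limitBytes) return new Result(buf, it);
        }
        return new Result(buf, null);
    }
}
```

A small input fits entirely in memory; a large one returns early with the partial buffer plus the untouched remainder of the iterator, mirroring the "prepend the existing buffer to the iterator" suggestion.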
[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled
[ https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092860#comment-14092860 ]

Cheng Lian commented on SPARK-2970:
-----------------------------------
[~sarutak] Would you mind updating the issue description? Otherwise it can be confusing for people who don't see your comments below. Thanks.


spark-sql script ends with IOException when EventLogging is enabled
-------------------------------------------------------------------
                Key: SPARK-2970
                URL: https://issues.apache.org/jira/browse/SPARK-2970
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.1.0
        Environment: CDH5.1.0 (Hadoop 2.3.0)
           Reporter: Kousuke Saruta

When spark-sql script run with spark.eventLog.enabled set true, it ends with IOException because FileLogger can not create APPLICATION_COMPLETE file in HDFS. I think it's because FileSystem is closed by HiveSessionImplWithUGI. It has code as follows.

{code}
public void close() throws HiveSQLException {
  try {
    acquire();
    ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi);
    cancelDelegationToken();
  } finally {
    release();
    super.close();
  }
}
{code}

When using Hadoop 2.0+, ShimLoader.getHadoopShims above returns Hadoop23Shim which extends HadoopShimSecure. HadoopShimSecure#closeAllForUGI is implemented as follows.

{code}
@Override
public void closeAllForUGI(UserGroupInformation ugi) {
  try {
    FileSystem.closeAllForUGI(ugi);
  } catch (IOException e) {
    LOG.error("Could not clean up file-system handles for UGI: " + ugi, e);
  }
}
{code}
[jira] [Updated] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled
[ https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2970: -- Description: When the spark-sql script runs with spark.eventLog.enabled set to true, it ends with an IOException because FileLogger cannot create the APPLICATION_COMPLETE file in HDFS. It's because a shutdown hook of SparkSQLCLIDriver is executed after a shutdown hook of org.apache.hadoop.fs.FileSystem is executed. When spark.eventLog.enabled is true, the hook of SparkSQLCLIDriver finally tries to create a file to mark the application finished, but the hook of FileSystem tries to close the FileSystem. was: When the spark-sql script runs with spark.eventLog.enabled set to true, it ends with an IOException because FileLogger cannot create the APPLICATION_COMPLETE file in HDFS. I think it's because the FileSystem is closed by HiveSessionImplWithUGI. It has code as follows. {code} public void close() throws HiveSQLException { try { acquire(); ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi); cancelDelegationToken(); } finally { release(); super.close(); } } {code} When using Hadoop 2.0+, ShimLoader.getHadoopShims above returns Hadoop23Shim, which extends HadoopShimSecure. HadoopShimSecure#closeAllForUGI is implemented as follows. {code} @Override public void closeAllForUGI(UserGroupInformation ugi) { try { FileSystem.closeAllForUGI(ugi); } catch (IOException e) { LOG.error("Could not clean up file-system handles for UGI: " + ugi, e); } } {code} spark-sql script ends with IOException when EventLogging is enabled --- Key: SPARK-2970 URL: https://issues.apache.org/jira/browse/SPARK-2970 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Environment: CDH5.1.0 (Hadoop 2.3.0) Reporter: Kousuke Saruta When the spark-sql script runs with spark.eventLog.enabled set to true, it ends with an IOException because FileLogger cannot create the APPLICATION_COMPLETE file in HDFS. 
It's because a shutdown hook of SparkSQLCLIDriver is executed after a shutdown hook of org.apache.hadoop.fs.FileSystem is executed. When spark.eventLog.enabled is true, the hook of SparkSQLCLIDriver finally tries to create a file to mark the application finished, but the hook of FileSystem tries to close the FileSystem. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
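The hook-ordering problem described in the updated description can be illustrated with a self-contained sketch. All names here are hypothetical stand-ins, not Spark or Hadoop classes; the point is only that a hook which closes the filesystem fires before the hook that writes the completion marker, so the write fails with an IOException.

```python
# Illustrative sketch of the shutdown-hook ordering bug (hypothetical classes):
# the filesystem-closing hook runs first, so the hook that writes the
# APPLICATION_COMPLETE marker finds the filesystem already closed.
events = []

class ToyFileSystem:
    def __init__(self):
        self.closed = False
    def create(self, name):
        if self.closed:
            raise IOError("filesystem already closed")
        events.append("created " + name)
    def close(self):
        self.closed = True
        events.append("fs closed")

fs = ToyFileSystem()
# Hook order as in the bug: the FileSystem hook fires before the CLI driver's.
shutdown_hooks = [fs.close, lambda: fs.create("APPLICATION_COMPLETE")]
for hook in shutdown_hooks:
    try:
        hook()
    except IOError as e:
        events.append("IOException: " + str(e))
```

Running the hooks in the opposite order would let the marker file be created before the filesystem closes, which is the essence of any fix here.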
[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled
[ https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092879#comment-14092879 ] Kousuke Saruta commented on SPARK-2970: --- [~liancheng] Thank you for pointing out my mistake. I've modified the description. spark-sql script ends with IOException when EventLogging is enabled --- Key: SPARK-2970 URL: https://issues.apache.org/jira/browse/SPARK-2970 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Environment: CDH5.1.0 (Hadoop 2.3.0) Reporter: Kousuke Saruta When the spark-sql script runs with spark.eventLog.enabled set to true, it ends with an IOException because FileLogger cannot create the APPLICATION_COMPLETE file in HDFS. It's because a shutdown hook of SparkSQLCLIDriver is executed after a shutdown hook of org.apache.hadoop.fs.FileSystem is executed. When spark.eventLog.enabled is true, the hook of SparkSQLCLIDriver finally tries to create a file to mark the application finished, but the hook of FileSystem tries to close the FileSystem. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored
[ https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092881#comment-14092881 ] Thomas Graves commented on SPARK-2089: -- Sandy, just wondering if you have any ETA on a fix for this? With YARN, preferredNodeLocalityData isn't honored --- Key: SPARK-2089 URL: https://issues.apache.org/jira/browse/SPARK-2089 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical When running in YARN cluster mode, apps can pass preferred locality data when constructing a Spark context that will dictate where to request executor containers. This is currently broken because of a race condition. The Spark-YARN code runs the user class and waits for it to start up a SparkContext. During its initialization, the SparkContext will create a YarnClusterScheduler, which notifies a monitor in the Spark-YARN code. The Spark-YARN code then immediately fetches the preferredNodeLocationData from the SparkContext and uses it to start requesting containers. But in the SparkContext constructor that takes the preferredNodeLocationData, setting preferredNodeLocationData comes after the rest of the initialization, so, if the Spark-YARN code comes around quickly enough after being notified, the data that's fetched is the empty unset version. This occurred during all of my runs. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2963) There no documentation about building to use HiveServer and CLI for SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092889#comment-14092889 ] Cheng Lian edited comment on SPARK-2963 at 8/11/14 3:31 PM: Actually [there is|https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#running-the-thrift-jdbc-server], but the Spark CLI part is incomplete. Would you mind updating the issue title and description? Thanks. was (Author: lian cheng): Actually [there is|https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#running-the-thrift-jdbc-server] but the Spark CLI part is incomplete. Would you mind updating the issue title and description? Thanks. There no documentation about building to use HiveServer and CLI for SparkSQL Key: SPARK-2963 URL: https://issues.apache.org/jira/browse/SPARK-2963 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Kousuke Saruta Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use -Phive-thriftserver option when building but it's implicit. I think we need to describe how to build. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2963) The description about building to use HiveServer and CLI is imcomplete
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092894#comment-14092894 ] Kousuke Saruta commented on SPARK-2963: --- I've updated this title and the GitHub one. The description about building to use HiveServer and CLI is imcomplete -- Key: SPARK-2963 URL: https://issues.apache.org/jira/browse/SPARK-2963 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Kousuke Saruta Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use -Phive-thriftserver option when building but it's implicit. I think we need to describe how to build. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2963) The description about building to use HiveServer and CLI is imcomplete
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2963: -- Description: Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use -Phive-thriftserver option when building but it's description is incomplete. (was: Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use -Phive-thriftserver option when building but it's implicit. I think we need to describe how to build.) The description about building to use HiveServer and CLI is imcomplete -- Key: SPARK-2963 URL: https://issues.apache.org/jira/browse/SPARK-2963 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Kousuke Saruta Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use -Phive-thriftserver option when building but it's description is incomplete. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-1297: -- Attachment: spark-1297-v4.txt Patch v4 adds two profiles to examples/pom.xml: hbase-hadoop1 (default) and hbase-hadoop2. I verified that compilation passes with either profile active. Upgrade HBase dependency to 0.98.0 -- Key: SPARK-1297 URL: https://issues.apache.org/jira/browse/SPARK-1297 Project: Spark Issue Type: Task Reporter: Ted Yu Priority: Minor Attachments: spark-1297-v2.txt, spark-1297-v4.txt HBase 0.94.6 was released 11 months ago. Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092967#comment-14092967 ] Sean Owen commented on SPARK-1297: -- I think you may want to open a PR rather than post patches. Code reviews happen on github.com. I see what you did there by triggering one or the other profile with the hbase.profile property. Yeah, that may be the least disruptive way to play this. But don't the profiles need to select the hadoop-compat module appropriate for Hadoop 1 vs Hadoop 2? Upgrade HBase dependency to 0.98.0 -- Key: SPARK-1297 URL: https://issues.apache.org/jira/browse/SPARK-1297 Project: Spark Issue Type: Task Reporter: Ted Yu Priority: Minor Attachments: spark-1297-v2.txt, spark-1297-v4.txt HBase 0.94.6 was released 11 months ago. Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped
Shay Rojansky created SPARK-2972: Summary: APPLICATION_COMPLETE not created in Python unless context explicitly stopped Key: SPARK-2972 URL: https://issues.apache.org/jira/browse/SPARK-2972 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2 Environment: Cloudera 5.1, yarn master on ubuntu precise Reporter: Shay Rojansky If you don't explicitly stop a SparkContext at the end of a Python application with sc.stop(), an APPLICATION_COMPLETE file isn't created and the job doesn't get picked up by the history server. This can be easily reproduced with pyspark (but affects scripts as well). The current workaround is to wrap the entire script with a try/finally and stop manually. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
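The try/finally workaround mentioned in the report looks like this in outline. A stand-in context class is used here so the pattern is runnable without PySpark; in a real script, sc would be a pyspark.SparkContext and sc.stop() is what allows the APPLICATION_COMPLETE marker to be written.

```python
# Sketch of the current workaround using a stand-in context class
# (hypothetical, not the real SparkContext): wrap the job in try/finally so
# stop() always runs, even when the job raises.
class StandInSparkContext:
    """Stand-in for pyspark.SparkContext; only models stop()."""
    def __init__(self):
        self.stopped = False
    def stop(self):
        self.stopped = True

def run_job(sc, job):
    try:
        return job()
    finally:
        sc.stop()  # in real PySpark, stopping the context finalizes the event log
```

The same shape applies to interactive pyspark sessions: without the explicit stop(), the context is torn down without finalizing the event log.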
[jira] [Created] (SPARK-2973) Add a way to show tables without executing a job
Aaron Davidson created SPARK-2973: - Summary: Add a way to show tables without executing a job Key: SPARK-2973 URL: https://issues.apache.org/jira/browse/SPARK-2973 Project: Spark Issue Type: Improvement Components: SQL Reporter: Aaron Davidson Right now, sql("show tables").collect() will start a Spark job which shows up in the UI. There should be a way to get these without this step. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092988#comment-14092988 ] Ted Yu commented on SPARK-1297: --- The HBase client doesn't need to specify a dependency on hbase-hadoop1-compat or hbase-hadoop2-compat. I can open a PR once there is positive feedback on the approach - I came from a project where reviews mostly happen on JIRA :-) Can someone assign this issue to me? Upgrade HBase dependency to 0.98.0 -- Key: SPARK-1297 URL: https://issues.apache.org/jira/browse/SPARK-1297 Project: Spark Issue Type: Task Reporter: Ted Yu Priority: Minor Attachments: spark-1297-v2.txt, spark-1297-v4.txt HBase 0.94.6 was released 11 months ago. Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093012#comment-14093012 ] Ted Yu commented on SPARK-1297: --- https://github.com/apache/spark/pull/1893 Upgrade HBase dependency to 0.98.0 -- Key: SPARK-1297 URL: https://issues.apache.org/jira/browse/SPARK-1297 Project: Spark Issue Type: Task Reporter: Ted Yu Priority: Minor Attachments: spark-1297-v2.txt, spark-1297-v4.txt HBase 0.94.6 was released 11 months ago. Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093018#comment-14093018 ] Apache Spark commented on SPARK-1297: - User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/1893 Upgrade HBase dependency to 0.98.0 -- Key: SPARK-1297 URL: https://issues.apache.org/jira/browse/SPARK-1297 Project: Spark Issue Type: Task Reporter: Ted Yu Priority: Minor Attachments: spark-1297-v2.txt, spark-1297-v4.txt HBase 0.94.6 was released 11 months ago. Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2974) Utils.getLocalDir() may return non-existent spark.local.dir directory
Josh Rosen created SPARK-2974: - Summary: Utils.getLocalDir() may return non-existent spark.local.dir directory Key: SPARK-2974 URL: https://issues.apache.org/jira/browse/SPARK-2974 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Josh Rosen Priority: Blocker The patch for [SPARK-2324] modified Spark to ignore a certain number of invalid local directories. Unfortunately, the {{Utils.getLocalDir()}} method returns the _first_ local directory from {{spark.local.dir}}, which might not exist. This can lead to confusing FileNotFound errors when executors attempt to fetch files. (I commented on this at https://github.com/apache/spark/pull/1274#issuecomment-51537965, but I'm opening a JIRA so we don't forget to fix it). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
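One plausible shape for a fix, sketched as a hypothetical helper rather than the real Utils.getLocalDir: return the first configured directory that actually exists, instead of the first spark.local.dir entry unconditionally.

```python
# Hypothetical helper (not the actual Utils.getLocalDir): skip configured
# local directories that do not exist rather than returning the first entry
# blindly, which is what leads to the confusing FileNotFound errors.
import os

def first_usable_local_dir(configured_dirs):
    for d in configured_dirs:
        if os.path.isdir(d):
            return d
    raise IOError("no usable local directory among: " + ", ".join(configured_dirs))
```

Whether the real fix should filter at lookup time like this or validate directories once at startup is exactly the kind of design question the ticket leaves open.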
[jira] [Updated] (SPARK-2717) BasicBlockFetchIterator#next should log when it gets stuck
[ https://issues.apache.org/jira/browse/SPARK-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2717: --- Priority: Major (was: Blocker) BasicBlockFetchIterator#next should log when it gets stuck -- Key: SPARK-2717 URL: https://issues.apache.org/jira/browse/SPARK-2717 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Josh Rosen If this is stuck for a long time waiting for blocks, we should log what nodes it is waiting for to help debugging. One way to do this is to call take() with a timeout (e.g. 60 seconds) and when the timeout expires log a message for the blocks it is still waiting for. This could all happen in a loop so that the wait just restarts after the message is logged. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
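The take()-with-timeout idea can be sketched as follows. This is an illustrative Python sketch, not the Scala BasicBlockFetchIterator; a queue.Queue stands in for the stream of fetched block results, and pending_blocks is an assumed bookkeeping set of block IDs still outstanding.

```python
# Illustrative sketch of the proposal (not Spark code): wait with a timeout,
# and each time the timeout expires log which blocks are still outstanding,
# then restart the wait, as the issue suggests.
import queue

def next_with_progress_logging(results, pending_blocks, timeout_s=60, log=print):
    while True:
        try:
            return results.get(timeout=timeout_s)
        except queue.Empty:
            log("Still waiting for blocks: %s" % sorted(pending_blocks))
```

The loop never gives up; it only surfaces which nodes/blocks it is stuck on, which is the debugging aid the issue asks for.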
[jira] [Updated] (SPARK-2717) BasicBlockFetchIterator#next should log when it gets stuck
[ https://issues.apache.org/jira/browse/SPARK-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2717: --- Priority: Critical (was: Major) BasicBlockFetchIterator#next should log when it gets stuck -- Key: SPARK-2717 URL: https://issues.apache.org/jira/browse/SPARK-2717 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Josh Rosen Priority: Critical If this is stuck for a long time waiting for blocks, we should log what nodes it is waiting for to help debugging. One way to do this is to call take() with a timeout (e.g. 60 seconds) and when the timeout expires log a message for the blocks it is still waiting for. This could all happen in a loop so that the wait just restarts after the message is logged. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2931: --- Target Version/s: 1.1.0 getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException --- Key: SPARK-2931 URL: https://issues.apache.org/jira/browse/SPARK-2931 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf benchmark Reporter: Josh Rosen Priority: Blocker Attachments: scala-sort-by-key.err, test.patch When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, I get the following errors (one per task): {code} 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 bytes) 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036] with ID 0 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1 java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475) at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This causes the job to hang. I can deterministically reproduce this by re-running the test, either in isolation or as part of the full performance testing suite. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2931: --- Fix Version/s: (was: 1.1.0) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException --- Key: SPARK-2931 URL: https://issues.apache.org/jira/browse/SPARK-2931 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf benchmark Reporter: Josh Rosen Priority: Blocker Attachments: scala-sort-by-key.err, test.patch When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, I get the following errors (one per task): {code} 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 bytes) 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036] with ID 0 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1 java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475) at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This causes the job to hang. I can deterministically reproduce this by re-running the test, either in isolation or as part of the full performance testing suite. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2963) The description about building to use HiveServer and CLI is incomplete
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2963: -- Summary: The description about building to use HiveServer and CLI is incomplete (was: The description about building to use HiveServer and CLI is imcomplete) The description about building to use HiveServer and CLI is incomplete -- Key: SPARK-2963 URL: https://issues.apache.org/jira/browse/SPARK-2963 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Kousuke Saruta Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use -Phive-thriftserver option when building but it's description is incomplete. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2976) There are too many tabs in some source files
Kousuke Saruta created SPARK-2976: - Summary: There are too many tabs in some source files Key: SPARK-2976 URL: https://issues.apache.org/jira/browse/SPARK-2976 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Minor Currently, there are too many tabs in some source files, which does not conform to the coding style. I saw that the following 3 files have tabs. * sorttable.js * JavaPageRank.java * JavaKinesisWordCountASL.java -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
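A quick way to find such files is a small script like the one below. This is a hypothetical helper, not part of Spark's build tooling, and the extension list is an assumption.

```python
# Hypothetical helper (not Spark tooling): list source files under a root
# directory that contain hard tab characters, i.e. the check this issue
# implies should be run against files like sorttable.js and JavaPageRank.java.
import os

def files_with_tabs(root, exts=(".js", ".java", ".scala", ".py")):
    offenders = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            if name.endswith(exts):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    if "\t" in f.read():
                        offenders.append(path)
    return sorted(offenders)
```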
[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093119#comment-14093119 ] Yin Huai commented on SPARK-2890: - What are the semantics when you have columns with the same names? Spark SQL should allow SELECT with duplicated columns - Key: SPARK-2890 URL: https://issues.apache.org/jira/browse/SPARK-2890 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang Spark reported java.lang.IllegalArgumentException with the message: java.lang.IllegalArgumentException: requirement failed: Found fields with the same name. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.catalyst.types.StructType.init(dataTypes.scala:317) at org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306) at org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83) at org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433) After trial and error, it seems it's caused by duplicated columns in my select clause. I made the duplication on purpose for my code to parse correctly. I think we should allow users to specify duplicated columns as return values. Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2790) PySpark zip() doesn't work properly if RDDs have different serializers
[ https://issues.apache.org/jira/browse/SPARK-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093133#comment-14093133 ] Apache Spark commented on SPARK-2790: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/1894 PySpark zip() doesn't work properly if RDDs have different serializers -- Key: SPARK-2790 URL: https://issues.apache.org/jira/browse/SPARK-2790 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.1.0 Reporter: Josh Rosen Assignee: Davies Liu Priority: Critical In PySpark, attempting to {{zip()}} two RDDs may fail if the RDDs have different serializers (e.g. batched vs. unbatched), even if those RDDs have the same number of partitions and same numbers of elements. This problem occurs in the MLlib Python APIs, where we might want to zip a JavaRDD of LabelledPoints with a JavaRDD of batch-serialized Python objects. This is problematic because whether zip() succeeds or errors depends on the partitioning / batching strategy, and we don't want to surface the serialization details to users. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1284) pyspark hangs after IOError on Executor
[ https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093137#comment-14093137 ] Davies Liu commented on SPARK-1284: --- [~jblomo], could you reproduce this on master or the 1.1 branch? Maybe pyspark did not hang after this error message; the take() had finished successfully before the error message popped up. The noisy error messages were fixed in PR https://github.com/apache/spark/pull/1625 pyspark hangs after IOError on Executor --- Key: SPARK-1284 URL: https://issues.apache.org/jira/browse/SPARK-1284 Project: Spark Issue Type: Bug Components: PySpark Reporter: Jim Blomo Assignee: Davies Liu When running a reduceByKey over a cached RDD, Python fails with an exception, but the failure is not detected by the task runner. Spark and the pyspark shell hang waiting for the task to finish. The error is: {code} PySpark worker failed with exception: Traceback (most recent call last): File "/home/hadoop/spark/python/pyspark/worker.py", line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/home/hadoop/spark/python/pyspark/serializers.py", line 182, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/home/hadoop/spark/python/pyspark/serializers.py", line 118, in dump_stream self._write_with_length(obj, stream) File "/home/hadoop/spark/python/pyspark/serializers.py", line 130, in _write_with_length stream.write(serialized) IOError: [Errno 104] Connection reset by peer 14/03/19 22:48:15 INFO scheduler.TaskSetManager: Serialized task 4.0:0 as 4257 bytes in 47 ms Traceback (most recent call last): File "/home/hadoop/spark/python/pyspark/daemon.py", line 117, in launch_worker worker(listen_sock) File "/home/hadoop/spark/python/pyspark/daemon.py", line 107, in worker outfile.flush() IOError: [Errno 32] Broken pipe {code} I can reproduce the error by running take(10) on the cached RDD before running reduceByKey (which looks at the whole input 
file). Affects Version 1.0.0-SNAPSHOT (4d88030486)
[jira] [Commented] (SPARK-2700) Hidden files (such as .impala_insert_staging) should be filtered out by sqlContext.parquetFile
[ https://issues.apache.org/jira/browse/SPARK-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093150#comment-14093150 ] Yin Huai commented on SPARK-2700: - Can we resolve it? Hidden files (such as .impala_insert_staging) should be filtered out by sqlContext.parquetFile -- Key: SPARK-2700 URL: https://issues.apache.org/jira/browse/SPARK-2700 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.0.1 Reporter: Teng Qiu Fix For: 1.1.0 When creating a table in Impala, a hidden folder .impala_insert_staging is created inside the table's folder. If we load such a table using the Spark SQL API sqlContext.parquetFile, this hidden folder causes trouble: Spark tries to read metadata from it, and you will see the exception: {code:borderStyle=solid} Caused by: java.io.IOException: Could not read footer for file FileStatus{path=hdfs://xxx:8020/user/hive/warehouse/parquet_strings/.impala_insert_staging; isDirectory=true; modification_time=1406333729252; access_time=0; owner=hdfs; group=hdfs; permission=rwxr-xr-x; isSymlink=false} ... ... Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is not a file: /user/hive/warehouse/parquet_strings/.impala_insert_staging {code} The Impala side does not consider this their problem: https://issues.cloudera.org/browse/IMPALA-837 (IMPALA-837 Delete .impala_insert_staging directory after INSERT), so we should probably filter out these hidden folders/files when reading Parquet tables.
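The requested fix amounts to a path filter applied before collecting Parquet footers. A minimal sketch of such a filter, assuming the common Hadoop convention that names starting with '.' or '_' are metadata/staging artifacts (this is an illustration, not Spark's actual implementation):

```python
def is_hidden(path):
    # Assumed convention: file or directory names beginning with '.' or '_'
    # (e.g. .impala_insert_staging, _SUCCESS) are metadata/staging artifacts
    # and should be skipped when scanning a Parquet table directory.
    name = path.rstrip("/").rsplit("/", 1)[-1]
    return name.startswith(".") or name.startswith("_")

paths = [
    "/user/hive/warehouse/parquet_strings/part-00000.parquet",
    "/user/hive/warehouse/parquet_strings/.impala_insert_staging",
    "/user/hive/warehouse/parquet_strings/_SUCCESS",
]
# Only real data files survive the filter.
data_files = [p for p in paths if not is_hidden(p)]
```

With this filter in place, the footer reader would never touch the staging directory that triggers the FileNotFoundException above.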
[jira] [Resolved] (SPARK-2948) PySpark doesn't work on Python 2.6
[ https://issues.apache.org/jira/browse/SPARK-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2948. --- Resolution: Fixed Fix Version/s: 1.1.0 PySpark doesn't work on Python 2.6 -- Key: SPARK-2948 URL: https://issues.apache.org/jira/browse/SPARK-2948 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: CentOS 6.5 / Python 2.6.6 Reporter: Kousuke Saruta Assignee: Josh Rosen Priority: Blocker Fix For: 1.1.0 In serializers.py, collections.namedtuple is redefined as follows. {code} def namedtuple(name, fields, verbose=False, rename=False): cls = _old_namedtuple(name, fields, verbose, rename) return _hack_namedtuple(cls) {code} The redefinition takes 4 arguments, but namedtuple in Python 2.6 takes only 3, so a signature mismatch occurs.
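One version-agnostic way to write such a wrapper is to forward arguments instead of spelling out the signature. A sketch of that idea (the `_hack_namedtuple` body here is a placeholder for PySpark's real pickling hook, not the actual fix):

```python
import collections

_old_namedtuple = collections.namedtuple

def _hack_namedtuple(cls):
    # Stand-in for PySpark's real hook, which patches the class so that
    # instances survive pickling; here it simply returns the class.
    return cls

def namedtuple(*args, **kwargs):
    # Forwarding *args/**kwargs instead of hard-coding (name, fields,
    # verbose, rename) keeps the wrapper valid on any Python version,
    # including 2.6, whose namedtuple lacked the `rename` parameter.
    cls = _old_namedtuple(*args, **kwargs)
    return _hack_namedtuple(cls)

Point = namedtuple("Point", ["x", "y"])
```

The wrapper never mentions parameters the underlying implementation might not have, so the arity mismatch described in the report cannot occur.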
[jira] [Resolved] (SPARK-2954) PySpark MLlib serialization tests fail on Python 2.6
[ https://issues.apache.org/jira/browse/SPARK-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2954. --- Resolution: Fixed Fix Version/s: 1.1.0 PySpark MLlib serialization tests fail on Python 2.6 Key: SPARK-2954 URL: https://issues.apache.org/jira/browse/SPARK-2954 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Josh Rosen Assignee: Josh Rosen Fix For: 1.1.0 The PySpark MLlib tests currently fail on Python 2.6 due to problems unpacking data from bytearray using struct.unpack: {code} ** File pyspark/mllib/_common.py, line 181, in __main__._deserialize_double Failed example: _deserialize_double(_serialize_double(1L)) == 1.0 Exception raised: Traceback (most recent call last): File /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py, line 1253, in __run compileflags, 1) in test.globs File doctest __main__._deserialize_double[4], line 1, in module _deserialize_double(_serialize_double(1L)) == 1.0 File pyspark/mllib/_common.py, line 194, in _deserialize_double return struct.unpack(d, ba[offset:])[0] error: unpack requires a string argument of length 8 ** File pyspark/mllib/_common.py, line 184, in __main__._deserialize_double Failed example: _deserialize_double(_serialize_double(sys.float_info.max)) == x Exception raised: Traceback (most recent call last): File /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py, line 1253, in __run compileflags, 1) in test.globs File doctest __main__._deserialize_double[6], line 1, in module _deserialize_double(_serialize_double(sys.float_info.max)) == x File pyspark/mllib/_common.py, line 194, in _deserialize_double return struct.unpack(d, ba[offset:])[0] error: unpack requires a string argument of length 8 ** File pyspark/mllib/_common.py, line 187, in __main__._deserialize_double Failed example: _deserialize_double(_serialize_double(sys.float_info.max)) == y Exception raised: Traceback 
(most recent call last): File /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py, line 1253, in __run compileflags, 1) in test.globs File doctest __main__._deserialize_double[8], line 1, in module _deserialize_double(_serialize_double(sys.float_info.max)) == y File pyspark/mllib/_common.py, line 194, in _deserialize_double return struct.unpack(d, ba[offset:])[0] error: unpack requires a string argument of length 8 ** {code} It looks like one solution is to wrap the {{bytearray}} with {{buffer()}}: http://stackoverflow.com/a/15467046/590203 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2977) Fix handling of short shuffle manager names in ShuffleBlockManager
Josh Rosen created SPARK-2977: - Summary: Fix handling of short shuffle manager names in ShuffleBlockManager Key: SPARK-2977 URL: https://issues.apache.org/jira/browse/SPARK-2977 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Josh Rosen Since we allow short names for {{spark.shuffle.manager}}, all code that reads that configuration property should be prepared to handle the short names. See my comment at https://github.com/apache/spark/pull/1799#discussion_r16029607 (opening this as a JIRA so we don't forget to fix it).
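The pattern the issue asks for is a single resolver that every reader of the property goes through. A sketch of that pattern (the alias table mirrors Spark 1.1's hash/sort shuffle managers, but this resolver is illustrative, not Spark's actual code):

```python
# Hypothetical alias table for spark.shuffle.manager; short names map to
# fully qualified class names (values taken from Spark 1.1's two built-in
# shuffle managers, used here only as an example).
SHUFFLE_MANAGER_ALIASES = {
    "hash": "org.apache.spark.shuffle.hash.HashShuffleManager",
    "sort": "org.apache.spark.shuffle.sort.SortShuffleManager",
}

def resolve_shuffle_manager(name):
    # Accept either a short alias (case-insensitive) or an already
    # fully qualified class name, which is passed through unchanged.
    return SHUFFLE_MANAGER_ALIASES.get(name.lower(), name)
```

Routing all reads of the property through one function like this prevents the bug class described above, where some call sites handle only the long form.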
[jira] [Resolved] (SPARK-2101) Python unit tests fail on Python 2.6 because of lack of unittest.skipIf()
[ https://issues.apache.org/jira/browse/SPARK-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2101. --- Resolution: Fixed Fix Version/s: 1.1.0 Python unit tests fail on Python 2.6 because of lack of unittest.skipIf() - Key: SPARK-2101 URL: https://issues.apache.org/jira/browse/SPARK-2101 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Uri Laserson Assignee: Josh Rosen Fix For: 1.1.0 PySpark tests fail with Python 2.6 because they currently depend on {{unittest.skipIf}}, which was only introduced in Python 2.7.
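A common shape for this kind of fix is a try/except shim that uses the real decorator when present and a fallback otherwise. A minimal sketch (the fallback is an assumed, simplified stand-in; a real backport would mark tests as skipped in the result rather than silently no-op):

```python
import sys
import unittest

try:
    skipIf = unittest.skipIf          # available from Python 2.7 onward
except AttributeError:
    # Simplified fallback with the same call shape for Python 2.6;
    # it replaces the test with a no-op instead of recording a skip.
    def skipIf(condition, reason):
        def decorator(func):
            if not condition:
                return func
            def skipped(*args, **kwargs):
                return None
            skipped.__name__ = func.__name__
            return skipped
        return decorator

class ExampleTest(unittest.TestCase):
    @skipIf(sys.version_info < (2, 7), "requires Python 2.7+")
    def test_addition(self):
        self.assertEqual(1 + 1, 2)
```

Test modules written against this shim import `skipIf` from one place, so they run unchanged on interpreters with or without the real decorator.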
[jira] [Commented] (SPARK-2976) There are too many tabs in some source files
[ https://issues.apache.org/jira/browse/SPARK-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093175#comment-14093175 ] Apache Spark commented on SPARK-2976: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/1895 There are too many tabs in some source files Key: SPARK-2976 URL: https://issues.apache.org/jira/browse/SPARK-2976 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Minor Currently, some source files contain tab characters, which does not conform to the coding style. I saw that the following 3 files have tabs. * sorttable.js * JavaPageRank.java * JavaKinesisWordCountASL.java
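A check like the one behind this report is easy to script. A tiny sketch of the detection step (a helper of my own, not part of Spark's style tooling):

```python
def lines_with_tabs(source):
    # Return the 1-based line numbers that contain a literal tab
    # character -- the check a reviewer or lint script would run on
    # files like sorttable.js or JavaPageRank.java.
    return [i for i, line in enumerate(source.splitlines(), 1)
            if "\t" in line]
```

Running this over a file and failing the build when the list is non-empty keeps such tabs from reappearing.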
[jira] [Updated] (SPARK-2420) Dependency changes for compatibility with Hive
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brock Noland updated SPARK-2420: Labels: Hive (was: ) Dependency changes for compatibility with Hive -- Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Labels: Hive Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because the Spark build contains library versions vastly different from those of the current major Hadoop version. It would be nice if we could choose versions in line with Hadoop's, or shade them in the assembly. Here is the wish list: 1. Upgrade the protobuf version from the current 2.4.1 to 2.5.0. 2. Shade Spark's jetty and servlet dependencies in the assembly. 3. Resolve the guava version difference; Spark uses a higher version, and I'm not sure what the best solution is. The list may grow as HIVE-7292 proceeds. For information only, the attachment is a patch that we applied to Spark in order to make it work with Hive. It gives an idea of the scope of the changes.
[jira] [Commented] (SPARK-1284) pyspark hangs after IOError on Executor
[ https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093219#comment-14093219 ] Jim Blomo commented on SPARK-1284: -- I will try to reproduce on the 1.1 branch later this week, thanks for the update! pyspark hangs after IOError on Executor --- Key: SPARK-1284 URL: https://issues.apache.org/jira/browse/SPARK-1284 Project: Spark Issue Type: Bug Components: PySpark Reporter: Jim Blomo Assignee: Davies Liu When running a reduceByKey over a cached RDD, Python fails with an exception, but the failure is not detected by the task runner. Spark and the pyspark shell hang waiting for the task to finish. The error is: {code} PySpark worker failed with exception: Traceback (most recent call last): File /home/hadoop/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /home/hadoop/spark/python/pyspark/serializers.py, line 182, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /home/hadoop/spark/python/pyspark/serializers.py, line 118, in dump_stream self._write_with_length(obj, stream) File /home/hadoop/spark/python/pyspark/serializers.py, line 130, in _write_with_length stream.write(serialized) IOError: [Errno 104] Connection reset by peer 14/03/19 22:48:15 INFO scheduler.TaskSetManager: Serialized task 4.0:0 as 4257 bytes in 47 ms Traceback (most recent call last): File /home/hadoop/spark/python/pyspark/daemon.py, line 117, in launch_worker worker(listen_sock) File /home/hadoop/spark/python/pyspark/daemon.py, line 107, in worker outfile.flush() IOError: [Errno 32] Broken pipe {code} I can reproduce the error by running take(10) on the cached RDD before running reduceByKey (which looks at the whole input file). 
Affects Version 1.0.0-SNAPSHOT (4d88030486)
[jira] [Resolved] (SPARK-2891) Daemon failed to launch worker
[ https://issues.apache.org/jira/browse/SPARK-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-2891. --- Resolution: Duplicate Fix Version/s: 1.1.0 duplicated to 2898 Daemon failed to launch worker -- Key: SPARK-2891 URL: https://issues.apache.org/jira/browse/SPARK-2891 Project: Spark Issue Type: Bug Components: PySpark Reporter: Davies Liu Priority: Critical Fix For: 1.1.0 daviesliu@dm:~/work/spark-perf$ /Users/daviesliu/work/spark/bin/spark-submit --master spark://dm:7077 pyspark-tests/tests.py SchedulerThroughputTest --num-tasks=1 --num-trials=4 --inter-trial-wait=1 14/08/06 17:58:04 WARN JettyUtils: Failed to create UI on port 4040. Trying again on port 4041. - Failure(java.net.BindException: Address already in use) Daemon failed to fork PySpark worker: [Errno 35] Resource temporarily unavailable 14/08/06 17:59:25 ERROR Executor: Exception in task 9777.0 in stage 1.0 (TID 19777) java.lang.IllegalStateException: Python daemon failed to launch worker at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71) at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Daemon failed to fork PySpark worker: [Errno 35] Resource temporarily unavailable 14/08/06 17:59:25 ERROR Executor: Exception in task 9781.0 in stage 1.0 (TID 19781) java.lang.IllegalStateException: Python daemon failed to launch worker at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71) at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/08/06 17:59:25 WARN TaskSetManager: Lost task 9777.0 in stage 1.0 (TID 19777, localhost): java.lang.IllegalStateException: Python daemon failed to launch worker org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71) org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83) org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82) org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55) org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101) 
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[jira] [Commented] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093295#comment-14093295 ] Apache Spark commented on SPARK-2931: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/1896 getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException --- Key: SPARK-2931 URL: https://issues.apache.org/jira/browse/SPARK-2931 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf benchmark Reporter: Josh Rosen Priority: Blocker Attachments: scala-sort-by-key.err, test.patch When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, I get the following errors (one per task): {code} 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 bytes) 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036] with ID 0 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1 java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475) at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254) at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This causes the job to hang. I can deterministically reproduce this by re-running the test, either in isolation or as part of the full performance testing suite. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1065) PySpark runs out of memory with large broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093413#comment-14093413 ] Vlad Frolov commented on SPARK-1065: I am facing the same issue in my project, where I use PySpark. Since the big objects I have could easily fit into the nodes' memory, as a workaround I am going to save them into HDFS and load them on the Python nodes. Does anybody have an idea how to fix the issue in a better way? I have neither enough Scala nor Java knowledge to fix this in Spark core. However, I feel broadcast variables could be reimplemented on the Python side, though that seems a bit dangerous because we don't want separate implementations of one thing in both languages. That would also save memory, because when we broadcast through Scala we have 1 copy in the JVM, 1 pickled copy in Python, and 1 constructed object copy in Python. PySpark runs out of memory with large broadcast variables - Key: SPARK-1065 URL: https://issues.apache.org/jira/browse/SPARK-1065 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.7.3, 0.8.1, 0.9.0 Reporter: Josh Rosen PySpark's driver components may run out of memory when broadcasting large variables (say 1 gigabyte). Because PySpark's broadcast is implemented on top of Java Spark's broadcast by broadcasting a pickled Python object as a byte array, we may be retaining multiple copies of the large object: a pickled copy in the JVM and a deserialized copy in the Python driver. The problem could also be due to memory requirements during pickling. PySpark is also affected by broadcast variables not being garbage collected. Adding an unpersist() method to broadcast variables may fix this: https://github.com/apache/incubator-spark/pull/543. As a first step to fixing this, we should write a failing test to reproduce the error.
This was discovered by [~sandy]: [trouble with broadcast variables on pyspark|http://apache-spark-user-list.1001560.n3.nabble.com/trouble-with-broadcast-variables-on-pyspark-tp1301.html].
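The workaround the commenter describes can be sketched in plain Python: persist the large object once to shared storage and let each worker load it on demand, instead of broadcasting it. In this sketch a local temp directory stands in for HDFS, and `save_shared`/`load_shared` are illustrative helpers, not PySpark API:

```python
import os
import pickle
import tempfile

def save_shared(obj, path):
    # Write one pickled copy to shared storage (HDFS in the commenter's
    # setup; a temp directory here) instead of broadcasting it.
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_shared(path):
    # Each worker deserializes on demand, avoiding the extra pickled
    # copies a broadcast keeps alive in the JVM and the Python driver.
    with open(path, "rb") as f:
        return pickle.load(f)

big_object = {"weights": list(range(10000))}
shared_path = os.path.join(tempfile.mkdtemp(), "big_object.pkl")
save_shared(big_object, shared_path)
```

This trades broadcast's one-time distribution for per-worker reads from shared storage, which is acceptable when the object fits in node memory but not in the driver's broadcast path.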
[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093466#comment-14093466 ] Ted Yu commented on SPARK-1297: --- w.r.t. the build: by default, hbase-hadoop1 would be used. If a user specifies any of the hadoop-2 profiles, hbase-hadoop2 should be specified as well. Upgrade HBase dependency to 0.98.0 -- Key: SPARK-1297 URL: https://issues.apache.org/jira/browse/SPARK-1297 Project: Spark Issue Type: Task Reporter: Ted Yu Priority: Minor Attachments: spark-1297-v2.txt, spark-1297-v4.txt HBase 0.94.6 was released 11 months ago. Upgrade HBase dependency to 0.98.0
[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093468#comment-14093468 ] Sean Owen commented on SPARK-1297: -- Yes, I think you'd need to reflect that in changes to the build instructions. They are under docs/. Upgrade HBase dependency to 0.98.0 -- Key: SPARK-1297 URL: https://issues.apache.org/jira/browse/SPARK-1297 Project: Spark Issue Type: Task Reporter: Ted Yu Priority: Minor Attachments: spark-1297-v2.txt, spark-1297-v4.txt HBase 0.94.6 was released 11 months ago. Upgrade HBase dependency to 0.98.0
[jira] [Updated] (SPARK-2975) SPARK_LOCAL_DIRS may cause problems when running in local mode
[ https://issues.apache.org/jira/browse/SPARK-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2975: -- Priority: Critical (was: Minor) I'm raising the priority of this issue to 'critical', since it causes problems when running on a cluster if some tasks are small enough to be run locally on the driver. Here's an example exception: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in stage 0.0 failed 1 times, most recent failure: Lost task 21.0 in stage 0.0 (TID 21, localhost): java.io.IOException: No such file or directory java.io.UnixFileSystem.createFileExclusively(Native Method) java.io.File.createNewFile(File.java:1006) java.io.File.createTempFile(File.java:1989) org.apache.spark.util.Utils$.fetchFile(Utils.scala:335) org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:342) org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:340) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) scala.collection.mutable.HashMap.foreach(HashMap.scala:98) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:340) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1153) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1142) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1141) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1141) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:682) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1359) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} SPARK_LOCAL_DIRS may cause problems when running in local mode -- Key: SPARK-2975 URL: https://issues.apache.org/jira/browse/SPARK-2975 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Josh Rosen Priority: Critical If we're running Spark in local mode and {{SPARK_LOCAL_DIRS}} is set, the {{Executor}} 
modifies SparkConf so that this value overrides {{spark.local.dir}}. Normally, this is safe because the modification takes place before SparkEnv is created. In local mode, the Executor uses an existing SparkEnv rather than creating a new one, so it winds up with a DiskBlockManager that created local directories with the original {{spark.local.dir}} setting, but other components attempt to use directories specified in the _new_ {{spark.local.dir}},
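The override order at the heart of this bug can be sketched as a small resolver (illustrative only, not Spark's actual code):

```python
def effective_local_dirs(conf, env):
    # Precedence described above: SPARK_LOCAL_DIRS, when set in the
    # environment, overrides spark.local.dir from the SparkConf. The
    # reported bug is that in local mode the DiskBlockManager was
    # already created from the old spark.local.dir value before this
    # override took effect, so components disagree on the directories.
    if "SPARK_LOCAL_DIRS" in env:
        return env["SPARK_LOCAL_DIRS"].split(",")
    return conf.get("spark.local.dir", "/tmp").split(",")
```

The failure mode is not the resolver itself but that two components consult it at different times, one before and one after the environment override is applied.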
[jira] [Created] (SPARK-2978) Provide an MR-style shuffle transformation
Sandy Ryza created SPARK-2978: - Summary: Provide an MR-style shuffle transformation Key: SPARK-2978 URL: https://issues.apache.org/jira/browse/SPARK-2978 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Sandy Ryza For Hive on Spark in particular, and running legacy MR code in general, I think it would be useful to provide an MR-style shuffle transformation, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle * Allow groupByKey to take an ordering param for keys within a partition
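The proposed semantics can be stated precisely with a pure-Python model of a single partition (the function name borrows one of the candidate operator names above; this is a semantic sketch, not a proposed implementation):

```python
from itertools import groupby

def group_and_sort_within_partition(partition):
    # Within one partition, group values by key and return
    # (key, values) pairs with keys in sorted order -- the shape of a
    # Hadoop MR reducer's input.
    ordered = sorted(partition, key=lambda kv: kv[0])
    return [(k, [v for _, v in grp])
            for k, grp in groupby(ordered, key=lambda kv: kv[0])]
```

In real Spark the grouping would happen per partition after the shuffle, with values exposed as an Iterator rather than a materialized list, but the observable ordering contract is the one modeled here.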
[jira] [Updated] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-2978: -- Description: For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide an MR-style shuffle transformation, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle * Allow groupByKey to take an ordering param for keys within a partition was: For Hive on Spark in particular, and running legacy MR code in general, I think it would be useful to provide an MR-style shuffle transformation, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle * Allow groupByKey to take an ordering param for keys within a partition Provide an MR-style shuffle transformation -- Key: SPARK-2978 URL: https://issues.apache.org/jira/browse/SPARK-2978 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Sandy Ryza For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide an MR-style shuffle transformation, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. 
groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle * Allow groupByKey to take an ordering param for keys within a partition
[jira] [Updated] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-2978: -- Description: For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe? * Allow groupByKey to take an ordering param for keys within a partition was: For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle * Allow groupByKey to take an ordering param for keys within a partition Provide an MR-style shuffle transformation -- Key: SPARK-2978 URL: https://issues.apache.org/jira/browse/SPARK-2978 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Sandy Ryza For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe? 
* Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-2978: -- Description: For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle * Allow groupByKey to take an ordering param for keys within a partition was: For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide an MR-style shuffle transformation, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle * Allow groupByKey to take an ordering param for keys within a partition Provide an MR-style shuffle transformation -- Key: SPARK-2978 URL: https://issues.apache.org/jira/browse/SPARK-2978 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Sandy Ryza For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. 
groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle * Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
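The semantics being proposed (group by key, with keys in sorted order within each partition) can be illustrated outside Spark. Below is a minimal plain-Python sketch of what a hypothetical `groupByKeyAndSortWithinPartition`-style operator would provide per partition; the function name and hash-partitioning choice are assumptions for illustration, not Spark API.

```python
from itertools import groupby

def mr_style_shuffle(records, num_parts):
    """Simulate the proposed MR-style shuffle: hash-partition (key, value)
    pairs, then within each partition yield (Key, list-of-Values) groups
    with keys in sorted order. Hypothetical name, not a Spark API."""
    partitions = [[] for _ in range(num_parts)]
    for k, v in records:
        partitions[hash(k) % num_parts].append((k, v))
    shuffled = []
    for part in partitions:
        part.sort(key=lambda kv: kv[0])  # keys sorted within the partition
        shuffled.append([(k, [v for _, v in grp])  # (Key, Iterator[Value])
                         for k, grp in groupby(part, key=lambda kv: kv[0])])
    return shuffled
```

With a single partition, `mr_style_shuffle([("b", 1), ("a", 2), ("b", 3)], 1)` yields one partition containing `("a", [2])` before `("b", [1, 3])`, i.e. both the grouping and the in-partition key ordering that Hadoop MR reducers rely on.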
[jira] [Created] (SPARK-2979) Improve the convergence rate by minimize the condition number in LOR with LBFGS
DB Tsai created SPARK-2979: -- Summary: Improve the convergence rate by minimize the condition number in LOR with LBFGS Key: SPARK-2979 URL: https://issues.apache.org/jira/browse/SPARK-2979 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai Scaling to minimize the condition number: During the optimization process, the convergence (rate) depends on the condition number of the training dataset. Scaling the variables often reduces this condition number, thus improving the convergence rate dramatically. Without reducing the condition number, some training datasets mixing columns with different scales may not be able to converge. GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return the weights in the original scale. See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf Here, if useFeatureScaling is enabled, we will standardize the training features by dividing by the variance of each column (without subtracting the mean), and train the model in the scaled space. Then we transform the coefficients from the scaled space to the original scale as GLMNET and LIBSVM do. Currently, it's only enabled in LogisticRegressionWithLBFGS -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2979) Improve the convergence rate by minimize the condition number in LOR with LBFGS
[ https://issues.apache.org/jira/browse/SPARK-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093604#comment-14093604 ] Apache Spark commented on SPARK-2979: - User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/1897 Improve the convergence rate by minimize the condition number in LOR with LBFGS --- Key: SPARK-2979 URL: https://issues.apache.org/jira/browse/SPARK-2979 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai Scaling to minimize the condition number: During the optimization process, the convergence (rate) depends on the condition number of the training dataset. Scaling the variables often reduces this condition number, thus improving the convergence rate dramatically. Without reducing the condition number, some training datasets mixing columns with different scales may not be able to converge. GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return the weights in the original scale. See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf Here, if useFeatureScaling is enabled, we will standardize the training features by dividing by the variance of each column (without subtracting the mean), and train the model in the scaled space. Then we transform the coefficients from the scaled space to the original scale as GLMNET and LIBSVM do. Currently, it's only enabled in LogisticRegressionWithLBFGS -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2979) Improve the convergence rate by minimizing the condition number in LOR with LBFGS
[ https://issues.apache.org/jira/browse/SPARK-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-2979: --- Summary: Improve the convergence rate by minimizing the condition number in LOR with LBFGS (was: Improve the convergence rate by minimize the condition number in LOR with LBFGS) Improve the convergence rate by minimizing the condition number in LOR with LBFGS - Key: SPARK-2979 URL: https://issues.apache.org/jira/browse/SPARK-2979 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai Scaling to minimize the condition number: During the optimization process, the convergence (rate) depends on the condition number of the training dataset. Scaling the variables often reduces this condition number, thus improving the convergence rate dramatically. Without reducing the condition number, some training datasets mixing columns with different scales may not be able to converge. GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return the weights in the original scale. See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf Here, if useFeatureScaling is enabled, we will standardize the training features by dividing by the variance of each column (without subtracting the mean), and train the model in the scaled space. Then we transform the coefficients from the scaled space to the original scale as GLMNET and LIBSVM do. Currently, it's only enabled in LogisticRegressionWithLBFGS -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
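The scheme described in SPARK-2979 (scale each feature column without centering, train in the scaled space, then map the coefficients back to the original scale) can be sketched in plain Python. This is a minimal illustration, not Spark's implementation; it assumes scaling by the per-column standard deviation, with hypothetical function names.

```python
import math

def scale_features(X):
    """Divide each column of X (list of rows) by its standard deviation,
    without subtracting the mean, as described for useFeatureScaling.
    Returns the scaled matrix and the per-column stds."""
    n, d = len(X), len(X[0])
    stds = []
    for j in range(d):
        col = [row[j] for row in X]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        stds.append(math.sqrt(var) or 1.0)  # guard all-constant columns
    Xs = [[row[j] / stds[j] for j in range(d)] for row in X]
    return Xs, stds

def unscale_weights(w_scaled, stds):
    """Transform coefficients trained in the scaled space back to the
    original scale: since x' = x / s, w_orig[j] = w_scaled[j] / s[j]."""
    return [w / s for w, s in zip(w_scaled, stds)]
```

Because the model's prediction is a dot product, dividing the features by `s` means the recovered original-scale weights are the scaled-space weights divided by the same `s`, which is the back-transformation GLMNET and LIBSVM are said to perform.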
[jira] [Updated] (SPARK-2515) Hypothesis testing
[ https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2515: - Fix Version/s: 1.1.0 Hypothesis testing -- Key: SPARK-2515 URL: https://issues.apache.org/jira/browse/SPARK-2515 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Doris Xin Fix For: 1.1.0 Support common statistical tests in Spark MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2515) Chi-squared test
[ https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin updated SPARK-2515: - Summary: Chi-squared test (was: Hypothesis testing) Chi-squared test Key: SPARK-2515 URL: https://issues.apache.org/jira/browse/SPARK-2515 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Doris Xin Fix For: 1.1.0 Support common statistical tests in Spark MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-2515) Chi-squared test
[ https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-2515. Resolution: Implemented Target Version/s: 1.1.0 Chi-squared test Key: SPARK-2515 URL: https://issues.apache.org/jira/browse/SPARK-2515 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Doris Xin Fix For: 1.1.0 Support common statistical tests in Spark MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2980) Python support for chi-squared test
Doris Xin created SPARK-2980: Summary: Python support for chi-squared test Key: SPARK-2980 URL: https://issues.apache.org/jira/browse/SPARK-2980 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Doris Xin -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2980) Python support for chi-squared test
[ https://issues.apache.org/jira/browse/SPARK-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2980: - Assignee: (was: Doris Xin) Python support for chi-squared test --- Key: SPARK-2980 URL: https://issues.apache.org/jira/browse/SPARK-2980 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Doris Xin -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2934) Adding LogisticRegressionWithLBFGS for training with LBFGS Optimizer
[ https://issues.apache.org/jira/browse/SPARK-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2934: - Assignee: DB Tsai Adding LogisticRegressionWithLBFGS for training with LBFGS Optimizer -- Key: SPARK-2934 URL: https://issues.apache.org/jira/browse/SPARK-2934 Project: Spark Issue Type: New Feature Components: MLlib Reporter: DB Tsai Assignee: DB Tsai -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2844) Existing JVM Hive Context not correctly used in Python Hive Context
[ https://issues.apache.org/jira/browse/SPARK-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2844. - Resolution: Fixed Fix Version/s: 1.1.0 Existing JVM Hive Context not correctly used in Python Hive Context --- Key: SPARK-2844 URL: https://issues.apache.org/jira/browse/SPARK-2844 Project: Spark Issue Type: Bug Components: PySpark, SQL Reporter: Ahir Reddy Assignee: Ahir Reddy Fix For: 1.1.0 Unlike the SQLContext, passing an existing JVM HiveContext object into the Python HiveContext constructor does not actually re-use that object. Instead it will create a new HiveContext. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2590) Add config property to disable incremental collection used in Thrift server
[ https://issues.apache.org/jira/browse/SPARK-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2590. - Resolution: Fixed Fix Version/s: 1.1.0 Add config property to disable incremental collection used in Thrift server --- Key: SPARK-2590 URL: https://issues.apache.org/jira/browse/SPARK-2590 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker Fix For: 1.1.0 {{SparkSQLOperationManager}} uses {{RDD.toLocalIterator}} to collect the result set one partition at a time. This is useful to avoid OOM when the result is large, but introduces extra job scheduling costs as each partition is collected with a separate job. Users may want to disable this when the result set is expected to be small. *UPDATE* Incremental collection hurts performance because tasks of the last stage of the RDD DAG generated from the SQL query plan are executed sequentially. Thus we decided to disable it by default. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
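The incremental collection that SPARK-2590 makes configurable — `RDD.toLocalIterator` pulling the result set one partition at a time, each via its own job — can be sketched as a plain-Python generator. The names here (`to_local_iterator`, `run_job`) are hypothetical stand-ins for the Spark machinery, only meant to show why memory use drops while scheduling cost rises.

```python
def to_local_iterator(partitions, run_job):
    """Sketch of incremental collection: each partition is fetched by a
    separate job (one run_job call), so only one partition's rows need
    to be held on the driver at a time -- at the cost of scheduling one
    job per partition instead of a single collect()."""
    for part in partitions:
        for row in run_job(part):  # a separate job per partition
            yield row
```

Disabling incremental collection corresponds to running one job over all partitions at once: cheaper to schedule, but the whole result must fit in driver memory.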
[jira] [Resolved] (SPARK-2965) Fix HashOuterJoin output nullabilities.
[ https://issues.apache.org/jira/browse/SPARK-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2965. - Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Takuya Ueshin Fix HashOuterJoin output nullabilities. --- Key: SPARK-2965 URL: https://issues.apache.org/jira/browse/SPARK-2965 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0 Output attributes of opposite side of {{OuterJoin}} should be nullable. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2968) Fix nullabilities of Explode.
[ https://issues.apache.org/jira/browse/SPARK-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2968. - Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Takuya Ueshin Fix nullabilities of Explode. - Key: SPARK-2968 URL: https://issues.apache.org/jira/browse/SPARK-2968 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0 Output nullabilities of {{Explode}} could be determined by {{ArrayType.containsNull}} or {{MapType.valueContainsNull}}. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2650) Caching tables larger than memory causes OOMs
[ https://issues.apache.org/jira/browse/SPARK-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2650. - Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Michael Armbrust (was: Cheng Lian) Target Version/s: 1.1.0 (was: 1.2.0) Caching tables larger than memory causes OOMs - Key: SPARK-2650 URL: https://issues.apache.org/jira/browse/SPARK-2650 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0, 1.0.1 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical Fix For: 1.1.0 The logic for setting up the initial column buffers is different for Spark SQL compared to Shark, and I'm seeing OOMs when caching tables that are larger than available memory (where Shark was okay). Two suspicious things: the initialSize is always set to 0, so we always go with the default. The default looks like it was copied from code like 10 * 1024 * 1024... but in Spark SQL it's 10 * 102 * 1024. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
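The suspected typo in SPARK-2650 is easy to check numerically: `10 * 102 * 1024` is roughly one tenth of the `10 * 1024 * 1024` (10 MB) it appears to have been copied from, which would explain undersized initial column buffers.

```python
# The two constants quoted in the report.
suspected = 10 * 102 * 1024    # value reportedly found in Spark SQL
intended  = 10 * 1024 * 1024   # the 10 MB it looks to have been copied from

print(suspected)               # 1044480 bytes, about 1 MB
print(intended)                # 10485760 bytes, exactly 10 MB
print(intended // suspected)   # the suspect default is ~10x too small
```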
[jira] [Created] (SPARK-2981) PartitionStrategy: VertexID hash overflow
Larry Xiao created SPARK-2981: - Summary: PartitionStrategy: VertexID hash overflow Key: SPARK-2981 URL: https://issues.apache.org/jira/browse/SPARK-2981 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.2 Reporter: Larry Xiao In PartitionStrategy.scala a PartitionID is calculated by multiplying the VertexId by a mixingPrime (1125899906842597L), then casting to Int, and taking it mod numParts. The Long overflows, and when cast to Int: {quote} scala> (1125899906842597L*1).toInt res1: Int = -27 scala> (1125899906842597L*2).toInt res2: Int = -54 scala> (1125899906842597L*3).toInt res3: Int = -81 {quote} As the cast produces numbers that are multiples of 3, the partitioning is not usable when partitioning into multiples of 3. For example, when you partition into 6 or 9 parts: {quote} 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) {quote} I think the solution is to cast after the mod. {quote} scala> (1125899906842597L*3) res4: Long = 3377699720527791 scala> (1125899906842597L*3) % 9 res5: Long = 3 scala> ((1125899906842597L*3) % 9).toInt res5: Int = 3 {quote} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2981) PartitionStrategy: VertexID hash overflow
[ https://issues.apache.org/jira/browse/SPARK-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-2981: -- Description: In PartitionStrategy.scala a PartitionID is calculated by multiplying the VertexId by a mixingPrime (1125899906842597L), then casting to Int, and taking it mod numParts. The Long overflows, and when cast to Int: {quote} scala> (1125899906842597L*1).toInt res1: Int = -27 scala> (1125899906842597L*2).toInt res2: Int = -54 scala> (1125899906842597L*3).toInt res3: Int = -81 {quote} As the cast produces numbers that are multiples of 3, the partitioning is not usable when partitioning into multiples of 3. For example, when you partition into 6 or 9 parts: {quote} 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) so the vertices are partitioned to 0,3 for 6; and 0 for 9 {quote} I think the solution is to cast after the mod. {quote} scala> (1125899906842597L*3) res4: Long = 3377699720527791 scala> (1125899906842597L*3) % 9 res5: Long = 3 scala> ((1125899906842597L*3) % 9).toInt res5: Int = 3 {quote} was: In PartitionStrategy.scala a PartitionID is calculated by multiplying the VertexId by a mixingPrime (1125899906842597L), then casting to Int, and taking it mod numParts. The Long overflows, and when cast to Int: {quote} scala> (1125899906842597L*1).toInt res1: Int = -27 scala> (1125899906842597L*2).toInt res2: Int = -54 scala> (1125899906842597L*3).toInt res3: Int = -81 {quote} As the cast produces numbers that are multiples of 3, the partitioning is not usable when partitioning into multiples of 3.
For example, when you partition into 6 or 9 parts: {quote} 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) {quote} I think the solution is to cast after the mod. {quote} scala> (1125899906842597L*3) res4: Long = 3377699720527791 scala> (1125899906842597L*3) % 9 res5: Long = 3 scala> ((1125899906842597L*3) % 9).toInt res5: Int = 3 {quote} PartitionStrategy: VertexID hash overflow - Key: SPARK-2981 URL: https://issues.apache.org/jira/browse/SPARK-2981 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.2 Reporter: Larry Xiao Labels: newbie Original Estimate: 1h Remaining Estimate: 1h In PartitionStrategy.scala a PartitionID is calculated by multiplying the VertexId by a mixingPrime (1125899906842597L), then casting to Int, and taking it mod numParts. The Long overflows, and when cast to Int: {quote} scala> (1125899906842597L*1).toInt res1: Int = -27 scala> (1125899906842597L*2).toInt res2: Int = -54 scala> (1125899906842597L*3).toInt res3: Int = -81 {quote} As the cast produces numbers that are multiples of 3, the partitioning is not usable when partitioning into multiples of 3.
For example, when you partition into 6 or 9 parts: {quote} 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) so the vertices are partitioned to 0,3 for 6; and 0 for 9 {quote} I think the solution is to cast after the mod. {quote} scala> (1125899906842597L*3) res4: Long = 3377699720527791 scala> (1125899906842597L*3) % 9 res5: Long = 3 scala> ((1125899906842597L*3) % 9).toInt res5: Int = 3 {quote} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
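The overflow and the proposed fix can be reproduced outside Scala. Python integers do not overflow, so the sketch below simulates Scala's `Long.toInt` with a 32-bit two's-complement truncation; the function names are hypothetical, and note that Python's `%` on negative numbers differs from Scala's, though the loss-of-information point is the same.

```python
MIX = 1125899906842597  # the mixingPrime from PartitionStrategy.scala

def to_int32(x):
    """Simulate Scala's Long-to-Int cast: truncate to 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

def partition_cast_first(vid, num_parts):
    # Buggy order: cast to Int first, then mod. The cast discards the
    # high bits, so e.g. vid=1,2,3 all map to multiples of -27.
    return to_int32(vid * MIX) % num_parts

def partition_mod_first(vid, num_parts):
    # Proposed fix: take the mod in 64-bit space, cast afterwards;
    # the result is already in [0, num_parts) so the cast is safe.
    return to_int32((vid * MIX) % num_parts)
```

This reproduces the REPL values in the report: `to_int32(MIX * 1)` is -27, `to_int32(MIX * 3)` is -81, and `partition_mod_first(3, 9)` recovers the expected partition 3, whereas the cast-first order collapses many vertex IDs onto partition 0 when `num_parts` divides those multiples.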
[jira] [Resolved] (SPARK-2826) Reduce the Memory Copy for HashOuterJoin
[ https://issues.apache.org/jira/browse/SPARK-2826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2826. - Resolution: Fixed Fix Version/s: 1.1.0 Reduce the Memory Copy for HashOuterJoin Key: SPARK-2826 URL: https://issues.apache.org/jira/browse/SPARK-2826 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor Fix For: 1.1.0 This is actually a follow-up to https://issues.apache.org/jira/browse/SPARK-2212 ; the previous implementation has a potential extra memory copy. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2934) Adding LogisticRegressionWithLBFGS for training with LBFGS Optimizer
[ https://issues.apache.org/jira/browse/SPARK-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2934. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1862 [https://github.com/apache/spark/pull/1862] Adding LogisticRegressionWithLBFGS for training with LBFGS Optimizer -- Key: SPARK-2934 URL: https://issues.apache.org/jira/browse/SPARK-2934 Project: Spark Issue Type: New Feature Components: MLlib Reporter: DB Tsai Assignee: DB Tsai Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2982) Glitch of spark streaming
dai zhiyuan created SPARK-2982: -- Summary: Glitch of spark streaming Key: SPARK-2982 URL: https://issues.apache.org/jira/browse/SPARK-2982 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: dai zhiyuan Spark Streaming task startup times are tightly clustered, which causes network and CPU usage spikes (glitches), while the CPU and network sit idle much of the remaining time. This is very wasteful of system resources. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093746#comment-14093746 ] Jianshi Huang commented on SPARK-2890: -- My use case: The result will be parsed into (id, type, start, end, properties) tuples. Properties might or might not contain any of (id, type, start, end). So it's easier just to list them at the end and not to worry about duplicated names. Jianshi Spark SQL should allow SELECT with duplicated columns - Key: SPARK-2890 URL: https://issues.apache.org/jira/browse/SPARK-2890 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang Spark reported a java.lang.IllegalArgumentException with message: java.lang.IllegalArgumentException: requirement failed: Found fields with the same name. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.catalyst.types.StructType.<init>(dataTypes.scala:317) at org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306) at org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83) at org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433) After trial and error, it seems it's caused by duplicated columns in my select clause. I made the duplication on purpose for my code to parse correctly. I think we should allow users to specify duplicated columns as return values. Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2923) Implement some basic linalg operations in MLlib
[ https://issues.apache.org/jira/browse/SPARK-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2923. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1849 [https://github.com/apache/spark/pull/1849] Implement some basic linalg operations in MLlib --- Key: SPARK-2923 URL: https://issues.apache.org/jira/browse/SPARK-2923 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.1.0 We use breeze for linear algebra operations. Breeze operations are user-friendly but there are some concerns: 1. creating temp objects, e.g., `val z = a * x + b * y` 2. multi-method is not used in some operators, e.g., `axpy`. If we pass in SparseVector as a generic Vector, it will use activeIterator, which is slow 3. calling native BLAS if it is available, which might not be good for level-1 methods Having some basic BLAS operations implemented in MLlib can help simplify the current implementation and improve some performance. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
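The level-1 operations SPARK-2923 mentions, such as `axpy` (y ← a·x + y), and the reason type dispatch matters for sparse vectors can be sketched in plain Python. These are illustrative stand-ins, not MLlib's code: the sparse variant touches only the active entries, which is exactly what is lost when a SparseVector is handled through a generic iterator path.

```python
def axpy_dense(a, x, y):
    """y += a * x for dense vectors given as equal-length lists (in place)."""
    for i in range(len(y)):
        y[i] += a * x[i]

def axpy_sparse(a, indices, values, y):
    """y += a * x for a sparse x given as parallel (indices, values) lists.
    Only the active entries are visited, so the cost is O(nnz) rather
    than O(len(y)) -- the benefit of dispatching on the concrete type."""
    for i, v in zip(indices, values):
        y[i] += a * v
```

A dense call like `axpy_dense(2.0, [1.0, 0.0, 3.0], y)` walks the whole vector, while the sparse form skips the zeros entirely; implementing a few such kernels directly avoids the temporary objects a naive `y = a * x + y` expression allocates.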