[jira] [Updated] (SPARK-2963) There is no documentation about building SparkSQL

2014-08-11 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-2963:
--

Description: 
Currently, if we'd like to use the ThriftServer or CLI for SparkSQL, we need to use 
the -Phive-thriftserver option when building, but this is only implicit.
I think we need to describe how to build it.

  was:
Currently, if we'd like to use SparkSQL, we need to use the -Phive-thriftserver 
option when building, but this is only implicit.
I think we need to describe how to build it.


 There is no documentation about building SparkSQL
 --

 Key: SPARK-2963
 URL: https://issues.apache.org/jira/browse/SPARK-2963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Kousuke Saruta

 Currently, if we'd like to use the ThriftServer or CLI for SparkSQL, we need to 
 use the -Phive-thriftserver option when building, but this is only implicit.
 I think we need to describe how to build it.
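
A minimal example of the kind of build instruction being asked for (a sketch only: the exact profile set depends on the target environment, and additional profiles such as a Hive or Hadoop-version profile may be required alongside -Phive-thriftserver):

{noformat}
# Build Spark with the SQL Thrift server / CLI module enabled
mvn -Phive-thriftserver -DskipTests clean package
{noformat}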






[jira] [Updated] (SPARK-2963) There is no documentation about building ThriftServer and CLI for SparkSQL

2014-08-11 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-2963:
--

Summary: There is no documentation about building ThriftServer and CLI for 
SparkSQL  (was: There is no documentation about building SparkSQL)

 There is no documentation about building ThriftServer and CLI for SparkSQL
 ---

 Key: SPARK-2963
 URL: https://issues.apache.org/jira/browse/SPARK-2963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Kousuke Saruta

 Currently, if we'd like to use the ThriftServer or CLI for SparkSQL, we need to 
 use the -Phive-thriftserver option when building, but this is only implicit.
 I think we need to describe how to build it.






[jira] [Updated] (SPARK-2963) There is no documentation about building to use HiveServer and CLI for SparkSQL

2014-08-11 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-2963:
--

Summary: There is no documentation about building to use HiveServer and CLI 
for SparkSQL  (was: There is no documentation about building ThriftServer and CLI 
for SparkSQL)

 There is no documentation about building to use HiveServer and CLI for SparkSQL
 

 Key: SPARK-2963
 URL: https://issues.apache.org/jira/browse/SPARK-2963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Kousuke Saruta

 Currently, if we'd like to use the ThriftServer or CLI for SparkSQL, we need to 
 use the -Phive-thriftserver option when building, but this is only implicit.
 I think we need to describe how to build it.






[jira] [Updated] (SPARK-2963) There is no documentation about building to use HiveServer and CLI for SparkSQL

2014-08-11 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-2963:
--

Description: 
Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use 
the -Phive-thriftserver option when building, but this is only implicit.
I think we need to describe how to build it.

  was:
Currently, if we'd like to use the ThriftServer or CLI for SparkSQL, we need to use 
the -Phive-thriftserver option when building, but this is only implicit.
I think we need to describe how to build it.


 There is no documentation about building to use HiveServer and CLI for SparkSQL
 

 Key: SPARK-2963
 URL: https://issues.apache.org/jira/browse/SPARK-2963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Kousuke Saruta

 Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use 
 the -Phive-thriftserver option when building, but this is only implicit.
 I think we need to describe how to build it.






[jira] [Comment Edited] (SPARK-2204) Scheduler for Mesos in fine-grained mode launches tasks on wrong executors

2014-08-11 Thread Xu Zhongxing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092376#comment-14092376
 ] 

Xu Zhongxing edited comment on SPARK-2204 at 8/11/14 6:49 AM:
--

I encountered this issue again when using Spark 1.0.2, Mesos 0.18.1, and the 
spark-cassandra-connector master branch.

Maybe this is not fixed on some failure/exception paths.

I run Spark in coarse-grained mode. There are some exceptions thrown at the 
executors, but the Spark driver keeps waiting and printing repeatedly:

TRACE [spark-akka.actor.default-dispatcher-17] 2014-08-11 10:57:32,998 Logging.scala (line 66) Checking for hosts with no recent heart beats in BlockManagerMaster.

The mesos master WARNING log:
W0811 10:32:58.172175 1646 master.cpp:2103] Ignoring unknown exited executor 20140808-113811-858302656-5050-1645-2 on slave 20140808-113811-858302656-5050-1645-2 (ndb9)
W0811 10:32:58.181217 1649 master.cpp:2103] Ignoring unknown exited executor 20140808-113811-858302656-5050-1645-5 on slave 20140808-113811-858302656-5050-1645-5 (ndb5)
W0811 10:32:58.277014 1650 master.cpp:2103] Ignoring unknown exited executor 20140808-113811-858302656-5050-1645-3 on slave 20140808-113811-858302656-5050-1645-3 (ndb6)
W0811 10:32:58.344130 1648 master.cpp:2103] Ignoring unknown exited executor 20140808-113811-858302656-5050-1645-0 on slave 20140808-113811-858302656-5050-1645-0 (ndb0)
W0811 10:32:58.354117 1651 master.cpp:2103] Ignoring unknown exited executor 20140804-095254-505981120-5050-20258-11 on slave 20140804-095254-505981120-5050-20258-11 (ndb2)
W0811 10:32:58.550233 1647 master.cpp:2103] Ignoring unknown exited executor 20140804-172212-505981120-5050-26571-2 on slave 20140804-172212-505981120-5050-26571-2 (ndb3)
W0811 10:32:58.793258 1653 master.cpp:2103] Ignoring unknown exited executor 20140804-095254-505981120-5050-20258-19 on slave 20140804-095254-505981120-5050-20258-19 (ndb1)
W0811 10:32:58.904842 1652 master.cpp:2103] Ignoring unknown exited executor 20140804-172212-505981120-5050-26571-0 on slave 20140804-172212-505981120-5050-26571-0 (ndb4)

Some other logs are at: 
https://github.com/datastax/spark-cassandra-connector/issues/134



was (Author: xuzhongxing):
I encountered this issue again when using Spark 1.0.2, Mesos 0.18.1, and the 
spark-cassandra-connector master branch.

I run Spark in coarse-grained mode. There are some exceptions thrown at the 
executors, but the Spark driver keeps waiting and printing repeatedly:

TRACE [spark-akka.actor.default-dispatcher-17] 2014-08-11 10:57:32,998 Logging.scala (line 66) Checking for hosts with no recent heart beats in BlockManagerMaster.

The mesos master WARNING log:
W0811 10:32:58.172175 1646 master.cpp:2103] Ignoring unknown exited executor 20140808-113811-858302656-5050-1645-2 on slave 20140808-113811-858302656-5050-1645-2 (ndb9)
W0811 10:32:58.181217 1649 master.cpp:2103] Ignoring unknown exited executor 20140808-113811-858302656-5050-1645-5 on slave 20140808-113811-858302656-5050-1645-5 (ndb5)
W0811 10:32:58.277014 1650 master.cpp:2103] Ignoring unknown exited executor 20140808-113811-858302656-5050-1645-3 on slave 20140808-113811-858302656-5050-1645-3 (ndb6)
W0811 10:32:58.344130 1648 master.cpp:2103] Ignoring unknown exited executor 20140808-113811-858302656-5050-1645-0 on slave 20140808-113811-858302656-5050-1645-0 (ndb0)
W0811 10:32:58.354117 1651 master.cpp:2103] Ignoring unknown exited executor 20140804-095254-505981120-5050-20258-11 on slave 20140804-095254-505981120-5050-20258-11 (ndb2)
W0811 10:32:58.550233 1647 master.cpp:2103] Ignoring unknown exited executor 20140804-172212-505981120-5050-26571-2 on slave 20140804-172212-505981120-5050-26571-2 (ndb3)
W0811 10:32:58.793258 1653 master.cpp:2103] Ignoring unknown exited executor 20140804-095254-505981120-5050-20258-19 on slave 20140804-095254-505981120-5050-20258-19 (ndb1)
W0811 10:32:58.904842 1652 master.cpp:2103] Ignoring unknown exited executor 20140804-172212-505981120-5050-26571-0 on slave 20140804-172212-505981120-5050-26571-0 (ndb4)

Some other logs are at: 
https://github.com/datastax/spark-cassandra-connector/issues/134


 Scheduler for Mesos in fine-grained mode launches tasks on wrong executors
 --

 Key: SPARK-2204
 URL: https://issues.apache.org/jira/browse/SPARK-2204
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.0
Reporter: Sebastien Rainville
Assignee: Sebastien Rainville
Priority: Blocker
 Fix For: 1.0.1, 1.1.0


 MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is 
 assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning 
 task lists in the same order as the offers it was passed, but in 

[jira] [Created] (SPARK-2964) Wrong silent option in spark-sql script

2014-08-11 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-2964:
-

 Summary: Wrong silent option in spark-sql script
 Key: SPARK-2964
 URL: https://issues.apache.org/jira/browse/SPARK-2964
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Priority: Minor


In the spark-sql script, the -s option is handled as the silent option, but 
org.apache.hadoop.hive.cli.OptionProcessor interprets -S (upper case) as the 
silent mode option.






[jira] [Created] (SPARK-2965) Fix HashOuterJoin output nullabilities.

2014-08-11 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-2965:


 Summary: Fix HashOuterJoin output nullabilities.
 Key: SPARK-2965
 URL: https://issues.apache.org/jira/browse/SPARK-2965
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin


Output attributes of the opposite side of an {{OuterJoin}} should be nullable.
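
A small self-contained sketch of the intended semantics (illustrative names only, not Catalyst's actual classes): after an outer join, attributes coming from the side that may have no matching row must be marked nullable.

{code}
// Toy model of join output nullability; Attr and JoinType are stand-ins, not Spark SQL's API.
case class Attr(name: String, nullable: Boolean) {
  def asNullable: Attr = copy(nullable = true)
}

sealed trait JoinType
case object LeftOuter  extends JoinType
case object RightOuter extends JoinType
case object FullOuter  extends JoinType

def outputAttrs(joinType: JoinType, left: Seq[Attr], right: Seq[Attr]): Seq[Attr] =
  joinType match {
    case LeftOuter  => left ++ right.map(_.asNullable)     // right side may be absent
    case RightOuter => left.map(_.asNullable) ++ right     // left side may be absent
    case FullOuter  => (left ++ right).map(_.asNullable)   // either side may be absent
  }
{code}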






[jira] [Updated] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib

2014-08-11 Thread Yu Ishikawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Ishikawa updated SPARK-2966:
---

Summary: Add an approximation algorithm for hierarchical clustering to 
MLlib  (was: Add an approximation algorithm for hierarchical clustering 
algorithm to MLlib)

 Add an approximation algorithm for hierarchical clustering to MLlib
 ---

 Key: SPARK-2966
 URL: https://issues.apache.org/jira/browse/SPARK-2966
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Yu Ishikawa
Priority: Minor

 A hierarchical clustering algorithm is a useful unsupervised learning method.
 Koga et al. proposed a highly scalable hierarchical clustering algorithm in 
 (1).
 I would like to implement this method.
 I suggest adding an approximate hierarchical clustering algorithm to MLlib.
 I'd like this to be assigned to me.
 h3. Reference
 # Fast agglomerative hierarchical clustering algorithm using 
 Locality-Sensitive Hashing
 http://dl.acm.org/citation.cfm?id=1266811






[jira] [Created] (SPARK-2967) Several SQL unit tests fail when sort-based shuffle is enabled

2014-08-11 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-2967:
--

 Summary: Several SQL unit tests fail when sort-based shuffle is 
enabled
 Key: SPARK-2967
 URL: https://issues.apache.org/jira/browse/SPARK-2967
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Saisai Shao


Several SQLQuerySuite unit tests fail when sort-based shuffle is enabled. 
It seems the SQL tests use GenericMutableRow, which eventually makes every entry in 
ExternalSorter's internal buffer refer to the same object because of the object's 
mutability. It seems the row should be copied when fed into ExternalSorter (see the 
sketch after the error output below).

The errors are shown below; though there are many failures, I only pasted part of them:

{noformat}
 SQLQuerySuite:
 - SPARK-2041 column name equals tablename
 - SPARK-2407 Added Parser of SQL SUBSTR()
 - index into array
 - left semi greater than predicate
 - index into array of arrays
 - agg *** FAILED ***
   Results do not match for query:
   Aggregate ['a], ['a,SUM('b) AS c1#38]
UnresolvedRelation None, testData2, None
   
   == Analyzed Plan ==
   Aggregate [a#4], [a#4,SUM(CAST(b#5, LongType)) AS c1#38L]
SparkLogicalPlan (ExistingRdd [a#4,b#5], MapPartitionsRDD[7] at 
mapPartitions at basicOperators.scala:215)
   
   == Physical Plan ==
   Aggregate false, [a#4], [a#4,SUM(PartialSum#40L) AS c1#38L]
Exchange (HashPartitioning [a#4], 200)
 Aggregate true, [a#4], [a#4,SUM(CAST(b#5, LongType)) AS PartialSum#40L]
  ExistingRdd [a#4,b#5], MapPartitionsRDD[7] at mapPartitions at 
basicOperators.scala:215
   
   == Results ==
   !== Correct Answer - 3 ==   == Spark Answer - 3 ==
   !Vector(1, 3)   [1,3]
   !Vector(2, 3)   [1,3]
   !Vector(3, 3)   [1,3] (QueryTest.scala:53)
 - aggregates with nulls
 - select *
 - simple select
 - sorting *** FAILED ***
   Results do not match for query:
   Sort ['a ASC,'b ASC]
Project [*]
 UnresolvedRelation None, testData2, None
   
   == Analyzed Plan ==
   Sort [a#4 ASC,b#5 ASC]
Project [a#4,b#5]
 SparkLogicalPlan (ExistingRdd [a#4,b#5], MapPartitionsRDD[7] at 
mapPartitions at basicOperators.scala:215)
   
   == Physical Plan ==
   Sort [a#4 ASC,b#5 ASC], true
Exchange (RangePartitioning [a#4 ASC,b#5 ASC], 200)
 ExistingRdd [a#4,b#5], MapPartitionsRDD[7] at mapPartitions at 
basicOperators.scala:215
   
   == Results ==
   !== Correct Answer - 6 ==   == Spark Answer - 6 ==
   !Vector(1, 1)   [3,2]
   !Vector(1, 2)   [3,2]
   !Vector(2, 1)   [3,2]
   !Vector(2, 2)   [3,2]
   !Vector(3, 1)   [3,2]
   !Vector(3, 2)   [3,2] (QueryTest.scala:53)
 - limit
 - average
 - average overflow *** FAILED ***
   Results do not match for query:
   Aggregate ['b], [AVG('a) AS c0#90,'b]
UnresolvedRelation None, largeAndSmallInts, None
   
   == Analyzed Plan ==
   Aggregate [b#3], [AVG(CAST(a#2, LongType)) AS c0#90,b#3]
SparkLogicalPlan (ExistingRdd [a#2,b#3], MapPartitionsRDD[4] at 
mapPartitions at basicOperators.scala:215)
   
   == Physical Plan ==
   Aggregate false, [b#3], [(CAST(SUM(PartialSum#93L), DoubleType) / 
CAST(SUM(PartialCount#94L), DoubleType)) AS c0#90,b#3]
Exchange (HashPartitioning [b#3], 200)
 Aggregate true, [b#3], [b#3,COUNT(CAST(a#2, LongType)) AS 
PartialCount#94L,SUM(CAST(a#2, LongType)) AS PartialSum#93L]
  ExistingRdd [a#2,b#3], MapPartitionsRDD[4] at mapPartitions at 
basicOperators.scala:215
   
   == Results ==
   !== Correct Answer - 2 ==   == Spark Answer - 2 ==
   !Vector(2.0, 2) [2.147483645E9,1]
   !Vector(2.147483645E9, 1)   [2.147483645E9,1] (QueryTest.scala:53)
{noformat}
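
The aliasing problem described above can be reproduced outside Spark with a few lines of Scala (an illustrative sketch only; MutableRow here is a stand-in, not Spark's GenericMutableRow):

{code}
import scala.collection.mutable.ArrayBuffer

class MutableRow(var value: Int)

val row      = new MutableRow(0)
val buffered = ArrayBuffer.empty[MutableRow]   // models buffering without copying
val copied   = ArrayBuffer.empty[MutableRow]   // models copying before buffering

for (v <- 1 to 3) {
  row.value = v
  buffered += row                         // every entry aliases the same object
  copied   += new MutableRow(row.value)   // defensive copy preserves each value
}

println(buffered.map(_.value))  // ArrayBuffer(3, 3, 3) -- all entries show the last value
println(copied.map(_.value))    // ArrayBuffer(1, 2, 3) -- what copying on insert gives
{code}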








[jira] [Updated] (SPARK-2969) Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull.

2014-08-11 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-2969:
-

Description: 
Make {{ScalaReflection}} be able to handle types like:

- Seq\[Int] as ArrayType(IntegerType, containsNull = false)
- Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true)
- Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false)
- Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, valueContainsNull 
= true)

  was:
Make {{ScalaReflection}} be able to handle:

- Seq\[Int] as ArrayType(IntegerType, containsNull = false)
- Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true)
- Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false)
- Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, valueContainsNull 
= true)


 Make ScalaReflection be able to handle MapType.containsNull and 
 MapType.valueContainsNull.
 --

 Key: SPARK-2969
 URL: https://issues.apache.org/jira/browse/SPARK-2969
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin

 Make {{ScalaReflection}} be able to handle types like:
 - Seq\[Int] as ArrayType(IntegerType, containsNull = false)
 - Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true)
 - Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false)
 - Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, 
 valueContainsNull = true)
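
A minimal sketch of how element/value nullability can be derived via reflection (an illustration of the requested behaviour, not Spark's ScalaReflection code): Scala value types such as Int can never hold null, while boxed java.lang types can.

{code}
import scala.reflect.runtime.universe._

// A value type (AnyVal) can never be null, so containsNull / valueContainsNull = false;
// boxed java.lang types are references and may be null.
def mayBeNull(tpe: Type): Boolean = !(tpe <:< typeOf[AnyVal])

println(mayBeNull(typeOf[Int]))               // false -> ArrayType(IntegerType, containsNull = false)
println(mayBeNull(typeOf[java.lang.Integer])) // true  -> ArrayType(IntegerType, containsNull = true)
println(mayBeNull(typeOf[Long]))              // false -> MapType(..., valueContainsNull = false)
println(mayBeNull(typeOf[java.lang.Long]))    // true  -> MapType(..., valueContainsNull = true)
{code}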






[jira] [Created] (SPARK-2969) Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull.

2014-08-11 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-2969:


 Summary: Make ScalaReflection be able to handle 
MapType.containsNull and MapType.valueContainsNull.
 Key: SPARK-2969
 URL: https://issues.apache.org/jira/browse/SPARK-2969
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin


Make {{ScalaReflection}} be able to handle:

- Seq\[Int] as ArrayType(IntegerType, containsNull = false)
- Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true)
- Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false)
- Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, valueContainsNull 
= true)






[jira] [Commented] (SPARK-2969) Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull.

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092671#comment-14092671
 ] 

Apache Spark commented on SPARK-2969:
-

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1889

 Make ScalaReflection be able to handle MapType.containsNull and 
 MapType.valueContainsNull.
 --

 Key: SPARK-2969
 URL: https://issues.apache.org/jira/browse/SPARK-2969
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin

 Make {{ScalaReflection}} be able to handle types like:
 - Seq\[Int] as ArrayType(IntegerType, containsNull = false)
 - Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true)
 - Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false)
 - Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, 
 valueContainsNull = true)






[jira] [Commented] (SPARK-2878) Inconsistent Kryo serialisation with custom Kryo Registrator

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092677#comment-14092677
 ] 

Apache Spark commented on SPARK-2878:
-

User 'GrahamDennis' has created a pull request for this issue:
https://github.com/apache/spark/pull/1890

 Inconsistent Kryo serialisation with custom Kryo Registrator
 

 Key: SPARK-2878
 URL: https://issues.apache.org/jira/browse/SPARK-2878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.2
 Environment: Linux RedHat EL 6, 4-node Spark cluster.
Reporter: Graham Dennis

 The custom Kryo Registrator (a class with the 
 org.apache.spark.serializer.KryoRegistrator trait) is not used with every 
 Kryo instance created, and this causes inconsistent serialisation and 
 deserialisation.
 The Kryo Registrator is sometimes not used because of a ClassNotFound 
 exception that only occurs if it *isn't* the Worker thread (of an Executor) 
 that tries to create the KryoRegistrator.
 A complete description of the problem and a project reproducing the problem 
 can be found at https://github.com/GrahamDennis/spark-kryo-serialisation
 I have currently only tested this with Spark 1.0.0, but will try to test 
 against 1.0.2.
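
For reference, the registration path being discussed looks roughly like this (a hedged example of the public Spark 1.x Kryo configuration, with an illustrative registrator class; the class must be loadable by whichever thread builds the Kryo instance, which is exactly what the ClassNotFound symptom above is about):

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Illustrative custom registrator.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Array[Double]])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
{code}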






[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled

2014-08-11 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092705#comment-14092705
 ] 

Kousuke Saruta commented on SPARK-2970:
---

I noticed it's not caused by the reason above.
It's caused by the shutdown hook of FileSystem.
I have already resolved it by executing the shutdown hook that stops the 
SparkSQL context before the shutdown hook for FileSystem.
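
A minimal sketch of that ordering idea, assuming Hadoop 2.x's ShutdownHookManager (which runs hooks in descending priority) and its FileSystem shutdown-hook priority of 10; this illustrates the approach, not the actual patch:

{code}
import org.apache.hadoop.util.ShutdownHookManager

// Register the SQL CLI cleanup above FileSystem's SHUTDOWN_HOOK_PRIORITY (10 in
// Hadoop 2.x) so the APPLICATION_COMPLETE marker is written before the FileSystem closes.
val fsShutdownPriority = 10
ShutdownHookManager.get().addShutdownHook(new Runnable {
  override def run(): Unit = {
    // stop the SparkContext / flush the event log here
  }
}, fsShutdownPriority + 1)
{code}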

 spark-sql script ends with IOException when EventLogging is enabled
 ---

 Key: SPARK-2970
 URL: https://issues.apache.org/jira/browse/SPARK-2970
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
 Environment: CDH5.1.0 (Hadoop 2.3.0)
Reporter: Kousuke Saruta

 When the spark-sql script runs with spark.eventLog.enabled set to true, it ends with 
 an IOException because FileLogger cannot create the APPLICATION_COMPLETE file in 
 HDFS.
 I think it's because FileSystem is closed by HiveSessionImplWithUGI.
 It has the following code.
 {code}
 public void close() throws HiveSQLException {
   try {
     acquire();
     ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi);
     cancelDelegationToken();
   } finally {
     release();
     super.close();
   }
 }
 {code}
 When using Hadoop 2.0+, ShimLoader.getHadoopShims() above returns Hadoop23Shim, 
 which extends HadoopShimSecure.
 HadoopShimSecure#closeAllForUGI is implemented as follows.
 {code}
 @Override
 public void closeAllForUGI(UserGroupInformation ugi) {
   try {
     FileSystem.closeAllForUGI(ugi);
   } catch (IOException e) {
     LOG.error("Could not clean up file-system handles for UGI: " + ugi, e);
   }
 }
 {code}






[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092710#comment-14092710
 ] 

Apache Spark commented on SPARK-2970:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/1891

 spark-sql script ends with IOException when EventLogging is enabled
 ---

 Key: SPARK-2970
 URL: https://issues.apache.org/jira/browse/SPARK-2970
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
 Environment: CDH5.1.0 (Hadoop 2.3.0)
Reporter: Kousuke Saruta

 When the spark-sql script runs with spark.eventLog.enabled set to true, it ends with 
 an IOException because FileLogger cannot create the APPLICATION_COMPLETE file in 
 HDFS.
 I think it's because FileSystem is closed by HiveSessionImplWithUGI.
 It has the following code.
 {code}
 public void close() throws HiveSQLException {
   try {
     acquire();
     ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi);
     cancelDelegationToken();
   } finally {
     release();
     super.close();
   }
 }
 {code}
 When using Hadoop 2.0+, ShimLoader.getHadoopShims() above returns Hadoop23Shim, 
 which extends HadoopShimSecure.
 HadoopShimSecure#closeAllForUGI is implemented as follows.
 {code}
 @Override
 public void closeAllForUGI(UserGroupInformation ugi) {
   try {
     FileSystem.closeAllForUGI(ugi);
   } catch (IOException e) {
     LOG.error("Could not clean up file-system handles for UGI: " + ugi, e);
   }
 }
 {code}






[jira] [Created] (SPARK-2971) Orphaned YARN ApplicationMaster lingers forever

2014-08-11 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-2971:


 Summary: Orphaned YARN ApplicationMaster lingers forever
 Key: SPARK-2971
 URL: https://issues.apache.org/jira/browse/SPARK-2971
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
 Environment: Python yarn client mode, Cloudera 5.1.0 on Ubuntu precise
Reporter: Shay Rojansky


We have cases where if CTRL-C is hit during a Spark job startup, a YARN 
ApplicationMaster is created but cannot connect to the driver (presumably 
because the driver has terminated). Once an AM enters this state it never exits 
it, and has to be manually killed in YARN.

Here's an excerpt from the AM logs:

{noformat}
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/yarn/nm/usercache/roji/filecache/40/spark-assembly-1.0.2-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/08/11 16:29:39 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/08/11 16:29:39 INFO SecurityManager: Changing view acls to: roji
14/08/11 16:29:39 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(roji)
14/08/11 16:29:40 INFO Slf4jLogger: Slf4jLogger started
14/08/11 16:29:40 INFO Remoting: Starting remoting
14/08/11 16:29:40 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075]
14/08/11 16:29:40 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075]
14/08/11 16:29:40 INFO RMProxy: Connecting to ResourceManager at 
master.grid.eaglerd.local/192.168.41.100:8030
14/08/11 16:29:40 INFO ExecutorLauncher: ApplicationAttemptId: 
appattempt_1407759736957_0014_01
14/08/11 16:29:40 INFO ExecutorLauncher: Registering the ApplicationMaster
14/08/11 16:29:40 INFO ExecutorLauncher: Waiting for Spark driver to be 
reachable.
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
master.grid.eaglerd.local:44911, retrying ...
{noformat}






[jira] [Commented] (SPARK-2962) Suboptimal scheduling in spark

2014-08-11 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092807#comment-14092807
 ] 

Mridul Muralidharan commented on SPARK-2962:


On further investigation:

a) The primary issue is a combination of SPARK-2089 and the current scheduler 
behavior for pendingTasksWithNoPrefs.
SPARK-2089 leads to very bad allocation of nodes - it particularly has an impact 
on bigger clusters.
It leads to a lot of blocks having no data-local or rack-local executors - causing 
them to end up in pendingTasksWithNoPrefs.

While loading data off DFS, when an executor is being scheduled, even though 
there might be rack-local schedules available for it (or, after waiting a while, 
data-local too - see (b) below), because of the current scheduler behavior, tasks 
from pendingTasksWithNoPrefs get scheduled, causing a large number of ANY 
tasks to be scheduled at the very onset.

The combination of these, with the lack of marginal alleviation via (b), is what 
caused the performance impact.

b) spark.scheduler.minRegisteredExecutorsRatio had not yet been used in the 
workload - so that might alleviate some of the non-deterministic waiting and 
ensure adequate executors are allocated! Thanks [~lirui]



 Suboptimal scheduling in spark
 --

 Key: SPARK-2962
 URL: https://issues.apache.org/jira/browse/SPARK-2962
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: All
Reporter: Mridul Muralidharan

 In findTask, irrespective of the 'locality' specified, pendingTasksWithNoPrefs 
 are always scheduled with PROCESS_LOCAL.
 pendingTasksWithNoPrefs contains tasks which currently do not have any alive 
 locations - but which could come in 'later': particularly relevant when the 
 spark app is just coming up and containers are still being added.
 This causes a large number of non-node-local tasks to be scheduled, incurring 
 significant network transfers in the cluster when running with non-trivial 
 datasets.
 The comment "// Look for no-pref tasks after rack-local tasks since they can 
 run anywhere." is misleading in the method code: locality levels start from 
 process_local down to any, and so no-prefs get scheduled much before rack.
 Also note that currentLocalityIndex is reset to the taskLocality returned by 
 this method - so returning PROCESS_LOCAL as the level will trigger wait times 
 again. (Was relevant before a recent change to the scheduler, and might be again 
 based on the resolution of this issue.)
 Found as part of writing a test for SPARK-2931
  






[jira] [Commented] (SPARK-1777) Pass cached blocks directly to disk if memory is not large enough

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092826#comment-14092826
 ] 

Apache Spark commented on SPARK-1777:
-

User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/1892

 Pass cached blocks directly to disk if memory is not large enough
 ---

 Key: SPARK-1777
 URL: https://issues.apache.org/jira/browse/SPARK-1777
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Andrew Or
Priority: Critical
 Fix For: 1.1.0

 Attachments: spark-1777-design-doc.pdf


 Currently in Spark we entirely unroll a partition and then check whether it 
 will cause us to exceed the storage limit. This has an obvious problem - if 
 the partition itself is enough to push us over the storage limit (and 
 eventually over the JVM heap), it will cause an OOM.
 This can happen in cases where a single partition is very large or when 
 someone is running examples locally with a small heap.
 https://github.com/apache/spark/blob/f6ff2a61d00d12481bfb211ae13d6992daacdcc2/core/src/main/scala/org/apache/spark/CacheManager.scala#L148
 We should think a bit about the most elegant way to fix this - it shares some 
 similarities with the external aggregation code.
 A simple idea is to periodically check the size of the buffer as we are 
 unrolling and see if we are over the memory limit. If we are we could prepend 
 the existing buffer to the iterator and write that entire thing out to disk.
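
A self-contained sketch of that last idea (illustrative only, not Spark's CacheManager): track an estimated size while unrolling, and once the limit is crossed hand back the already-buffered elements prepended to the rest of the iterator so the whole partition can go to disk.

{code}
import scala.collection.mutable.ArrayBuffer

sealed trait UnrollResult[T]
case class FullyCached[T](values: Seq[T])         extends UnrollResult[T]
case class SpillToDisk[T](remaining: Iterator[T]) extends UnrollResult[T]

def unroll[T](it: Iterator[T], sizeOf: T => Long, limit: Long): UnrollResult[T] = {
  val buffer   = new ArrayBuffer[T]
  var estimate = 0L
  while (it.hasNext) {
    val v = it.next()
    buffer += v
    estimate += sizeOf(v)
    if (estimate > limit) {
      // Over the cap: prepend what was buffered to the remaining iterator
      // so the entire partition can be streamed to disk instead of memory.
      return SpillToDisk(buffer.iterator ++ it)
    }
  }
  FullyCached(buffer.toVector)
}
{code}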






[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled

2014-08-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092860#comment-14092860
 ] 

Cheng Lian commented on SPARK-2970:
---

[~sarutak] Would you mind updating the issue description? Otherwise it can be 
confusing for people who don't see your comments below. Thanks.

 spark-sql script ends with IOException when EventLogging is enabled
 ---

 Key: SPARK-2970
 URL: https://issues.apache.org/jira/browse/SPARK-2970
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
 Environment: CDH5.1.0 (Hadoop 2.3.0)
Reporter: Kousuke Saruta

 When the spark-sql script runs with spark.eventLog.enabled set to true, it ends with 
 an IOException because FileLogger cannot create the APPLICATION_COMPLETE file in 
 HDFS.
 I think it's because FileSystem is closed by HiveSessionImplWithUGI.
 It has the following code.
 {code}
 public void close() throws HiveSQLException {
   try {
     acquire();
     ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi);
     cancelDelegationToken();
   } finally {
     release();
     super.close();
   }
 }
 {code}
 When using Hadoop 2.0+, ShimLoader.getHadoopShims() above returns Hadoop23Shim, 
 which extends HadoopShimSecure.
 HadoopShimSecure#closeAllForUGI is implemented as follows.
 {code}
 @Override
 public void closeAllForUGI(UserGroupInformation ugi) {
   try {
     FileSystem.closeAllForUGI(ugi);
   } catch (IOException e) {
     LOG.error("Could not clean up file-system handles for UGI: " + ugi, e);
   }
 }
 {code}






[jira] [Updated] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled

2014-08-11 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-2970:
--

Description: 
When the spark-sql script runs with spark.eventLog.enabled set to true, it ends with 
an IOException because FileLogger cannot create the APPLICATION_COMPLETE file in HDFS.

It's because a shutdown hook of SparkSQLCLIDriver is executed after the 
shutdown hook of org.apache.hadoop.fs.FileSystem is executed.

When spark.eventLog.enabled is true, the SparkSQLCLIDriver hook finally tries 
to create a file to mark the application as finished, but the FileSystem hook 
tries to close the FileSystem.

  was:
When the spark-sql script runs with spark.eventLog.enabled set to true, it ends with 
an IOException because FileLogger cannot create the APPLICATION_COMPLETE file in HDFS.
I think it's because FileSystem is closed by HiveSessionImplWithUGI.
It has the following code.

{code}
public void close() throws HiveSQLException {
  try {
    acquire();
    ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi);
    cancelDelegationToken();
  } finally {
    release();
    super.close();
  }
}
{code}

When using Hadoop 2.0+, ShimLoader.getHadoopShims() above returns Hadoop23Shim, 
which extends HadoopShimSecure.

HadoopShimSecure#closeAllForUGI is implemented as follows.

{code}
@Override
public void closeAllForUGI(UserGroupInformation ugi) {
  try {
    FileSystem.closeAllForUGI(ugi);
  } catch (IOException e) {
    LOG.error("Could not clean up file-system handles for UGI: " + ugi, e);
  }
}
{code}




 spark-sql script ends with IOException when EventLogging is enabled
 ---

 Key: SPARK-2970
 URL: https://issues.apache.org/jira/browse/SPARK-2970
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
 Environment: CDH5.1.0 (Hadoop 2.3.0)
Reporter: Kousuke Saruta

 When the spark-sql script runs with spark.eventLog.enabled set to true, it ends with 
 an IOException because FileLogger cannot create the APPLICATION_COMPLETE file in 
 HDFS.
 It's because a shutdown hook of SparkSQLCLIDriver is executed after the 
 shutdown hook of org.apache.hadoop.fs.FileSystem is executed.
 When spark.eventLog.enabled is true, the SparkSQLCLIDriver hook finally 
 tries to create a file to mark the application as finished, but the FileSystem 
 hook tries to close the FileSystem.






[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled

2014-08-11 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092879#comment-14092879
 ] 

Kousuke Saruta commented on SPARK-2970:
---

[~liancheng] Thank you for pointing out my mistake. I've modified the description.

 spark-sql script ends with IOException when EventLogging is enabled
 ---

 Key: SPARK-2970
 URL: https://issues.apache.org/jira/browse/SPARK-2970
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
 Environment: CDH5.1.0 (Hadoop 2.3.0)
Reporter: Kousuke Saruta

 When the spark-sql script runs with spark.eventLog.enabled set to true, it ends with 
 an IOException because FileLogger cannot create the APPLICATION_COMPLETE file in 
 HDFS.
 It's because a shutdown hook of SparkSQLCLIDriver is executed after the 
 shutdown hook of org.apache.hadoop.fs.FileSystem is executed.
 When spark.eventLog.enabled is true, the SparkSQLCLIDriver hook finally 
 tries to create a file to mark the application as finished, but the FileSystem 
 hook tries to close the FileSystem.






[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored

2014-08-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092881#comment-14092881
 ] 

Thomas Graves commented on SPARK-2089:
--

Sandy, just wondering if you have any ETA on a fix for this?

 With YARN, preferredNodeLocalityData isn't honored 
 ---

 Key: SPARK-2089
 URL: https://issues.apache.org/jira/browse/SPARK-2089
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical

 When running in YARN cluster mode, apps can pass preferred locality data when 
 constructing a Spark context that will dictate where to request executor 
 containers.
 This is currently broken because of a race condition.  The Spark-YARN code 
 runs the user class and waits for it to start up a SparkContext.  During its 
 initialization, the SparkContext will create a YarnClusterScheduler, which 
 notifies a monitor in the Spark-YARN code.  The Spark-YARN code then 
 immediately fetches the preferredNodeLocationData from the SparkContext and 
 uses it to start requesting containers.
 But in the SparkContext constructor that takes the preferredNodeLocationData, 
 setting preferredNodeLocationData comes after the rest of the initialization, 
 so, if the Spark-YARN code comes around quickly enough after being notified, 
 the data that's fetched is the empty, unset version.  This occurred during all 
 of my runs.






[jira] [Comment Edited] (SPARK-2963) There is no documentation about building to use HiveServer and CLI for SparkSQL

2014-08-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092889#comment-14092889
 ] 

Cheng Lian edited comment on SPARK-2963 at 8/11/14 3:31 PM:


Actually [there 
is|https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#running-the-thrift-jdbc-server],
 but the Spark CLI part is incomplete. Would you mind updating the issue title 
and description? Thanks.


was (Author: lian cheng):
Actually [there 
is|https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#running-the-thrift-jdbc-server]
 but the Spark CLI part is incomplete. Would you mind updating the issue title 
and description? Thanks.

 There is no documentation about building to use HiveServer and CLI for SparkSQL
 

 Key: SPARK-2963
 URL: https://issues.apache.org/jira/browse/SPARK-2963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Kousuke Saruta

 Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use 
 the -Phive-thriftserver option when building, but this is only implicit.
 I think we need to describe how to build it.






[jira] [Commented] (SPARK-2963) The description about building to use HiveServer and CLI is incomplete

2014-08-11 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092894#comment-14092894
 ] 

Kousuke Saruta commented on SPARK-2963:
---

I've updated this title and the GitHub one.

 The description about building to use HiveServer and CLI is incomplete
 --

 Key: SPARK-2963
 URL: https://issues.apache.org/jira/browse/SPARK-2963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Kousuke Saruta

 Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use 
 the -Phive-thriftserver option when building, but this is only implicit.
 I think we need to describe how to build it.






[jira] [Updated] (SPARK-2963) The description about building to use HiveServer and CLI is incomplete

2014-08-11 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-2963:
--

Description: Currently, if we'd like to use HiveServer or CLI for SparkSQL, 
we need to use the -Phive-thriftserver option when building, but its description is 
incomplete.  (was: Currently, if we'd like to use HiveServer or CLI for 
SparkSQL, we need to use the -Phive-thriftserver option when building, but this is 
only implicit.
I think we need to describe how to build it.)

 The description about building to use HiveServer and CLI is incomplete
 --

 Key: SPARK-2963
 URL: https://issues.apache.org/jira/browse/SPARK-2963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Kousuke Saruta

 Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use 
 the -Phive-thriftserver option when building, but its description is incomplete.






[jira] [Updated] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-11 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-1297:
--

Attachment: spark-1297-v4.txt

Patch v4 adds two profiles to examples/pom.xml :

hbase-hadoop1 (default)
hbase-hadoop2

I verified that compilation passes with either profile active.

 Upgrade HBase dependency to 0.98.0
 --

 Key: SPARK-1297
 URL: https://issues.apache.org/jira/browse/SPARK-1297
 Project: Spark
  Issue Type: Task
Reporter: Ted Yu
Priority: Minor
 Attachments: spark-1297-v2.txt, spark-1297-v4.txt


 HBase 0.94.6 was released 11 months ago.
 Upgrade HBase dependency to 0.98.0






[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092967#comment-14092967
 ] 

Sean Owen commented on SPARK-1297:
--

I think you may want to open a PR rather than post patches. Code reviews happen 
on github.com

I see what you did there by triggering one or the other profile with the 
hbase.profile property. Yeah, that may be the least disruptive way to play 
this. But don't the profiles need to select the hadoop-compat module 
appropriate for Hadoop 1 vs Hadoop 2?

 Upgrade HBase dependency to 0.98.0
 --

 Key: SPARK-1297
 URL: https://issues.apache.org/jira/browse/SPARK-1297
 Project: Spark
  Issue Type: Task
Reporter: Ted Yu
Priority: Minor
 Attachments: spark-1297-v2.txt, spark-1297-v4.txt


 HBase 0.94.6 was released 11 months ago.
 Upgrade HBase dependency to 0.98.0






[jira] [Created] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-08-11 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-2972:


 Summary: APPLICATION_COMPLETE not created in Python unless context 
explicitly stopped
 Key: SPARK-2972
 URL: https://issues.apache.org/jira/browse/SPARK-2972
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
 Environment: Cloudera 5.1, yarn master on ubuntu precise
Reporter: Shay Rojansky


If you don't explicitly stop a SparkContext at the end of a Python application 
with sc.stop(), an APPLICATION_COMPLETE file isn't created and the job doesn't 
get picked up by the history server.

This can be easily reproduced with pyspark (but affects scripts as well).

The current workaround is to wrap the entire script with a try/finally and stop 
manually.






[jira] [Created] (SPARK-2973) Add a way to show tables without executing a job

2014-08-11 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2973:
-

 Summary: Add a way to show tables without executing a job
 Key: SPARK-2973
 URL: https://issues.apache.org/jira/browse/SPARK-2973
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Aaron Davidson


Right now, sql("show tables").collect() will start a Spark job which shows up 
in the UI. There should be a way to get these results without this step.






[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-11 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092988#comment-14092988
 ] 

Ted Yu commented on SPARK-1297:
---

The HBase client doesn't need to specify a dependency on hbase-hadoop1-compat or 
hbase-hadoop2-compat.

I can open a PR once there is positive feedback on the approach - I came from a 
project where reviews mostly happen on JIRA :-)

Can someone assign this issue to me?

 Upgrade HBase dependency to 0.98.0
 --

 Key: SPARK-1297
 URL: https://issues.apache.org/jira/browse/SPARK-1297
 Project: Spark
  Issue Type: Task
Reporter: Ted Yu
Priority: Minor
 Attachments: spark-1297-v2.txt, spark-1297-v4.txt


 HBase 0.94.6 was released 11 months ago.
 Upgrade HBase dependency to 0.98.0






[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-11 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093012#comment-14093012
 ] 

Ted Yu commented on SPARK-1297:
---

https://github.com/apache/spark/pull/1893

 Upgrade HBase dependency to 0.98.0
 --

 Key: SPARK-1297
 URL: https://issues.apache.org/jira/browse/SPARK-1297
 Project: Spark
  Issue Type: Task
Reporter: Ted Yu
Priority: Minor
 Attachments: spark-1297-v2.txt, spark-1297-v4.txt


 HBase 0.94.6 was released 11 months ago.
 Upgrade HBase dependency to 0.98.0






[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093018#comment-14093018
 ] 

Apache Spark commented on SPARK-1297:
-

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/1893

 Upgrade HBase dependency to 0.98.0
 --

 Key: SPARK-1297
 URL: https://issues.apache.org/jira/browse/SPARK-1297
 Project: Spark
  Issue Type: Task
Reporter: Ted Yu
Priority: Minor
 Attachments: spark-1297-v2.txt, spark-1297-v4.txt


 HBase 0.94.6 was released 11 months ago.
 Upgrade HBase dependency to 0.98.0






[jira] [Created] (SPARK-2974) Utils.getLocalDir() may return non-existent spark.local.dir directory

2014-08-11 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-2974:
-

 Summary: Utils.getLocalDir() may return non-existent 
spark.local.dir directory
 Key: SPARK-2974
 URL: https://issues.apache.org/jira/browse/SPARK-2974
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Josh Rosen
Priority: Blocker


The patch for [SPARK-2324] modified Spark to ignore a certain number of invalid 
local directories.  Unfortunately, the {{Utils.getLocalDir()}} method returns 
the _first_ local directory from {{spark.local.dir}}, which might not exist.  
This can lead to confusing FileNotFound errors when executors attempt to fetch 
files. 

(I commented on this at 
https://github.com/apache/spark/pull/1274#issuecomment-51537965, but I'm 
opening a JIRA so we don't forget to fix it).
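
A hypothetical sketch of the fix direction (illustrative only, not Spark's actual Utils code): pick the first configured directory that actually exists rather than blindly returning the head of the list.

{code}
import java.io.{File, IOException}

def getLocalDir(localDirs: Seq[String]): String =
  localDirs.find(d => new File(d).isDirectory).getOrElse(
    // No usable directory: fail loudly instead of handing out a non-existent path.
    throw new IOException(
      s"None of the configured spark.local.dir entries exist: ${localDirs.mkString(",")}"))
{code}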






[jira] [Updated] (SPARK-2717) BasicBlockFetchIterator#next should log when it gets stuck

2014-08-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2717:
---

Priority: Major  (was: Blocker)

 BasicBlockFetchIterator#next should log when it gets stuck
 --

 Key: SPARK-2717
 URL: https://issues.apache.org/jira/browse/SPARK-2717
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Josh Rosen

 If this is stuck for a long time waiting for blocks, we should log what nodes 
 it is waiting for to help debugging. One way to do this is to call take() 
 with a timeout (e.g. 60 seconds) and when the timeout expires log a message 
 for the blocks it is still waiting for. This could all happen in a loop so 
 that the wait just restarts after the message is logged.
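
A minimal sketch of that suggestion (illustrative names, not the actual BasicBlockFetchIterator code): poll the result queue with a timeout and, whenever it expires, log the blocks still outstanding before waiting again.

{code}
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

def nextResult[T >: Null](results: LinkedBlockingQueue[T],
                          outstandingBlocks: () => Set[String],
                          timeoutSeconds: Long = 60): T = {
  var result: T = null
  while (result == null) {
    // poll() returns null when the timeout expires without a result arriving
    result = results.poll(timeoutSeconds, TimeUnit.SECONDS)
    if (result == null) {
      println(s"Still waiting for blocks: ${outstandingBlocks().mkString(", ")}")
    }
  }
  result
}
{code}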






[jira] [Updated] (SPARK-2717) BasicBlockFetchIterator#next should log when it gets stuck

2014-08-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2717:
---

Priority: Critical  (was: Major)

 BasicBlockFetchIterator#next should log when it gets stuck
 --

 Key: SPARK-2717
 URL: https://issues.apache.org/jira/browse/SPARK-2717
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Josh Rosen
Priority: Critical

 If this is stuck for a long time waiting for blocks, we should log what nodes 
 it is waiting for to help debugging. One way to do this is to call take() 
 with a timeout (e.g. 60 seconds) and when the timeout expires log a message 
 for the blocks it is still waiting for. This could all happen in a loop so 
 that the wait just restarts after the message is logged.






[jira] [Updated] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException

2014-08-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2931:
---

Target Version/s: 1.1.0

 getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
 ---

 Key: SPARK-2931
 URL: https://issues.apache.org/jira/browse/SPARK-2931
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf 
 benchmark
Reporter: Josh Rosen
Priority: Blocker
 Attachments: scala-sort-by-key.err, test.patch


 When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, 
 I get the following errors (one per task):
 {code}
 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 
 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 
 bytes)
 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered 
 executor: 
 Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036]
  with ID 0
 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1
 java.lang.ArrayIndexOutOfBoundsException: 1
   at 
 org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475)
   at 
 org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261)
   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254)
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153)
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 {code}
 This causes the job to hang.
 I can deterministically reproduce this by re-running the test, either in 
 isolation or as part of the full performance testing suite.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException

2014-08-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2931:
---

Fix Version/s: (was: 1.1.0)

 getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
 ---

 Key: SPARK-2931
 URL: https://issues.apache.org/jira/browse/SPARK-2931
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf 
 benchmark
Reporter: Josh Rosen
Priority: Blocker
 Attachments: scala-sort-by-key.err, test.patch


 When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, 
 I get the following errors (one per task):
 {code}
 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 
 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 
 bytes)
 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered 
 executor: 
 Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036]
  with ID 0
 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1
 java.lang.ArrayIndexOutOfBoundsException: 1
   at 
 org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475)
   at 
 org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261)
   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254)
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153)
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 {code}
 This causes the job to hang.
 I can deterministically reproduce this by re-running the test, either in 
 isolation or as part of the full performance testing suite.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2963) The description about building to use HiveServer and CLI is incomplete

2014-08-11 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-2963:
--

Summary: The description about building to use HiveServer and CLI is 
incomplete  (was: The description about building to use HiveServer and CLI is 
imcomplete)

 The description about building to use HiveServer and CLI is incomplete
 --

 Key: SPARK-2963
 URL: https://issues.apache.org/jira/browse/SPARK-2963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Kousuke Saruta

 Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use 
 the -Phive-thriftserver option when building, but its description is incomplete.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2976) There are too many tabs in some source files

2014-08-11 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-2976:
-

 Summary: There are too many tabs in some source files
 Key: SPARK-2976
 URL: https://issues.apache.org/jira/browse/SPARK-2976
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Priority: Minor


Currently, there are too many tabs in some source files, which does not conform to 
the coding style.

I saw that the following 3 files have tabs:

* sorttable.js
* JavaPageRank.java
* JavaKinesisWordCountASL.java



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns

2014-08-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093119#comment-14093119
 ] 

Yin Huai commented on SPARK-2890:
-

What are the semantics when you have columns with the same name?

 Spark SQL should allow SELECT with duplicated columns
 -

 Key: SPARK-2890
 URL: https://issues.apache.org/jira/browse/SPARK-2890
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jianshi Huang

 Spark reported a java.lang.IllegalArgumentException with the message:
 java.lang.IllegalArgumentException: requirement failed: Found fields with the 
 same name.
 at scala.Predef$.require(Predef.scala:233)
 at 
 org.apache.spark.sql.catalyst.types.StructType.init(dataTypes.scala:317)
 at 
 org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83)
 at 
 org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
 at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433)
 After trial and error, it seems it's caused by duplicated columns in my 
 SELECT clause.
 I made the duplication on purpose so that my code parses correctly. I think we 
 should allow users to specify duplicated columns as return values.
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2790) PySpark zip() doesn't work properly if RDDs have different serializers

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093133#comment-14093133
 ] 

Apache Spark commented on SPARK-2790:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/1894

 PySpark zip() doesn't work properly if RDDs have different serializers
 --

 Key: SPARK-2790
 URL: https://issues.apache.org/jira/browse/SPARK-2790
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0, 1.1.0
Reporter: Josh Rosen
Assignee: Davies Liu
Priority: Critical

 In PySpark, attempting to {{zip()}} two RDDs may fail if the RDDs have 
 different serializers (e.g. batched vs. unbatched), even if those RDDs have 
 the same number of partitions and same numbers of elements.  This problem 
 occurs in the MLlib Python APIs, where we might want to zip a JavaRDD of 
 LabelledPoints with a JavaRDD of batch-serialized Python objects.
 This is problematic because whether zip() succeeds or errors depends on the 
 partitioning / batching strategy, and we don't want to surface the 
 serialization details to users.
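 A hedged sketch of the symptom and one workaround sometimes used before a proper fix (the identity map() and the sample data are assumptions, not the actual patch): pushing both RDDs through map() re-serializes them with the same default batched serializer, so zip() sees matching layouts on both sides.
 {code}
 from pyspark import SparkContext

 sc = SparkContext("local", "zip-serializer-sketch")

 # Same partition count and element count, but the two RDDs may end up with
 # different serializers depending on how they were produced.
 a = sc.parallelize(range(100), 4)
 b = sc.parallelize([str(i) for i in range(100)], 4)

 # Assumed workaround, not the real fix: identity maps force both sides
 # through the default (batched) Python serializer before zipping.
 zipped = a.map(lambda x: x).zip(b.map(lambda x: x))
 print(zipped.take(5))
 {code}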



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1284) pyspark hangs after IOError on Executor

2014-08-11 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093137#comment-14093137
 ] 

Davies Liu commented on SPARK-1284:
---

[~jblomo], could you reproduce this on master or 1.1 branch?

Maybe pyspark did not hang after this error message; the take() had 
finished successfully before the error message popped up. The noisy error messages 
have been fixed in PR https://github.com/apache/spark/pull/1625 

 pyspark hangs after IOError on Executor
 ---

 Key: SPARK-1284
 URL: https://issues.apache.org/jira/browse/SPARK-1284
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Jim Blomo
Assignee: Davies Liu

 When running a reduceByKey over a cached RDD, Python fails with an exception, 
 but the failure is not detected by the task runner.  Spark and the pyspark 
 shell hang waiting for the task to finish.
 The error is:
 {code}
 PySpark worker failed with exception:
 Traceback (most recent call last):
   File /home/hadoop/spark/python/pyspark/worker.py, line 77, in main
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /home/hadoop/spark/python/pyspark/serializers.py, line 182, in 
 dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /home/hadoop/spark/python/pyspark/serializers.py, line 118, in 
 dump_stream
 self._write_with_length(obj, stream)
   File /home/hadoop/spark/python/pyspark/serializers.py, line 130, in 
 _write_with_length
 stream.write(serialized)
 IOError: [Errno 104] Connection reset by peer
 14/03/19 22:48:15 INFO scheduler.TaskSetManager: Serialized task 4.0:0 as 
 4257 bytes in 47 ms
 Traceback (most recent call last):
   File /home/hadoop/spark/python/pyspark/daemon.py, line 117, in 
 launch_worker
 worker(listen_sock)
   File /home/hadoop/spark/python/pyspark/daemon.py, line 107, in worker
 outfile.flush()
 IOError: [Errno 32] Broken pipe
 {code}
 I can reproduce the error by running take(10) on the cached RDD before 
 running reduceByKey (which looks at the whole input file).
 Affects Version 1.0.0-SNAPSHOT (4d88030486)
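 A minimal sketch of that reproduction sequence (the input path and key function are placeholders, not taken from the report):
 {code}
 from pyspark import SparkContext

 sc = SparkContext(appName="spark-1284-repro")

 # Cache the input, peek at it with take(), then run a full reduceByKey --
 # the sequence the reporter says triggers the hang.
 lines = sc.textFile("hdfs:///path/to/large/input.txt").cache()  # placeholder path
 print(lines.take(10))

 counts = (lines.map(lambda line: (line.split("\t")[0], 1))  # placeholder key
                .reduceByKey(lambda a, b: a + b))
 print(counts.count())
 {code}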



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2700) Hidden files (such as .impala_insert_staging) should be filtered out by sqlContext.parquetFile

2014-08-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093150#comment-14093150
 ] 

Yin Huai commented on SPARK-2700:
-

Can we resolve it?

 Hidden files (such as .impala_insert_staging) should be filtered out by 
 sqlContext.parquetFile
 --

 Key: SPARK-2700
 URL: https://issues.apache.org/jira/browse/SPARK-2700
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.0.1
Reporter: Teng Qiu
 Fix For: 1.1.0


 When creating a table in Impala, a hidden folder .impala_insert_staging is 
 created inside the table's folder.
 If we want to load such a table using the Spark SQL API sqlContext.parquetFile, 
 this hidden folder causes trouble: Spark tries to read metadata from it, and 
 you will see the exception:
 {code:borderStyle=solid}
 Caused by: java.io.IOException: Could not read footer for file 
 FileStatus{path=hdfs://xxx:8020/user/hive/warehouse/parquet_strings/.impala_insert_staging;
  isDirectory=true; modification_time=1406333729252; access_time=0; 
 owner=hdfs; group=hdfs; permission=rwxr-xr-x; isSymlink=false}
 ...
 ...
 Caused by: 
 org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is 
 not a file: /user/hive/warehouse/parquet_strings/.impala_insert_staging
 {code}
 and the Impala side does not consider this their problem: 
 https://issues.cloudera.org/browse/IMPALA-837 (IMPALA-837 Delete 
 .impala_insert_staging directory after INSERT)
 so maybe we should filter out these hidden folders/files when reading Parquet 
 tables.
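 A small sketch of the filtering rule being suggested (pure Python for illustration; the helper name and the leading-underscore rule are assumptions, and the eventual fix would live in the Scala Parquet reader rather than in user code):
 {code}
 def is_hidden(path):
     """True for paths that Parquet metadata discovery should skip, e.g.
     Impala's .impala_insert_staging folder; skipping _-prefixed files such
     as _SUCCESS is an extra assumption mirroring Hadoop conventions."""
     name = path.rstrip("/").rsplit("/", 1)[-1]
     return name.startswith(".") or name.startswith("_")

 children = [
     "/user/hive/warehouse/parquet_strings/.impala_insert_staging",
     "/user/hive/warehouse/parquet_strings/_SUCCESS",
     "/user/hive/warehouse/parquet_strings/part-00000.parquet",
 ]
 print([p for p in children if not is_hidden(p)])  # only the part file survives
 {code}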



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2948) PySpark doesn't work on Python 2.6

2014-08-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2948.
---

   Resolution: Fixed
Fix Version/s: 1.1.0

 PySpark doesn't work on Python 2.6
 --

 Key: SPARK-2948
 URL: https://issues.apache.org/jira/browse/SPARK-2948
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: CentOS 6.5 / Python 2.6.6
Reporter: Kousuke Saruta
Assignee: Josh Rosen
Priority: Blocker
 Fix For: 1.1.0


 In serializers.py, collections.namedtuple is redefined as follows.
 {code}
 def namedtuple(name, fields, verbose=False, rename=False):
     cls = _old_namedtuple(name, fields, verbose, rename)
     return _hack_namedtuple(cls)
 {code}
 The number of arguments is 4, but namedtuple for 
 Python 2.6 takes only 3, so a mismatch occurs.
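 One possible version-aware wrapper, sketched under the assumption that dispatching on the interpreter version is acceptable (not necessarily the change that was merged; _hack_namedtuple is stubbed out):
 {code}
 import collections
 import sys

 _old_namedtuple = collections.namedtuple

 def _hack_namedtuple(cls):
     # Stand-in for PySpark's real helper (which makes the class picklable).
     return cls

 def namedtuple(name, fields, verbose=False, rename=False):
     """Pass only the arguments the running interpreter supports: Python 2.6
     has no `rename`, and Python 3.7+ dropped `verbose`."""
     if sys.version_info[:2] == (2, 6):
         cls = _old_namedtuple(name, fields, verbose)
     elif sys.version_info[0] == 2:
         cls = _old_namedtuple(name, fields, verbose, rename)
     else:
         cls = _old_namedtuple(name, fields, rename=rename)
     return _hack_namedtuple(cls)

 Point = namedtuple("Point", ["x", "y"])
 print(Point(1, 2))
 {code}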



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2954) PySpark MLlib serialization tests fail on Python 2.6

2014-08-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2954.
---

   Resolution: Fixed
Fix Version/s: 1.1.0

 PySpark MLlib serialization tests fail on Python 2.6
 

 Key: SPARK-2954
 URL: https://issues.apache.org/jira/browse/SPARK-2954
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Josh Rosen
Assignee: Josh Rosen
 Fix For: 1.1.0


 The PySpark MLlib tests currently fail on Python 2.6 due to problems 
 unpacking data from bytearray using struct.unpack:
 {code}
 **
 File pyspark/mllib/_common.py, line 181, in __main__._deserialize_double
 Failed example:
 _deserialize_double(_serialize_double(1L)) == 1.0
 Exception raised:
 Traceback (most recent call last):
   File 
 /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py,
  line 1253, in __run
 compileflags, 1) in test.globs
   File doctest __main__._deserialize_double[4], line 1, in module
 _deserialize_double(_serialize_double(1L)) == 1.0
   File pyspark/mllib/_common.py, line 194, in _deserialize_double
 return struct.unpack(d, ba[offset:])[0]
 error: unpack requires a string argument of length 8
 **
 File pyspark/mllib/_common.py, line 184, in __main__._deserialize_double
 Failed example:
 _deserialize_double(_serialize_double(sys.float_info.max)) == x
 Exception raised:
 Traceback (most recent call last):
   File 
 /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py,
  line 1253, in __run
 compileflags, 1) in test.globs
   File doctest __main__._deserialize_double[6], line 1, in module
 _deserialize_double(_serialize_double(sys.float_info.max)) == x
   File pyspark/mllib/_common.py, line 194, in _deserialize_double
 return struct.unpack(d, ba[offset:])[0]
 error: unpack requires a string argument of length 8
 **
 File pyspark/mllib/_common.py, line 187, in __main__._deserialize_double
 Failed example:
 _deserialize_double(_serialize_double(sys.float_info.max)) == y
 Exception raised:
 Traceback (most recent call last):
   File 
 /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py,
  line 1253, in __run
 compileflags, 1) in test.globs
   File doctest __main__._deserialize_double[8], line 1, in module
 _deserialize_double(_serialize_double(sys.float_info.max)) == y
   File pyspark/mllib/_common.py, line 194, in _deserialize_double
 return struct.unpack(d, ba[offset:])[0]
 error: unpack requires a string argument of length 8
 **
 {code}
 It looks like one solution is to wrap the {{bytearray}} with {{buffer()}}: 
 http://stackoverflow.com/a/15467046/590203
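 A sketch of that workaround (the helper name and the "!d" format string are illustrative, not the MLlib code): wrap the bytearray slice in buffer() on Python 2.6, where struct.unpack() refuses bytearray arguments.
 {code}
 import struct
 import sys

 def deserialize_double(ba, offset=0):
     chunk = ba[offset:offset + 8]
     if sys.version_info[:2] == (2, 6):
         # Python 2.6's struct.unpack() rejects bytearray slices; buffer() is
         # a Python-2-only builtin, so this branch never runs on Python 3.
         chunk = buffer(chunk)  # noqa: F821
     return struct.unpack("!d", chunk)[0]

 ba = bytearray(struct.pack("!d", 1.0))
 print(deserialize_double(ba))  # 1.0
 {code}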



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2977) Fix handling of short shuffle manager names in ShuffleBlockManager

2014-08-11 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-2977:
-

 Summary: Fix handling of short shuffle manager names in 
ShuffleBlockManager
 Key: SPARK-2977
 URL: https://issues.apache.org/jira/browse/SPARK-2977
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Josh Rosen


Since we allow short names for {{spark.shuffle.manager}}, all code that reads 
that configuration property should be prepared to handle the short names.

See my comment at 
https://github.com/apache/spark/pull/1799#discussion_r16029607 (opening this as 
a JIRA so we don't forget to fix it).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2101) Python unit tests fail on Python 2.6 because of lack of unittest.skipIf()

2014-08-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2101.
---

   Resolution: Fixed
Fix Version/s: 1.1.0

 Python unit tests fail on Python 2.6 because of lack of unittest.skipIf()
 -

 Key: SPARK-2101
 URL: https://issues.apache.org/jira/browse/SPARK-2101
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Uri Laserson
Assignee: Josh Rosen
 Fix For: 1.1.0


 PySpark tests fail with Python 2.6 because they currently depend on 
 {{unittest.skipIf}}, which was only introduced in Python 2.7.
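 One way to sidestep the missing decorator, sketched under the assumption that simply running (rather than skipping) the guarded tests on 2.6 is acceptable; the real fix may instead restructure the tests or use the unittest2 backport.
 {code}
 import sys
 import unittest

 if hasattr(unittest, "skipIf"):
     skipIf = unittest.skipIf
 else:
     def skipIf(condition, reason):
         # Python 2.6 fallback: no skip support, so return the object unchanged.
         def decorator(obj):
             return obj
         return decorator

 class ExampleTests(unittest.TestCase):
     @skipIf(sys.version_info < (2, 7), "exercises a 2.7-only feature")
     def test_something(self):
         self.assertEqual(1 + 1, 2)

 if __name__ == "__main__":
     unittest.main()
 {code}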



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2976) There are too many tabs in some source files

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093175#comment-14093175
 ] 

Apache Spark commented on SPARK-2976:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/1895

 There are too many tabs in some source files
 

 Key: SPARK-2976
 URL: https://issues.apache.org/jira/browse/SPARK-2976
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Priority: Minor

 Currently, there are too many tabs in some source files, which does not conform 
 to the coding style.
 I saw that the following 3 files have tabs:
 * sorttable.js
 * JavaPageRank.java
 * JavaKinesisWordCountASL.java



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2420) Dependency changes for compatibility with Hive

2014-08-11 Thread Brock Noland (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brock Noland updated SPARK-2420:


Labels: Hive  (was: )

 Dependency changes for compatibility with Hive
 --

 Key: SPARK-2420
 URL: https://issues.apache.org/jira/browse/SPARK-2420
 Project: Spark
  Issue Type: Wish
  Components: Build
Affects Versions: 1.0.0
Reporter: Xuefu Zhang
  Labels: Hive
 Attachments: spark_1.0.0.patch


 During the prototyping of HIVE-7292, many library conflicts showed up because 
 the Spark build contains versions of libraries that are vastly different from the 
 current major Hadoop version. It would be nice if we could choose versions 
 that are in line with Hadoop, or shade them in the assembly. Here is the wish 
 list:
 1. Upgrade protobuf version to 2.5.0 from current 2.4.1
 2. Shading Spark's jetty and servlet dependency in the assembly.
 3. guava version difference. Spark is using a higher version. I'm not sure 
 what's the best solution for this.
 The list may grow as HIVE-7292 proceeds.
 For information only, the attached is a patch that we applied on Spark in 
 order to make Spark work with Hive. It gives an idea of the scope of changes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1284) pyspark hangs after IOError on Executor

2014-08-11 Thread Jim Blomo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093219#comment-14093219
 ] 

Jim Blomo commented on SPARK-1284:
--

I will try to reproduce on the 1.1 branch later this week, thanks for the 
update!

 pyspark hangs after IOError on Executor
 ---

 Key: SPARK-1284
 URL: https://issues.apache.org/jira/browse/SPARK-1284
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Jim Blomo
Assignee: Davies Liu

 When running a reduceByKey over a cached RDD, Python fails with an exception, 
 but the failure is not detected by the task runner.  Spark and the pyspark 
 shell hang waiting for the task to finish.
 The error is:
 {code}
 PySpark worker failed with exception:
 Traceback (most recent call last):
   File /home/hadoop/spark/python/pyspark/worker.py, line 77, in main
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /home/hadoop/spark/python/pyspark/serializers.py, line 182, in 
 dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /home/hadoop/spark/python/pyspark/serializers.py, line 118, in 
 dump_stream
 self._write_with_length(obj, stream)
   File /home/hadoop/spark/python/pyspark/serializers.py, line 130, in 
 _write_with_length
 stream.write(serialized)
 IOError: [Errno 104] Connection reset by peer
 14/03/19 22:48:15 INFO scheduler.TaskSetManager: Serialized task 4.0:0 as 
 4257 bytes in 47 ms
 Traceback (most recent call last):
   File /home/hadoop/spark/python/pyspark/daemon.py, line 117, in 
 launch_worker
 worker(listen_sock)
   File /home/hadoop/spark/python/pyspark/daemon.py, line 107, in worker
 outfile.flush()
 IOError: [Errno 32] Broken pipe
 {code}
 I can reproduce the error by running take(10) on the cached RDD before 
 running reduceByKey (which looks at the whole input file).
 Affects Version 1.0.0-SNAPSHOT (4d88030486)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2891) Daemon failed to launch worker

2014-08-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-2891.
---

   Resolution: Duplicate
Fix Version/s: 1.1.0

Duplicate of SPARK-2898.

 Daemon failed to launch worker
 --

 Key: SPARK-2891
 URL: https://issues.apache.org/jira/browse/SPARK-2891
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Davies Liu
Priority: Critical
 Fix For: 1.1.0


 daviesliu@dm:~/work/spark-perf$ /Users/daviesliu/work/spark/bin/spark-submit 
 --master spark://dm:7077 pyspark-tests/tests.py SchedulerThroughputTest 
 --num-tasks=1 --num-trials=4 --inter-trial-wait=1
 14/08/06 17:58:04 WARN JettyUtils: Failed to create UI on port 4040. Trying 
 again on port 4041. - Failure(java.net.BindException: Address already in use)
 Daemon failed to fork PySpark worker: [Errno 35] Resource temporarily 
 unavailable
 14/08/06 17:59:25 ERROR Executor: Exception in task 9777.0 in stage 1.0 (TID 
 19777)
 java.lang.IllegalStateException: Python daemon failed to launch worker
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55)
   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Daemon failed to fork PySpark worker: [Errno 35] Resource temporarily 
 unavailable
 14/08/06 17:59:25 ERROR Executor: Exception in task 9781.0 in stage 1.0 (TID 
 19781)
 java.lang.IllegalStateException: Python daemon failed to launch worker
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55)
   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 14/08/06 17:59:25 WARN TaskSetManager: Lost task 9777.0 in stage 1.0 (TID 
 19777, localhost): java.lang.IllegalStateException: Python daemon failed to 
 launch worker
 
 org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71)
 
 org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83)
 
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82)
 
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55)
 org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101)
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 

[jira] [Commented] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093295#comment-14093295
 ] 

Apache Spark commented on SPARK-2931:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/1896

 getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
 ---

 Key: SPARK-2931
 URL: https://issues.apache.org/jira/browse/SPARK-2931
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf 
 benchmark
Reporter: Josh Rosen
Priority: Blocker
 Attachments: scala-sort-by-key.err, test.patch


 When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, 
 I get the following errors (one per task):
 {code}
 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 
 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 
 bytes)
 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered 
 executor: 
 Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036]
  with ID 0
 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1
 java.lang.ArrayIndexOutOfBoundsException: 1
   at 
 org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475)
   at 
 org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261)
   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254)
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153)
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 {code}
 This causes the job to hang.
 I can deterministically reproduce this by re-running the test, either in 
 isolation or as part of the full performance testing suite.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1065) PySpark runs out of memory with large broadcast variables

2014-08-11 Thread Vlad Frolov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093413#comment-14093413
 ] 

Vlad Frolov commented on SPARK-1065:


I am facing the same issue in my project, where I use PySpark. As proof that 
the big objects I have could easily fit into the nodes' memory, I am going to 
use the dummy solution of saving my big objects to HDFS and loading them on the 
Python nodes.

Does anybody have an idea how to fix the issue in a better way? I don't have 
enough Scala or Java knowledge to fix this in Spark core. However, I feel that 
broadcast variables could be reimplemented on the Python side, though that seems 
a bit dangerous because we don't want to have separate implementations of one 
thing in both languages. It would also save memory, because when we use 
broadcasts through Scala we have 1 copy in the JVM, 1 pickled copy in Python, 
and 1 constructed object copy in Python.
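A sketch of that workaround (paths and the pickle format are placeholders; a local path is used only so the sketch runs in local mode, whereas on a cluster the file would sit in HDFS or another store every worker can read):
{code}
import pickle
from pyspark import SparkContext

sc = SparkContext("local[2]", "broadcast-workaround-sketch")

# The "big object" that would otherwise be broadcast.
big_lookup = {i: i * i for i in range(1000)}

SHARED_PATH = "/tmp/big_lookup.pickle"  # placeholder for an HDFS/shared path
with open(SHARED_PATH, "wb") as f:
    pickle.dump(big_lookup, f)

def process_partition(rows):
    # Load the object once per partition instead of shipping it as a broadcast.
    with open(SHARED_PATH, "rb") as f:
        lookup = pickle.load(f)
    for row in rows:
        yield (row, lookup.get(row))

print(sc.parallelize(range(10), 2).mapPartitions(process_partition).collect())
{code}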

 PySpark runs out of memory with large broadcast variables
 -

 Key: SPARK-1065
 URL: https://issues.apache.org/jira/browse/SPARK-1065
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.7.3, 0.8.1, 0.9.0
Reporter: Josh Rosen

 PySpark's driver components may run out of memory when broadcasting large 
 variables (say 1 gigabyte).
 Because PySpark's broadcast is implemented on top of Java Spark's broadcast 
 by broadcasting a pickled Python as a byte array, we may be retaining 
 multiple copies of the large object: a pickled copy in the JVM and a 
 deserialized copy in the Python driver.
 The problem could also be due to memory requirements during pickling.
 PySpark is also affected by broadcast variables not being garbage collected.  
 Adding an unpersist() method to broadcast variables may fix this: 
 https://github.com/apache/incubator-spark/pull/543.
 As a first step to fixing this, we should write a failing test to reproduce 
 the error.
 This was discovered by [~sandy]: [trouble with broadcast variables on 
 pyspark|http://apache-spark-user-list.1001560.n3.nabble.com/trouble-with-broadcast-variables-on-pyspark-tp1301.html].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-11 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093466#comment-14093466
 ] 

Ted Yu commented on SPARK-1297:
---

w.r.t. the build: by default, hbase-hadoop1 would be used.
If the user specifies any of the hadoop-2 profiles, hbase-hadoop2 should be 
specified as well.

 Upgrade HBase dependency to 0.98.0
 --

 Key: SPARK-1297
 URL: https://issues.apache.org/jira/browse/SPARK-1297
 Project: Spark
  Issue Type: Task
Reporter: Ted Yu
Priority: Minor
 Attachments: spark-1297-v2.txt, spark-1297-v4.txt


 HBase 0.94.6 was released 11 months ago.
 Upgrade HBase dependency to 0.98.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093468#comment-14093468
 ] 

Sean Owen commented on SPARK-1297:
--

Yes, I think you'd need to reflect that in changes to the build instructions. 
They are under docs/.

 Upgrade HBase dependency to 0.98.0
 --

 Key: SPARK-1297
 URL: https://issues.apache.org/jira/browse/SPARK-1297
 Project: Spark
  Issue Type: Task
Reporter: Ted Yu
Priority: Minor
 Attachments: spark-1297-v2.txt, spark-1297-v4.txt


 HBase 0.94.6 was released 11 months ago.
 Upgrade HBase dependency to 0.98.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2975) SPARK_LOCAL_DIRS may cause problems when running in local mode

2014-08-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2975:
--

Priority: Critical  (was: Minor)

I'm raising the priority of this issue to 'critical', since it causes problems 
when running on a cluster if some tasks are small enough to be run locally on 
the driver.

Here's an example exception:

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in 
stage 0.0 failed 1 times, most recent failure: Lost task 21.0 in stage 0.0 (TID 
21, localhost): java.io.IOException: No such file or directory
java.io.UnixFileSystem.createFileExclusively(Native Method)
java.io.File.createNewFile(File.java:1006)
java.io.File.createTempFile(File.java:1989)
org.apache.spark.util.Utils$.fetchFile(Utils.scala:335)

org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:342)

org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:340)

scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)

scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)

scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
scala.collection.mutable.HashMap.foreach(HashMap.scala:98)

scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:340)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1153)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1142)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1141)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1141)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:682)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1359)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

 SPARK_LOCAL_DIRS may cause problems when running in local mode
 --

 Key: SPARK-2975
 URL: https://issues.apache.org/jira/browse/SPARK-2975
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.1.0
Reporter: Josh Rosen
Priority: Critical

 If we're running Spark in local mode and {{SPARK_LOCAL_DIRS}} is set, the 
 {{Executor}} modifies SparkConf so that this value overrides 
 {{spark.local.dir}}.  Normally, this is safe because the modification takes 
 place before SparkEnv is created.  In local mode, the Executor uses an 
 existing SparkEnv rather than creating a new one, so it winds up with a 
 DiskBlockManager that created local directories with the original 
 {{spark.local.dir}} setting, but other components attempt to use directories 
 specified in the _new_ {{spark.local.dir}}, 

[jira] [Created] (SPARK-2978) Provide an MR-style shuffle transformation

2014-08-11 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-2978:
-

 Summary: Provide an MR-style shuffle transformation
 Key: SPARK-2978
 URL: https://issues.apache.org/jira/browse/SPARK-2978
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Sandy Ryza


For Hive on Spark in particular, and running legacy MR code in general, I think 
it would be useful to provide an MR-style shuffle transformation, i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple of ways that could make sense to expose this:
* Add a new operator: groupAndSortByKey, groupByKeyAndSortWithinPartition, 
hadoopStyleShuffle
* Allow groupByKey to take an ordering param for keys within a partition (a 
rough sketch of the intended semantics follows below)
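
A rough sketch of the intended semantics built from existing primitives (the function name and this particular composition are assumptions for illustration, not the proposed API; a real implementation would sort with spilling rather than in memory):
{code}
from itertools import groupby
from operator import itemgetter
from pyspark import SparkContext

sc = SparkContext("local[2]", "mr-style-shuffle-sketch")

def group_and_sort_by_key(rdd, num_partitions):
    """Hash-partition by key, sort each partition by key, then emit
    (key, values) groups in key order within every partition."""
    def sort_and_group(partition):
        rows = sorted(partition, key=itemgetter(0))  # in-memory sort, for the sketch only
        for key, pairs in groupby(rows, key=itemgetter(0)):
            yield key, [value for _, value in pairs]
    return (rdd.partitionBy(num_partitions)
               .mapPartitions(sort_and_group, preservesPartitioning=True))

pairs = sc.parallelize([("b", 2), ("a", 1), ("b", 3), ("a", 4)], 2)
print(group_and_sort_by_key(pairs, 2).collect())
{code}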



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2978) Provide an MR-style shuffle transformation

2014-08-11 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-2978:
--

Description: 
For Hive on Spark joins in particular, and for running legacy MR code in 
general, I think it would be useful to provide an MR-style shuffle 
transformation, i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple ways that could make sense to expose this:
* Add a new operator.  groupAndSortByKey, groupByKeyAndSortWithinPartition, 
hadoopStyleShuffle
* Allow groupByKey to take an ordering param for keys within a partition

  was:
For Hive on Spark in particular, and running legacy MR code in general, I think 
it would be useful to provide an MR-style shuffle transformation, i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple ways that could make sense to expose this:
* Add a new operator.  groupAndSortByKey, groupByKeyAndSortWithinPartition, 
hadoopStyleShuffle
* Allow groupByKey to take an ordering param for keys within a partition


 Provide an MR-style shuffle transformation
 --

 Key: SPARK-2978
 URL: https://issues.apache.org/jira/browse/SPARK-2978
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Sandy Ryza

 For Hive on Spark joins in particular, and for running legacy MR code in 
 general, I think it would be useful to provide an MR-style shuffle 
 transformation, i.e. one that
 * groups by key: provides (Key, Iterator[Value])
 * within each partition, provides keys in sorted order
 A couple ways that could make sense to expose this:
 * Add a new operator.  groupAndSortByKey, 
 groupByKeyAndSortWithinPartition, hadoopStyleShuffle
 * Allow groupByKey to take an ordering param for keys within a partition



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2978) Provide an MR-style shuffle transformation

2014-08-11 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-2978:
--

Description: 
For Hive on Spark joins in particular, and for running legacy MR code in 
general, I think it would be useful to provide a transformation with the 
semantics of the Hadoop MR shuffle, i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple ways that could make sense to expose this:
* Add a new operator.  groupAndSortByKey, groupByKeyAndSortWithinPartition, 
hadoopStyleShuffle, maybe?
* Allow groupByKey to take an ordering param for keys within a partition

  was:
For Hive on Spark joins in particular, and for running legacy MR code in 
general, I think it would be useful to provide a transformation with the 
semantics of the Hadoop MR shuffle, i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple ways that could make sense to expose this:
* Add a new operator.  groupAndSortByKey, groupByKeyAndSortWithinPartition, 
hadoopStyleShuffle
* Allow groupByKey to take an ordering param for keys within a partition


 Provide an MR-style shuffle transformation
 --

 Key: SPARK-2978
 URL: https://issues.apache.org/jira/browse/SPARK-2978
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Sandy Ryza

 For Hive on Spark joins in particular, and for running legacy MR code in 
 general, I think it would be useful to provide a transformation with the 
 semantics of the Hadoop MR shuffle, i.e. one that
 * groups by key: provides (Key, Iterator[Value])
 * within each partition, provides keys in sorted order
 A couple ways that could make sense to expose this:
 * Add a new operator.  groupAndSortByKey, 
 groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe?
 * Allow groupByKey to take an ordering param for keys within a partition



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2978) Provide an MR-style shuffle transformation

2014-08-11 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-2978:
--

Description: 
For Hive on Spark joins in particular, and for running legacy MR code in 
general, I think it would be useful to provide a transformation with the 
semantics of the Hadoop MR shuffle, i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple ways that could make sense to expose this:
* Add a new operator.  groupAndSortByKey, groupByKeyAndSortWithinPartition, 
hadoopStyleShuffle
* Allow groupByKey to take an ordering param for keys within a partition

  was:
For Hive on Spark joins in particular, and for running legacy MR code in 
general, I think it would be useful to provide an MR-style shuffle 
transformation, i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple ways that could make sense to expose this:
* Add a new operator.  groupAndSortByKey, groupByKeyAndSortWithinPartition, 
hadoopStyleShuffle
* Allow groupByKey to take an ordering param for keys within a partition


 Provide an MR-style shuffle transformation
 --

 Key: SPARK-2978
 URL: https://issues.apache.org/jira/browse/SPARK-2978
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Sandy Ryza

 For Hive on Spark joins in particular, and for running legacy MR code in 
 general, I think it would be useful to provide a transformation with the 
 semantics of the Hadoop MR shuffle, i.e. one that
 * groups by key: provides (Key, Iterator[Value])
 * within each partition, provides keys in sorted order
 A couple ways that could make sense to expose this:
 * Add a new operator.  groupAndSortByKey, 
 groupByKeyAndSortWithinPartition, hadoopStyleShuffle
 * Allow groupByKey to take an ordering param for keys within a partition



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2979) Improve the convergence rate by minimize the condition number in LOR with LBFGS

2014-08-11 Thread DB Tsai (JIRA)
DB Tsai created SPARK-2979:
--

 Summary: Improve the convergence rate by minimize the condition 
number in LOR with LBFGS
 Key: SPARK-2979
 URL: https://issues.apache.org/jira/browse/SPARK-2979
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: DB Tsai


Scaling to minimize the condition number:

During the optimization process, the convergence (rate) depends on the 
condition number of the training dataset. Scaling the variables often reduces 
this condition number, thus improving the convergence rate dramatically. Without 
reducing the condition number, some training datasets mixing the columns with 
different scales may not be able to converge.
 
GLMNET and LIBSVM packages perform the scaling to reduce the condition number, 
and return the weights in the original scale.

See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
 
Here, if useFeatureScaling is enabled, we will standardize the training 
features by dividing each column by its variance (without subtracting the 
mean), and train the model in the scaled space. Then we transform the 
coefficients from the scaled space to the original scale as GLMNET and LIBSVM 
do.
   
Currently, it's only enabled in LogisticRegressionWithLBFGS
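
A plain-NumPy sketch of the idea (the use of the standard deviation, the least-squares stand-in for LOR with LBFGS, and all names are assumptions for illustration, not MLlib code): scale each feature column, fit in the scaled space, then map the weights back.
{code}
import numpy as np

rng = np.random.default_rng(0)

# Two features on very different scales -> badly conditioned design matrix.
n = 200
X = np.column_stack([rng.normal(0.0, 1.0, n),       # O(1) feature
                     rng.normal(0.0, 1000.0, n)])   # O(1000) feature
true_w = np.array([2.0, 0.003])
y = X @ true_w + rng.normal(0.0, 0.1, n)

# Scale each column by its spread (no mean subtraction, as in the description).
scale = X.std(axis=0)
X_scaled = X / scale
print("condition number before:", np.linalg.cond(X))
print("condition number after: ", np.linalg.cond(X_scaled))

# Fit in the scaled space, then transform the weights back to the original scale.
w_scaled, *_ = np.linalg.lstsq(X_scaled, y, rcond=None)
print("recovered weights:", w_scaled / scale)
{code}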




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2979) Improve the convergence rate by minimize the condition number in LOR with LBFGS

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093604#comment-14093604
 ] 

Apache Spark commented on SPARK-2979:
-

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/1897

 Improve the convergence rate by minimize the condition number in LOR with 
 LBFGS
 ---

 Key: SPARK-2979
 URL: https://issues.apache.org/jira/browse/SPARK-2979
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: DB Tsai

 Scaling to minimize the condition number:
 
 During the optimization process, the convergence (rate) depends on the 
 condition number of the training dataset. Scaling the variables often reduces 
 this condition number, thus improving the convergence rate dramatically. 
 Without reducing the condition number, some training datasets mixing the 
 columns with different scales may not be able to converge.
  
 GLMNET and LIBSVM packages perform the scaling to reduce the condition 
 number, and return the weights in the original scale.
 See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
  
 Here, if useFeatureScaling is enabled, we will standardize the training 
 features by dividing the variance of each column (without subtracting the 
 mean), and train the model in the scaled space. Then we transform the 
 coefficients from the scaled space to the original scale as GLMNET and LIBSVM 
 do.

 Currently, it's only enabled in LogisticRegressionWithLBFGS



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2979) Improve the convergence rate by minimizing the condition number in LOR with LBFGS

2014-08-11 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-2979:
---

Summary: Improve the convergence rate by minimizing the condition number in 
LOR with LBFGS  (was: Improve the convergence rate by minimize the condition 
number in LOR with LBFGS)

 Improve the convergence rate by minimizing the condition number in LOR with 
 LBFGS
 -

 Key: SPARK-2979
 URL: https://issues.apache.org/jira/browse/SPARK-2979
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: DB Tsai

 Scaling to minimize the condition number:
 
 During the optimization process, the convergence (rate) depends on the 
 condition number of the training dataset. Scaling the variables often reduces 
 this condition number, thus improving the convergence rate dramatically. 
 Without reducing the condition number, some training datasets mixing the 
 columns with different scales may not be able to converge.
  
 GLMNET and LIBSVM packages perform the scaling to reduce the condition 
 number, and return the weights in the original scale.
 See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
  
 Here, if useFeatureScaling is enabled, we will standardize the training 
 features by dividing the variance of each column (without subtracting the 
 mean), and train the model in the scaled space. Then we transform the 
 coefficients from the scaled space to the original scale as GLMNET and LIBSVM 
 do.

 Currently, it's only enabled in LogisticRegressionWithLBFGS



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2515) Hypothesis testing

2014-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2515:
-

Fix Version/s: 1.1.0

 Hypothesis testing
 --

 Key: SPARK-2515
 URL: https://issues.apache.org/jira/browse/SPARK-2515
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Doris Xin
 Fix For: 1.1.0


 Support common statistical tests in Spark MLlib.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2515) Chi-squared test

2014-08-11 Thread Doris Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doris Xin updated SPARK-2515:
-

Summary: Chi-squared test  (was: Hypothesis testing)

 Chi-squared test
 

 Key: SPARK-2515
 URL: https://issues.apache.org/jira/browse/SPARK-2515
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Doris Xin
 Fix For: 1.1.0


 Support common statistical tests in Spark MLlib.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2515) Chi-squared test

2014-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-2515.


  Resolution: Implemented
Target Version/s: 1.1.0

 Chi-squared test
 

 Key: SPARK-2515
 URL: https://issues.apache.org/jira/browse/SPARK-2515
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Doris Xin
 Fix For: 1.1.0


 Support common statistical tests in Spark MLlib.
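
Since this sub-task was closed as implemented, here is a hedged usage sketch of the chi-squared test, written as one would type it into spark-shell. It assumes the API landed as {{org.apache.spark.mllib.stat.Statistics.chiSqTest}} in MLlib 1.1; the observed counts are made up.

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Pearson goodness-of-fit test of observed counts against a uniform
// expected distribution (the single-argument form).
val observed = Vectors.dense(40.0, 56.0, 64.0)
val result = Statistics.chiSqTest(observed)
println(s"statistic=${result.statistic}, df=${result.degreesOfFreedom}, p-value=${result.pValue}")
{code}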



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2980) Python support for chi-squared test

2014-08-11 Thread Doris Xin (JIRA)
Doris Xin created SPARK-2980:


 Summary: Python support for chi-squared test
 Key: SPARK-2980
 URL: https://issues.apache.org/jira/browse/SPARK-2980
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Doris Xin






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2980) Python support for chi-squared test

2014-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2980:
-

Assignee: (was: Doris Xin)

 Python support for chi-squared test
 ---

 Key: SPARK-2980
 URL: https://issues.apache.org/jira/browse/SPARK-2980
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Doris Xin





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2934) Adding LogisticRegressionWithLBFGS for training with LBFGS Optimizer

2014-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2934:
-

Assignee: DB Tsai

 Adding LogisticRegressionWithLBFGS for training with LBFGS Optimizer  
 --

 Key: SPARK-2934
 URL: https://issues.apache.org/jira/browse/SPARK-2934
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: DB Tsai
Assignee: DB Tsai





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2844) Existing JVM Hive Context not correctly used in Python Hive Context

2014-08-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2844.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

 Existing JVM Hive Context not correctly used in Python Hive Context
 ---

 Key: SPARK-2844
 URL: https://issues.apache.org/jira/browse/SPARK-2844
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Reporter: Ahir Reddy
Assignee: Ahir Reddy
 Fix For: 1.1.0


 Unlike the SQLContext, passing an existing JVM HiveContext object into the 
 Python HiveContext constructor does not actually re-use that object. Instead, 
 it creates a new HiveContext.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2590) Add config property to disable incremental collection used in Thrift server

2014-08-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2590.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

 Add config property to disable incremental collection used in Thrift server
 ---

 Key: SPARK-2590
 URL: https://issues.apache.org/jira/browse/SPARK-2590
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.1.0


 {{SparkSQLOperationManager}} uses {{RDD.toLocalIterator}} to collect the 
 result set one partition at a time. This is useful for avoiding OOM when the 
 result is large, but it introduces extra job scheduling cost because each 
 partition is collected with a separate job. Users may want to disable this 
 when the result set is expected to be small.
 *UPDATE*: Incremental collection hurts performance because the tasks of the 
 last stage of the RDD DAG generated from the SQL query plan are executed 
 sequentially. We therefore decided to disable it by default.
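
A minimal sketch of the two collection strategies being toggled, against the public RDD API (spark-shell / script style). The method name and the boolean flag are illustrative; the actual configuration property added by the patch is not shown here.

{code}
import org.apache.spark.rdd.RDD

// Incremental collection: one job per partition, bounded driver memory,
// but the final stage's tasks effectively run one after another.
// Non-incremental: a single collect() job, faster for small result sets,
// but the whole result must fit in driver memory.
def resultIterator[T](rdd: RDD[T], incrementalCollect: Boolean): Iterator[T] =
  if (incrementalCollect) rdd.toLocalIterator
  else rdd.collect().iterator
{code}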



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2965) Fix HashOuterJoin output nullabilities.

2014-08-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2965.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Takuya Ueshin

 Fix HashOuterJoin output nullabilities.
 ---

 Key: SPARK-2965
 URL: https://issues.apache.org/jira/browse/SPARK-2965
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin
 Fix For: 1.1.0


 Output attributes of the opposite side of an {{OuterJoin}} should be nullable.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2968) Fix nullabilities of Explode.

2014-08-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2968.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Takuya Ueshin

 Fix nullabilities of Explode.
 -

 Key: SPARK-2968
 URL: https://issues.apache.org/jira/browse/SPARK-2968
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin
 Fix For: 1.1.0


 Output nullabilities of {{Explode}} can be determined by 
 {{ArrayType.containsNull}} or {{MapType.valueContainsNull}}.
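
For reference, a small sketch of the metadata the fix consults; in this Spark version the type constructors live under org.apache.spark.sql.catalyst.types (the same package visible in stack traces elsewhere in this digest), and the element types chosen here are arbitrary examples.

{code}
import org.apache.spark.sql.catalyst.types.{ArrayType, MapType, IntegerType, StringType}

// Whether exploded elements or map values may be null is carried on the type itself.
val arrayOfMaybeNullStrings = ArrayType(StringType, containsNull = true)
val mapWithNonNullValues    = MapType(IntegerType, StringType, valueContainsNull = false)
{code}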



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2650) Caching tables larger than memory causes OOMs

2014-08-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2650.
-

  Resolution: Fixed
   Fix Version/s: 1.1.0
Assignee: Michael Armbrust  (was: Cheng Lian)
Target Version/s: 1.1.0  (was: 1.2.0)

 Caching tables larger than memory causes OOMs
 -

 Key: SPARK-2650
 URL: https://issues.apache.org/jira/browse/SPARK-2650
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0, 1.0.1
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical
 Fix For: 1.1.0


 The logic for setting up the initial column buffers is different in Spark SQL 
 compared to Shark, and I'm seeing OOMs when caching tables that are larger 
 than available memory (where Shark was okay).
 Two suspicious things: the initialSize is always set to 0, so we always fall 
 back to the default, and the default looks like it was copied from code like 
 10 * 1024 * 1024... but in Spark SQL it's 10 * 102 * 1024.
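
For concreteness, the two constants differ by roughly a factor of ten (plain integer arithmetic in the Scala REPL):

{code}
scala> 10 * 102 * 1024   // the constant currently in Spark SQL: ~1 MB
res0: Int = 1044480

scala> 10 * 1024 * 1024  // the constant it appears to have been copied from: 10 MB
res1: Int = 10485760
{code}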



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2981) PartitionStrategy: VertexID hash overflow

2014-08-11 Thread Larry Xiao (JIRA)
Larry Xiao created SPARK-2981:
-

 Summary: PartitionStrategy: VertexID hash overflow
 Key: SPARK-2981
 URL: https://issues.apache.org/jira/browse/SPARK-2981
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.2
Reporter: Larry Xiao


In PartitionStrategy.scala a PartitionID is calculated by multiplying the 
VertexId by a mixingPrime (1125899906842597L), casting to Int, and then taking 
the result mod numParts.

The Long overflows, and when cast to Int:

{quote}
scala> (1125899906842597L*1).toInt
res1: Int = -27

scala> (1125899906842597L*2).toInt
res2: Int = -54

scala> (1125899906842597L*3).toInt
res3: Int = -81
{quote}
Because the cast produces numbers that are multiples of 3, the partitioning is 
unusable when the number of parts is a multiple of 3.

For example, when partitioning into 6 or 9 parts:
{quote}
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0))
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0)) 

14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 
{quote}
I think the solution is to cast after the mod.
{quote}
scala> (1125899906842597L*3)
res4: Long = 3377699720527791

scala> (1125899906842597L*3) % 9
res5: Long = 3

scala> ((1125899906842597L*3) % 9).toInt
res6: Int = 3
{quote}
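
A minimal before/after sketch of the proposal, in plain Scala. The object and method names, the type aliases, and the use of math.abs are illustrative rather than the exact code in PartitionStrategy.scala.

{code}
object PartitionSketch {
  type VertexId = Long
  type PartitionID = Int

  val mixingPrime: VertexId = 1125899906842597L

  // Cast before mod: the Long product overflows Int, so the indices are
  // heavily biased (see the REPL output above) and collapse onto a few
  // partitions when numParts is a multiple of 3.
  def partitionCastThenMod(src: VertexId, numParts: Int): PartitionID =
    math.abs((src * mixingPrime).toInt) % numParts

  // Proposed fix: do the mod in Long arithmetic first, then cast.
  def partitionModThenCast(src: VertexId, numParts: Int): PartitionID =
    (math.abs(src * mixingPrime) % numParts).toInt
}
{code}

The second form keeps the index within [0, numParts) regardless of how large the Long product grows.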



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2981) PartitionStrategy: VertexID hash overflow

2014-08-11 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-2981:
--

Description: 
In PartitionStrategy.scala a PartitionID is calculated by multiplying the 
VertexId by a mixingPrime (1125899906842597L), casting to Int, and then taking 
the result mod numParts.

The Long overflows, and when cast to Int:

{quote}
scala> (1125899906842597L*1).toInt
res1: Int = -27

scala> (1125899906842597L*2).toInt
res2: Int = -54

scala> (1125899906842597L*3).toInt
res3: Int = -81
{quote}
Because the cast produces numbers that are multiples of 3, the partitioning is 
unusable when the number of parts is a multiple of 3.

For example, when partitioning into 6 or 9 parts:
{quote}
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0))
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0)) 

14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 

so the vertices are partitioned only to 0 and 3 for 6 parts, and only to 0 for 
9 parts
{quote}

I think the solution is to cast after the mod.
{quote}
scala> (1125899906842597L*3)
res4: Long = 3377699720527791

scala> (1125899906842597L*3) % 9
res5: Long = 3

scala> ((1125899906842597L*3) % 9).toInt
res6: Int = 3
{quote}

  was:
In PartitionStrategy.scala a PartitionID is calculated by multiplying the 
VertexId by a mixingPrime (1125899906842597L), casting to Int, and then taking 
the result mod numParts.

The Long overflows, and when cast to Int:

{quote}
scala> (1125899906842597L*1).toInt
res1: Int = -27

scala> (1125899906842597L*2).toInt
res2: Int = -54

scala> (1125899906842597L*3).toInt
res3: Int = -81
{quote}
Because the cast produces numbers that are multiples of 3, the partitioning is 
unusable when the number of parts is a multiple of 3.

For example, when partitioning into 6 or 9 parts:
{quote}
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0))
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0)) 

14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 
{quote}
I think the solution is to cast after the mod.
{quote}
scala> (1125899906842597L*3)
res4: Long = 3377699720527791

scala> (1125899906842597L*3) % 9
res5: Long = 3

scala> ((1125899906842597L*3) % 9).toInt
res6: Int = 3
{quote}


 PartitionStrategy: VertexID hash overflow
 -

 Key: SPARK-2981
 URL: https://issues.apache.org/jira/browse/SPARK-2981
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.2
Reporter: Larry Xiao
  Labels: newbie
   Original Estimate: 1h
  Remaining Estimate: 1h

 In PartitionStrategy.scala a PartitionID is calculated by multiplying the 
 VertexId by a mixingPrime (1125899906842597L), casting to Int, and then 
 taking the result mod numParts.
 The Long overflows, and when cast to Int:
 {quote}
 scala> (1125899906842597L*1).toInt
 res1: Int = -27
 scala> (1125899906842597L*2).toInt
 res2: Int = -54
 scala> (1125899906842597L*3).toInt
 res3: Int = -81
 {quote}
 Because the cast produces numbers that are multiples of 3, the partitioning 
 is unusable when the number of parts is a multiple of 3.
 For example, when partitioning into 6 or 9 parts:
 {quote}
 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), 
 (1,0), (2,0), (3,3832578), (4,0), (5,0))
 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), 
 (1,0), (2,0), (3,3832578), (4,0), (5,0)) 
 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), 
 (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), 
 (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 
 so the vertices are partitioned only to 0 and 3 for 6 parts, and only to 0 
 for 9 parts
 {quote}
 I think the solution is to cast after the mod.
 {quote}
 scala> (1125899906842597L*3)
 res4: Long = 3377699720527791
 scala> (1125899906842597L*3) % 9
 res5: Long = 3
 scala> ((1125899906842597L*3) % 9).toInt
 res6: Int = 3
 {quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2826) Reduce the Memory Copy for HashOuterJoin

2014-08-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2826.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

 Reduce the Memory Copy for HashOuterJoin
 

 Key: SPARK-2826
 URL: https://issues.apache.org/jira/browse/SPARK-2826
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Minor
 Fix For: 1.1.0


 This is a follow-up to https://issues.apache.org/jira/browse/SPARK-2212; the 
 previous implementation has a potential extra memory copy.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2934) Adding LogisticRegressionWithLBFGS for training with LBFGS Optimizer

2014-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2934.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1862
[https://github.com/apache/spark/pull/1862]

 Adding LogisticRegressionWithLBFGS for training with LBFGS Optimizer  
 --

 Key: SPARK-2934
 URL: https://issues.apache.org/jira/browse/SPARK-2934
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: DB Tsai
Assignee: DB Tsai
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2982) Glitch of spark streaming

2014-08-11 Thread dai zhiyuan (JIRA)
dai zhiyuan created SPARK-2982:
--

 Summary: Glitch of spark streaming
 Key: SPARK-2982
 URL: https://issues.apache.org/jira/browse/SPARK-2982
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.0
Reporter: dai zhiyuan


Spark Streaming task start times are tightly clustered. This creates network 
and CPU usage spikes ("glitches"), while the CPU and network sit idle most of 
the rest of the time, which wastes system resources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns

2014-08-11 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093746#comment-14093746
 ] 

Jianshi Huang commented on SPARK-2890:
--

My use case:

The result will be parsed into (id, type, start, end, properties) tuples. 
Properties may or may not contain any of (id, type, start, end), so it's 
easier just to list them at the end and not worry about duplicated names.

Jianshi

 Spark SQL should allow SELECT with duplicated columns
 -

 Key: SPARK-2890
 URL: https://issues.apache.org/jira/browse/SPARK-2890
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jianshi Huang

 Spark reported a java.lang.IllegalArgumentException with the message:
 java.lang.IllegalArgumentException: requirement failed: Found fields with the 
 same name.
 at scala.Predef$.require(Predef.scala:233)
 at 
 org.apache.spark.sql.catalyst.types.StructType.<init>(dataTypes.scala:317)
 at 
 org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83)
 at 
 org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
 at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433)
 After trial and error, it seems this is caused by duplicated columns in my 
 select clause.
 I made the duplication on purpose so my code parses correctly. I think we 
 should allow users to specify duplicated columns as return values.
 Jianshi
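
Until duplicated output columns are supported, a common workaround is to alias the repeated columns so the result schema has unique names. A hedged sketch follows: the context parameter, the table, and the column names are made up, and the failing form may only trigger the error on some plans (the Parquet scan path in the report above).

{code}
import org.apache.spark.sql.hive.HiveContext

def selectWithAliases(hc: HiveContext) = {
  // hc.sql("SELECT id, type, id, type FROM events")            // may fail: "Found fields with the same name."
  hc.sql("SELECT id, type, id AS id2, type AS type2 FROM events") // unique names, same underlying data
}
{code}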



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2923) Implement some basic linalg operations in MLlib

2014-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2923.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1849
[https://github.com/apache/spark/pull/1849]

 Implement some basic linalg operations in MLlib
 ---

 Key: SPARK-2923
 URL: https://issues.apache.org/jira/browse/SPARK-2923
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.1.0


 We use Breeze for linear algebra operations. Breeze operations are 
 user-friendly, but there are some concerns:
 1. They create temporary objects, e.g., `val z = a * x + b * y`.
 2. Multi-method dispatch is not used in some operators, e.g., `axpy`: if we 
 pass in a SparseVector as a generic Vector, it uses activeIterator, which is 
 slow.
 3. They call native BLAS when it is available, which might not be good for 
 level-1 methods.
 Having some basic BLAS operations implemented in MLlib can help simplify the 
 current implementation and improve performance in some cases.
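
To illustrate points 1 and 2, here is a minimal in-place axpy (y += a * x) on raw arrays that allocates no temporary vector. The names are illustrative rather than MLlib's eventual API; a sparse variant would walk only the stored indices instead of every position.

{code}
object BlasSketch {
  /** y := a * x + y, computed in place with no temporary vector. */
  def axpy(a: Double, x: Array[Double], y: Array[Double]): Unit = {
    require(x.length == y.length, "vector dimensions must match")
    var i = 0
    while (i < x.length) {
      y(i) += a * x(i)
      i += 1
    }
  }
}
{code}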



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org