[jira] [Commented] (SPARK-2546) Configuration object thread safety issue
[ https://issues.apache.org/jira/browse/SPARK-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064625#comment-14064625 ] Andrew Ash commented on SPARK-2546: ---

On the thread:

Me:
{quote}
Reynold's recent announcement of the broadcast RDD object patch may also have implications for the right path forward here. I'm not sure I fully understand the implications though: https://github.com/apache/spark/pull/1452 Once this is committed, we can also remove the JobConf broadcast in HadoopRDD.
{quote}

[~pwendell]:
{quote}
I think you are correct and a follow-up to SPARK-2521 will end up fixing this. The design of SPARK-2521 automatically broadcasts RDD data in tasks, and the approach creates a new copy of the RDD and associated data for each task. A natural follow-up to that patch is to stop handling the jobConf separately (since we will now broadcast all referents of the RDD itself) and just have it broadcasted with the RDD. I'm not sure if Reynold plans to include this in SPARK-2521 or afterwards, but it's likely we'd do that soon.
{quote}

Configuration object thread safety issue

Key: SPARK-2546
URL: https://issues.apache.org/jira/browse/SPARK-2546
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 0.9.1
Reporter: Andrew Ash

// observed in 0.9.1 but expected to exist in 1.0.1 as well

This ticket is copy-pasted from a thread on the dev@ list:

{quote}
We discovered a very interesting bug in Spark at work last week in Spark 0.9.1: the way Spark uses the Hadoop Configuration object is prone to thread safety issues. I believe it still applies in Spark 1.0.1 as well. Let me explain:

Observations
- Was running a relatively simple job (read from Avro files, do a map, do another map, write back to Avro files)
- 412 of 413 tasks completed, but the last task was hung in RUNNING state
- The 412 successful tasks completed in median time 3.4s
- The last hung task didn't finish even in 20 hours
- The executor with the hung task was responsible for 100% of one core of CPU usage
- Jstack of the executor attached (relevant thread pasted below)

Diagnosis
After doing some code spelunking, we determined the issue was concurrent use of a Configuration object for each task on an executor. In Hadoop each task runs in its own JVM, but in Spark multiple tasks can run in the same JVM, so the single-threaded access assumptions of the Configuration object no longer hold in Spark.

The specific issue is that the AvroRecordReader actually _modifies_ the JobConf it's given when it's instantiated! It adds a key for the RPC protocol engine in the process of connecting to the Hadoop FileSystem. When many tasks start at the same time (like at the start of a job), many tasks are adding this configuration item to the one Configuration object at once. Internally Configuration uses a java.util.HashMap, which isn't threadsafe. The post below is an excellent explanation of what happens when multiple threads insert into a HashMap at the same time: http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html

The gist is that you end up with a thread following a cycle of linked list nodes indefinitely. This exactly matches our observations of the 100% CPU core and also the final location in the stack trace.

So it seems the way Spark shares a Configuration object between task threads in an executor is incorrect. We need some way to prevent concurrent access to a single Configuration object.
Proposed fix
We can clone the JobConf object in HadoopRDD.getJobConf() so each task gets its own JobConf object (and thus its own Configuration object). The optimization of broadcasting the Configuration object across the cluster can remain, but on the receiving side I think it needs to be cloned for each task to allow for concurrent access. I'm not sure of the performance implications, but the comments suggest that the Configuration object is ~10KB, so I would expect a clone of the object to be relatively speedy.

Has this been observed before? Does my suggested fix make sense? I'd be happy to file a Jira ticket and continue discussion there for the right way to fix this.

Thanks!
Andrew

P.S. For others seeing this issue, our temporary workaround is to enable spark.speculation, which retries failed (or hung) tasks on other machines.

{noformat}
Executor task launch worker-6 daemon prio=10 tid=0x7f91f01fe000 nid=0x54b1 runnable [0x7f92d74f1000]
   java.lang.Thread.State: RUNNABLE
        at java.util.HashMap.transfer(HashMap.java:601)
        at java.util.HashMap.resize(HashMap.java:581)
        at java.util.HashMap.addEntry(HashMap.java:879)
        at java.util.HashMap.put(HashMap.java:505)
        at
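For illustration, a minimal Scala sketch of the clone-per-task idea proposed above. This is a hedged sketch only, not the actual HadoopRDD.getJobConf() implementation; the helper name and its call site are assumptions.
{code}
import org.apache.hadoop.mapred.JobConf

// Hypothetical helper: instead of handing every task the shared (broadcast)
// JobConf, give each task its own copy so a record reader that mutates its
// configuration cannot corrupt the underlying HashMap under concurrency.
def jobConfForTask(shared: JobConf): JobConf = {
  // The JobConf(Configuration) constructor copies the entries (~10KB),
  // so the per-task cost should be small relative to task setup.
  new JobConf(shared)
}
{code}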
[jira] [Commented] (SPARK-2521) Broadcast RDD object once per TaskSet (instead of sending it for every task)
[ https://issues.apache.org/jira/browse/SPARK-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064626#comment-14064626 ] Andrew Ash commented on SPARK-2521: --- Reynold's PR: https://github.com/apache/spark/pull/1452 Broadcast RDD object once per TaskSet (instead of sending it for every task) Key: SPARK-2521 URL: https://issues.apache.org/jira/browse/SPARK-2521 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin This can substantially reduce task size, as well as being able to support very large closures (e.g. closures that reference large variables). Once this is in, we can also remove broadcasting the Hadoop JobConf. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-2521) Broadcast RDD object once per TaskSet (instead of sending it for every task)
[ https://issues.apache.org/jira/browse/SPARK-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-2521: -- Comment: was deleted (was: Reynold's PR: https://github.com/apache/spark/pull/1452) Broadcast RDD object once per TaskSet (instead of sending it for every task) Key: SPARK-2521 URL: https://issues.apache.org/jira/browse/SPARK-2521 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin This can substantially reduce task size, as well as being able to support very large closures (e.g. closures that reference large variables). Once this is in, we can also remove broadcasting the Hadoop JobConf. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2492) KafkaReceiver minor changes to align with Kafka 0.8
[ https://issues.apache.org/jira/browse/SPARK-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064636#comment-14064636 ] Saisai Shao commented on SPARK-2492: Hi TD, I revisited Kafka's ConsoleConsumer carefully, and I have started to doubt the purpose of this tricky modification. When auto.offset.reset = smallest, the consumer offset seeks to the beginning of the partition *only* when the *current offset is out of range*. Deleting the Zookeeper metadata will *force* reading data from the beginning of the partition *immediately*. So we only need to delete the Zookeeper metadata explicitly when we want to fetch data from the beginning right away, as ConsoleConsumer does when the --from-beginning parameter is specified. I also revisited the previous PR about this part ([https://github.com/mesos/spark/pull/527]) carefully; it seems I redid the same thing as this [commit|https://github.com/Reinvigorate/spark/commit/cfa8e769a86664722f47182fa572179e8beadcb7]. I'm not sure what the original purpose of deleting the Zookeeper metadata was, regardless of whether auto.offset.reset is smallest or largest. After rethinking this part, I think this tricky hack is needless; only when we need to read data from the beginning immediately do we need to delete this metadata. So I'm not sure if you know the original purpose. Sorry for the immature thought and PR; if you think there is no need to modify it, I will close the PR. KafkaReceiver minor changes to align with Kafka 0.8 Key: SPARK-2492 URL: https://issues.apache.org/jira/browse/SPARK-2492 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Saisai Shao Assignee: Saisai Shao Priority: Minor Fix For: 1.1.0 Update to delete Zookeeper metadata when Kafka's auto.offset.reset parameter is set to smallest, which is aligned with Kafka 0.8's ConsoleConsumer. Also use Kafka's offered API instead of directly using zkClient. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2548) JavaRecoverableWordCount is missing
Xiangrui Meng created SPARK-2548: Summary: JavaRecoverableWordCount is missing Key: SPARK-2548 URL: https://issues.apache.org/jira/browse/SPARK-2548 Project: Spark Issue Type: Bug Components: Documentation, Streaming Affects Versions: 0.9.2, 1.0.1 Reporter: Xiangrui Meng JavaRecoverableWordCount was mentioned in the doc but not in the codebase. We need to rewrite the example because the code was lost during the migration from spark/spark-incubating to apache/spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2549) Functions defined inside of other functions trigger failures
Patrick Wendell created SPARK-2549: -- Summary: Functions defined inside of other functions trigger failures Key: SPARK-2549 URL: https://issues.apache.org/jira/browse/SPARK-2549 Project: Spark Issue Type: Sub-task Reporter: Patrick Wendell Assignee: Prashant Sharma If we have a function reference inside of another function, it still triggers mima failures. We should look at how that is implemented in byte code and just always exclude functions like that. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2549) Functions defined inside of other functions trigger failures
[ https://issues.apache.org/jira/browse/SPARK-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2549: --- Description: If we have a function declaration inside of another function, it still triggers mima failures. We should look at how that is implemented in byte code and just always exclude functions like that. {code} def a() = { /* Changing b() should not trigger failures, but it does. */ def b() = {} } {code} was:If we have a function reference inside of another function, it still triggers mima failures. We should look at how that is implemented in byte code and just always exclude functions like that. Functions defined inside of other functions trigger failures Key: SPARK-2549 URL: https://issues.apache.org/jira/browse/SPARK-2549 Project: Spark Issue Type: Sub-task Components: Build Reporter: Patrick Wendell Assignee: Prashant Sharma Fix For: 1.1.0 If we have a function declaration inside of another function, it still triggers mima failures. We should look at how that is implemented in byte code and just always exclude functions like that. {code} def a() = { /* Changing b() should not trigger failures, but it does. */ def b() = {} } {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2549) Functions defined inside of other functions trigger failures
[ https://issues.apache.org/jira/browse/SPARK-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2549: --- Description: If we have a function declaration inside of another function, it still triggers mima failures. We should look at how that is implemented in byte code and just always exclude functions like that. {code} def a() = { /* Changing b() should not trigger failures, but it does. */ def b() = {} } {code} I dug into the byte code for inner functions a bit more. I noticed that they tend to use `$$` before the function name. There is more information on that string sequence here: https://github.com/scala/scala/blob/2.10.x/src/reflect/scala/reflect/internal/StdNames.scala#L286 I did a cursory look and it appears that symbol is mostly (exclusively?) used for anonymous or inner functions: {code} # in RDD package classes $ ls *.class | xargs -I {} javap {} |grep \\$\\$ public final java.lang.Object org$apache$spark$rdd$PairRDDFunctions$$createZero$1(scala.reflect.ClassTag, byte[], scala.runtime.ObjectRef, scala.runtime.VolatileByteRef); public final java.lang.Object org$apache$spark$rdd$PairRDDFunctions$$createZero$2(byte[], scala.runtime.ObjectRef, scala.runtime.VolatileByteRef); public final scala.collection.Iterator org$apache$spark$rdd$PairRDDFunctions$$reducePartition$1(scala.collection.Iterator, scala.Function2); public final java.util.HashMap org$apache$spark$rdd$PairRDDFunctions$$mergeMaps$1(java.util.HashMap, java.util.HashMap, scala.Function2); ... public final class org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$1 extends scala.runtime.AbstractFunction0$mcJ$sp implements scala.Serializable { public org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$1(org.apache.spark.rdd.AsyncRDDActionsT); public final class org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$2 extends scala.runtime.AbstractFunction2$mcVIJ$sp implements scala.Serializable { public org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$2(org.apache.spark.rdd.AsyncRDDActionsT); {code} was: If we have a function declaration inside of another function, it still triggers mima failures. We should look at how that is implemented in byte code and just always exclude functions like that. {code} def a() = { /* Changing b() should not trigger failures, but it does. */ def b() = {} } {code} Functions defined inside of other functions trigger failures Key: SPARK-2549 URL: https://issues.apache.org/jira/browse/SPARK-2549 Project: Spark Issue Type: Sub-task Components: Build Reporter: Patrick Wendell Assignee: Prashant Sharma Fix For: 1.1.0 If we have a function declaration inside of another function, it still triggers mima failures. We should look at how that is implemented in byte code and just always exclude functions like that. {code} def a() = { /* Changing b() should not trigger failures, but it does. */ def b() = {} } {code} I dug into the byte code for inner functions a bit more. I noticed that they tend to use `$$` before the function name. There is more information on that string sequence here: https://github.com/scala/scala/blob/2.10.x/src/reflect/scala/reflect/internal/StdNames.scala#L286 I did a cursory look and it appears that symbol is mostly (exclusively?) 
used for anonymous or inner functions: {code} # in RDD package classes $ ls *.class | xargs -I {} javap {} |grep \\$\\$ public final java.lang.Object org$apache$spark$rdd$PairRDDFunctions$$createZero$1(scala.reflect.ClassTag, byte[], scala.runtime.ObjectRef, scala.runtime.VolatileByteRef); public final java.lang.Object org$apache$spark$rdd$PairRDDFunctions$$createZero$2(byte[], scala.runtime.ObjectRef, scala.runtime.VolatileByteRef); public final scala.collection.Iterator org$apache$spark$rdd$PairRDDFunctions$$reducePartition$1(scala.collection.Iterator, scala.Function2); public final java.util.HashMap org$apache$spark$rdd$PairRDDFunctions$$mergeMaps$1(java.util.HashMap, java.util.HashMap, scala.Function2); ... public final class org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$1 extends scala.runtime.AbstractFunction0$mcJ$sp implements scala.Serializable { public org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$1(org.apache.spark.rdd.AsyncRDDActionsT); public final class org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$2 extends scala.runtime.AbstractFunction2$mcVIJ$sp implements scala.Serializable { public org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$2(org.apache.spark.rdd.AsyncRDDActionsT); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2412) CoalescedRDD throws exception with certain pref locs
[ https://issues.apache.org/jira/browse/SPARK-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2412. Resolution: Fixed Fix Version/s: 1.1.0 1.0.2 Issue resolved by pull request 1337 [https://github.com/apache/spark/pull/1337] CoalescedRDD throws exception with certain pref locs Key: SPARK-2412 URL: https://issues.apache.org/jira/browse/SPARK-2412 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 1.0.2, 1.1.0 If the first pass of CoalescedRDD does not find the target number of locations AND the second pass finds new locations, an exception is thrown, as groupHash.get(nxt_replica).get is not valid. The fix is just to add an ArrayBuffer to groupHash for that replica if it didn't already exist. -- This message was sent by Atlassian JIRA (v6.2#6252)
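For illustration, a hedged Scala sketch of the "add an ArrayBuffer if it didn't already exist" fix described above; the names groupHash and nxt_replica come from the description, while the element type and surrounding values are assumptions.
{code}
import scala.collection.mutable

// Assumed shape: preferred location -> partitions grouped at that location.
val groupHash = mutable.Map.empty[String, mutable.ArrayBuffer[Int]]
val nxt_replica = "host-42"  // hypothetical location found only in the second pass

// groupHash.get(nxt_replica).get throws when the location was never seen in
// the first pass; creating the bucket lazily avoids the exception.
val bucket = groupHash.getOrElseUpdate(nxt_replica, mutable.ArrayBuffer.empty[Int])
bucket += 7  // hypothetical partition index
{code}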
[jira] [Resolved] (SPARK-2526) Simplify make-distribution.sh to just pass through Maven options
[ https://issues.apache.org/jira/browse/SPARK-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2526. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1445 [https://github.com/apache/spark/pull/1445] Simplify make-distribution.sh to just pass through Maven options Key: SPARK-2526 URL: https://issues.apache.org/jira/browse/SPARK-2526 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell Fix For: 1.1.0 There is some complexity in make-distribution.sh around selecting profiles. This is both annoying to maintain and also limits the number of ways that packagers can use this. For instance, it's not possible to build with separate HDFS and YARN versions, and supporting this with our current flags would get pretty complicated. We should just allow the user to pass a list of profiles directly to make-distribution.sh - the Maven build itself is already parameterized to support this. We also now have good docs explaining the use of profiles in the Maven build. All of this logic was more necessary when we used SBT for the package build, but we haven't done that for several versions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2550) Support regularization and intercept in pyspark's linear methods
Xiangrui Meng created SPARK-2550: Summary: Support regularization and intercept in pyspark's linear methods Key: SPARK-2550 URL: https://issues.apache.org/jira/browse/SPARK-2550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.0.0 Reporter: Xiangrui Meng Python API doesn't provide options to set regularization parameter and intercept in linear methods, which should be fixed in v1.1. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2551) Cleanup FilteringParquetRowInputFormat
[ https://issues.apache.org/jira/browse/SPARK-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-2551: -- Description: To workaround [PARQUET-16|https://issues.apache.org/jira/browse/PARQUET-16] and fix [SPARK-2119|https://issues.apache.org/jira/browse/SPARK-2119], we did some reflection hacks in {{FilteringParquetRowInputFormat}}. This should be cleaned up once PARQUET-16 is fixed. A PR for PARQUET-16 is [here|https://github.com/apache/incubator-parquet-mr/pull/17]. was: To workaround [PARQUET-16|https://issues.apache.org/jira/browse/PARQUET-16], we did some reflection hacks in {{FilteringParquetRowInputFormat}}. This should be cleaned up once PARQUET-16 is fixed. A PR for PARQUET-16 is [here|https://github.com/apache/incubator-parquet-mr/pull/17]. Cleanup FilteringParquetRowInputFormat -- Key: SPARK-2551 URL: https://issues.apache.org/jira/browse/SPARK-2551 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor To workaround [PARQUET-16|https://issues.apache.org/jira/browse/PARQUET-16] and fix [SPARK-2119|https://issues.apache.org/jira/browse/SPARK-2119], we did some reflection hacks in {{FilteringParquetRowInputFormat}}. This should be cleaned up once PARQUET-16 is fixed. A PR for PARQUET-16 is [here|https://github.com/apache/incubator-parquet-mr/pull/17]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2551) Cleanup FilteringParquetRowInputFormat
Cheng Lian created SPARK-2551: - Summary: Cleanup FilteringParquetRowInputFormat Key: SPARK-2551 URL: https://issues.apache.org/jira/browse/SPARK-2551 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor To workaround [PARQUET-16|https://issues.apache.org/jira/browse/PARQUET-16], we did some reflection hacks in {{FilteringParquetRowInputFormat}}. This should be cleaned up once PARQUET-16 is fixed. A PR for PARQUET-16 is [here|https://github.com/apache/incubator-parquet-mr/pull/17]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2552) Stabilize the computation of logistic function in pyspark
Xiangrui Meng created SPARK-2552: Summary: Stabilize the computation of logistic function in pyspark Key: SPARK-2552 URL: https://issues.apache.org/jira/browse/SPARK-2552 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Xiangrui Meng exp(1000) throws an error in Python. For the logistic function, we can use either 1 / (1 + exp(-x)) or 1 - 1 / (1 + exp(x)) to compute its value, choosing the form that ensures the argument of exp is always negative. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2552) Stabilize the computation of logistic function in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2552: - Description: exp(1000) throws an error in Python. For the logistic function, we can use either 1 / (1 + exp(-x)) or 1 - 1 / (1 + exp(x)) to compute its value, choosing the form that ensures the argument of exp is always negative. (was: exp(1000) throws an error in python. For logistic function, we can use either 1 / ( 1 + exp(-x) ) or 1 - 1 / (1 + exp(x) ) to compute its value which ensuring exp always takes a negative value.) Stabilize the computation of logistic function in pyspark - Key: SPARK-2552 URL: https://issues.apache.org/jira/browse/SPARK-2552 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Xiangrui Meng Labels: Starter exp(1000) throws an error in Python. For the logistic function, we can use either 1 / (1 + exp(-x)) or 1 - 1 / (1 + exp(x)) to compute its value, choosing the form that ensures the argument of exp is always negative. -- This message was sent by Atlassian JIRA (v6.2#6252)
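The fix itself belongs in PySpark's Python code; as a language-neutral illustration of the trick, here is a hedged Scala sketch that always passes a non-positive argument to exp so it can never overflow.
{code}
// Numerically stable logistic function: pick whichever of the two
// algebraically equal forms keeps the argument of exp() non-positive.
def logistic(x: Double): Double =
  if (x >= 0) {
    1.0 / (1.0 + math.exp(-x))      // exp(-x) <= 1, no overflow
  } else {
    val ex = math.exp(x)            // exp(x) < 1 when x < 0
    ex / (1.0 + ex)                 // equals 1 - 1 / (1 + exp(x))
  }
{code}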
[jira] [Updated] (SPARK-2423) Clean up SparkSubmit for readability
[ https://issues.apache.org/jira/browse/SPARK-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2423: --- Assignee: Andrew Or Clean up SparkSubmit for readability Key: SPARK-2423 URL: https://issues.apache.org/jira/browse/SPARK-2423 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.1.0 It is currently not trivial to trace through how different combinations of cluster managers (e.g. yarn) and deploy modes (e.g. cluster) are processed in SparkSubmit. We should clean up the logic a little if we want to extend it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2552) Stabilize the computation of logistic function in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2552: - Labels: Starter (was: ) Stabilize the computation of logistic function in pyspark - Key: SPARK-2552 URL: https://issues.apache.org/jira/browse/SPARK-2552 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Xiangrui Meng Labels: Starter exp(1000) throws an error in Python. For the logistic function, we can use either 1 / (1 + exp(-x)) or 1 - 1 / (1 + exp(x)) to compute its value, choosing the form that ensures the argument of exp is always negative. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2119) Reading Parquet InputSplits dominates query execution time when reading off S3
[ https://issues.apache.org/jira/browse/SPARK-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064701#comment-14064701 ] Cheng Lian commented on SPARK-2119: --- Agree. Created SPARK-2551 for removing those hacks. Reading Parquet InputSplits dominates query execution time when reading off S3 -- Key: SPARK-2119 URL: https://issues.apache.org/jira/browse/SPARK-2119 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical Fix For: 1.1.0 Here's the relevant stack trace where things are hanging: {code} at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:326) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:370) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344) at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:90) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201) {code} We should parallelize or cache or something here. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2476) Have sbt-assembly include runtime dependencies in jar
[ https://issues.apache.org/jira/browse/SPARK-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2476: --- Priority: Minor (was: Major) Have sbt-assembly include runtime dependencies in jar - Key: SPARK-2476 URL: https://issues.apache.org/jira/browse/SPARK-2476 Project: Spark Issue Type: Sub-task Components: Build Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Minor If possible, we should try to contribute the ability to include runtime-scoped dependencies in the assembly jar created with sbt-assembly. Currently in only reads compile-scoped dependencies: https://github.com/sbt/sbt-assembly/blob/master/src/main/scala/sbtassembly/Plugin.scala#L495 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2553) CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key
Sandy Ryza created SPARK-2553: - Summary: CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key Key: SPARK-2553 URL: https://issues.apache.org/jira/browse/SPARK-2553 Project: Spark Issue Type: Sub-task Affects Versions: 1.0.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2553) CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key
[ https://issues.apache.org/jira/browse/SPARK-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064709#comment-14064709 ] Sandy Ryza commented on SPARK-2553: --- https://github.com/apache/spark/pull/1461 CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key - Key: SPARK-2553 URL: https://issues.apache.org/jira/browse/SPARK-2553 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2554) CountDistinct and SumDistinct should do partial aggregation
Cheng Lian created SPARK-2554: - Summary: CountDistinct and SumDistinct should do partial aggregation Key: SPARK-2554 URL: https://issues.apache.org/jira/browse/SPARK-2554 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian {{CountDistinct}} and {{SumDistinct}} should first do a partial aggregation and return unique value sets in each partition as partial results. Shuffle IO can be greatly reduced in cases where there are only a few unique values. -- This message was sent by Atlassian JIRA (v6.2#6252)
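A hedged sketch of the idea at the RDD level (not the actual Catalyst aggregate operators): compute the set of unique values per partition first, so the shuffle carries small sets instead of all rows.
{code}
import org.apache.spark.rdd.RDD

// Partial aggregation for COUNT(DISTINCT x): each partition emits only its
// unique values; merging those sets is cheap when there are few unique values.
def countDistinct(values: RDD[Long]): Long =
  values
    .mapPartitions(iter => Iterator(iter.toSet))  // partial result per partition
    .reduce(_ ++ _)                               // merge the small sets
    .size
    .toLong
{code}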
[jira] [Updated] (SPARK-2551) Cleanup FilteringParquetRowInputFormat
[ https://issues.apache.org/jira/browse/SPARK-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-2551: -- Issue Type: Improvement (was: Bug) Cleanup FilteringParquetRowInputFormat -- Key: SPARK-2551 URL: https://issues.apache.org/jira/browse/SPARK-2551 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor To workaround [PARQUET-16|https://issues.apache.org/jira/browse/PARQUET-16] and fix [SPARK-2119|https://issues.apache.org/jira/browse/SPARK-2119], we did some reflection hacks in {{FilteringParquetRowInputFormat}}. This should be cleaned up once PARQUET-16 is fixed. A PR for PARQUET-16 is [here|https://github.com/apache/incubator-parquet-mr/pull/17]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2492) KafkaReceiver minor changes to align with Kafka 0.8
[ https://issues.apache.org/jira/browse/SPARK-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064721#comment-14064721 ] Saisai Shao commented on SPARK-2492: Hi TD, I also did some experiments on the previous code. In the previous code, the Zookeeper group metadata is cleaned whenever auto.offset.reset is set, no matter whether it is smallest or largest, which leads to two results: 1. smallest: we always read data from the beginning of the partition, no matter whether the groupid is new or old. 2. largest: we always read data from the end of the partition, no matter whether the groupid is new or old. I think the reason is that because we delete the group metadata in Zookeeper, Kafka can only rely on auto.offset.reset to position the offset. If we do not remove the Zookeeper metadata, the result becomes: 1. smallest: we read from the beginning of the partition for a new groupid, and for an old groupid the start point is the last committed offset. 2. largest: we read from the end of the partition for a new groupid, and for an old groupid the start point is the last committed offset. So in the previous code, auto.offset.reset is not a hint for out-of-range seeking; it is an immediate enforcement that the offset seek to the beginning or end of the partition, and I'm not sure what the purpose of the previous design was. I think directly seeking to the beginning or end of the partition when auto.offset.reset is set diverges from Kafka's own behavior and will lead to unwanted results when people set this parameter (because it differs from Kafka's predefined meaning). So I'd prefer to remove this code path. What are your thoughts and concerns? KafkaReceiver minor changes to align with Kafka 0.8 Key: SPARK-2492 URL: https://issues.apache.org/jira/browse/SPARK-2492 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Saisai Shao Assignee: Saisai Shao Priority: Minor Fix For: 1.1.0 Update to delete Zookeeper metadata when Kafka's auto.offset.reset parameter is set to smallest, which is aligned with Kafka 0.8's ConsoleConsumer. Also use Kafka's offered API instead of directly using zkClient. -- This message was sent by Atlassian JIRA (v6.2#6252)
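For reference, a hedged sketch of the consumer configuration under discussion, using plain Kafka 0.8 property names; the host name and group id are hypothetical. Kafka's own semantics for auto.offset.reset are what the comment describes: it is consulted only when the committed offset is out of range, not a request to seek immediately.
{code}
import java.util.Properties

val props = new Properties()
props.put("zookeeper.connect", "zk-host:2181")  // hypothetical quorum
props.put("group.id", "spark-streaming-group")  // hypothetical group id
// "smallest" -> fall back to the beginning of the partition,
// "largest"  -> fall back to the end, but only on an out-of-range offset.
props.put("auto.offset.reset", "smallest")
{code}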
[jira] [Created] (SPARK-2555) Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode.
Zhihui created SPARK-2555: - Summary: Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode. Key: SPARK-2555 URL: https://issues.apache.org/jira/browse/SPARK-2555 Project: Spark Issue Type: Improvement Components: Mesos, Spark Core Affects Versions: 1.0.0 Reporter: Zhihui -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2555) Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode.
[ https://issues.apache.org/jira/browse/SPARK-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihui updated SPARK-2555: -- Description: In SPARK-1946, Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode. Key: SPARK-2555 URL: https://issues.apache.org/jira/browse/SPARK-2555 Project: Spark Issue Type: Improvement Components: Mesos, Spark Core Affects Versions: 1.0.0 Reporter: Zhihui In SPARK-1946, -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2555) Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode.
[ https://issues.apache.org/jira/browse/SPARK-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihui updated SPARK-2555: -- Description: In SPARK-1946, the configuration spark.scheduler.minRegisteredExecutorsRatio was introduced, but it only supports Standalone and Yarn modes. This jira ticket tries to introduce the configuration to Mesos mode. was:In SPARK-1946, Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode. Key: SPARK-2555 URL: https://issues.apache.org/jira/browse/SPARK-2555 Project: Spark Issue Type: Improvement Components: Mesos, Spark Core Affects Versions: 1.0.0 Reporter: Zhihui In SPARK-1946, the configuration spark.scheduler.minRegisteredExecutorsRatio was introduced, but it only supports Standalone and Yarn modes. This jira ticket tries to introduce the configuration to Mesos mode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2555) Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode.
[ https://issues.apache.org/jira/browse/SPARK-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064743#comment-14064743 ] Zhihui commented on SPARK-2555: --- I submitted a PR: https://github.com/apache/spark/pull/1462 Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode. Key: SPARK-2555 URL: https://issues.apache.org/jira/browse/SPARK-2555 Project: Spark Issue Type: Improvement Components: Mesos, Spark Core Affects Versions: 1.0.0 Reporter: Zhihui In SPARK-1946, the configuration spark.scheduler.minRegisteredExecutorsRatio was introduced, but it only supports Standalone and Yarn modes. This jira ticket tries to introduce the configuration to Mesos mode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2555) Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode.
[ https://issues.apache.org/jira/browse/SPARK-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihui updated SPARK-2555: -- Description: In SPARK-1946, the configuration spark.scheduler.minRegisteredExecutorsRatio was introduced, but it only supports Standalone and Yarn modes. This tries to introduce the configuration to Mesos mode. was: In SPARK-1946, configuration spark.scheduler.minRegisteredExecutorsRatio was introduced, but it only support Standalone and Yarn mode. This jira ticket try to introduce the configuration to Mesos mode. Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode. Key: SPARK-2555 URL: https://issues.apache.org/jira/browse/SPARK-2555 Project: Spark Issue Type: Improvement Components: Mesos, Spark Core Affects Versions: 1.0.0 Reporter: Zhihui In SPARK-1946, the configuration spark.scheduler.minRegisteredExecutorsRatio was introduced, but it only supports Standalone and Yarn modes. This tries to introduce the configuration to Mesos mode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2556) Multiple SparkContexts can coexist in one process
YanTang Zhai created SPARK-2556: --- Summary: Multiple SparkContexts can coexist in one process Key: SPARK-2556 URL: https://issues.apache.org/jira/browse/SPARK-2556 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai Priority: Minor Multiple SparkContexts cannot currently coexist in one process, since different SparkContexts share the same global variables. These global variables and objects will be made local so that multiple SparkContexts can coexist. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2491) When an OOM is thrown,the executor does not stop properly.
[ https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-2491: --- Summary: When an OOM is thrown,the executor does not stop properly. (was: When an OOM is thrown,the executor is not properly stopped.) When an OOM is thrown,the executor does not stop properly. -- Key: SPARK-2491 URL: https://issues.apache.org/jira/browse/SPARK-2491 Project: Spark Issue Type: Bug Reporter: Guoqiang Li The executor log: {code} # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=kill %p # Executing /bin/sh -c kill 44942... 14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM 14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Connection manager future execution context-6,5,main] java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94) at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176) at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63) at org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close() java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243) at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) - 14/07/15 10:38:30 INFO Executor: Running task ID 969 14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally 14/07/15 10:38:30 INFO HadoopRDD: Input split: hdfs://10dian72.domain.test:8020/input/lbs/recommend/toona/rating/20140712/part-7:0+68016537 14/07/15 10:38:30 ERROR Executor: Exception in task ID 969 java.io.FileNotFoundException: /yarn/nm/usercache/spark/appcache/application_1404728465401_0070/spark-local-20140715103235-ffda/2e/merged_shuffle_4_85_0 (No such file or directory) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.init(FileOutputStream.java:221) at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116) at
[jira] [Commented] (SPARK-2156) When the size of serialized results for one partition is slightly smaller than 10MB (the default akka.frameSize), the execution blocks
[ https://issues.apache.org/jira/browse/SPARK-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064965#comment-14064965 ] DjvuLee commented on SPARK-2156: I see this is fixed in the Spark branch-0.9 on GitHub, but has it been updated in the Spark v0.9.1 release on the http://spark.apache.org/ site? When the size of serialized results for one partition is slightly smaller than 10MB (the default akka.frameSize), the execution blocks -- Key: SPARK-2156 URL: https://issues.apache.org/jira/browse/SPARK-2156 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1, 1.0.0 Environment: AWS EC2 1 master 2 slaves with the instance type of r3.2xlarge Reporter: Chen Jin Assignee: Xiangrui Meng Priority: Blocker Fix For: 0.9.2, 1.0.1, 1.1.0 Original Estimate: 504h Remaining Estimate: 504h I have done some experiments when the frameSize is around 10MB. 1) spark.akka.frameSize = 10: If one of the partition sizes is very close to 10MB, say 9.97MB, the execution blocks without any exception or warning. The worker finishes the task and sends the serialized result, and then throws an exception saying the Hadoop IPC client connection stops (seen after changing the logging to debug level). However, the master never receives the results and the program just hangs. But if the sizes of all partitions are less than some number between 9.96MB and 9.97MB, the program works fine. 2) spark.akka.frameSize = 9: when the partition size is just a little bit smaller than 9MB, it fails as well. This bug behavior is not exactly what SPARK-1112 is about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064996#comment-14064996 ] Ken Carlile commented on SPARK-2282: So we've just given this a try with a 32 node cluster. Without the two sysctl commands, it obviously failed, using this code in pyspark: {code} data = sc.parallelize(range(0,3000), 2000).map(lambda x: range(0,300)) data.cache() data.count() for i in range(0,20): data.count() {code} Unfortunately, with the two sysctls implemented on all nodes in the cluster, it also failed. Here's the java errors we see: {code:java} 14/07/17 10:55:37 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed; shutting down SparkContext java.net.NoRouteToHostException: Cannot assign requested address at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at java.net.Socket.init(Socket.java:425) at java.net.Socket.init(Socket.java:208) at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:404) at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387) at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:72) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:280) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:278) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.Accumulators$.add(Accumulators.scala:278) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:820) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1226) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Traceback (most recent call last): File stdin, line 1, in module File /usr/local/spark-current/python/pyspark/rdd.py, line 708, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /usr/local/spark-current/python/pyspark/rdd.py, line 699, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File 
/usr/local/spark-current/python/pyspark/rdd.py, line 619, in reduce vals = self.mapPartitions(func).collect() File /usr/local/spark-current/python/pyspark/rdd.py, line 583, in collect bytesInJava = self._jrdd.collect().iterator() File /usr/local/spark-current/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 537, in __call__ File /usr/local/spark-current/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o158.collect. : org.apache.spark.SparkException: Job 14 cancelled as part of cancellation of all jobs at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044) at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:1009) at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:499) at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499) at
[jira] [Created] (SPARK-2557) createTaskScheduler should be consistent between local and local-n-failures
Ye Xianjin created SPARK-2557: - Summary: createTaskScheduler should be consistent between local and local-n-failures Key: SPARK-2557 URL: https://issues.apache.org/jira/browse/SPARK-2557 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Ye Xianjin Priority: Minor In SparkContext.createTaskScheduler, we can use {code}local[*]{code} to estimate the number of cores on the machine. I think we should also be able to use * in the local-n-failures mode. And according to the LOCAL_N_REGEX pattern matching code, I believe the regular expression LOCAL_N_REGEX is wrong. LOCAL_N_REGEX should be {code} local\[([0-9]+|\*)\].r {code} rather than {code} local\[([0-9\*]+)\].r {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
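A hedged Scala sketch of the corrected pattern and how createTaskScheduler might match it (simplified; the real method handles many more master URL forms):
{code}
// Digits or a literal '*' as alternatives, rather than mixing '*' into the
// digit character class as the current LOCAL_N_REGEX does.
val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
// The local-n-failures form could reuse the same alternation.
val LOCAL_N_FAILURES_REGEX = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r

def threads(master: String): Option[Int] = master match {
  case LOCAL_N_REGEX(n) =>
    Some(if (n == "*") Runtime.getRuntime.availableProcessors() else n.toInt)
  case LOCAL_N_FAILURES_REGEX(n, _) =>
    Some(if (n == "*") Runtime.getRuntime.availableProcessors() else n.toInt)
  case _ => None
}
{code}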
[jira] [Commented] (SPARK-2557) createTaskScheduler should be consistent between local and local-n-failures
[ https://issues.apache.org/jira/browse/SPARK-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065001#comment-14065001 ] Ye Xianjin commented on SPARK-2557: --- I will send a pr for this. createTaskScheduler should be consistent between local and local-n-failures Key: SPARK-2557 URL: https://issues.apache.org/jira/browse/SPARK-2557 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Ye Xianjin Priority: Minor Labels: starter Original Estimate: 2h Remaining Estimate: 2h In SparkContext.createTaskScheduler, we can use {code}local[*]{code} to estimate the number of cores on the machine. I think we should also be able to use * in the local-n-failures mode. And according to the LOCAL_N_REGEX pattern matching code, I believe the regular expression LOCAL_N_REGEX is wrong. LOCAL_N_REGEX should be {code} local\[([0-9]+|\*)\].r {code} rather than {code} local\[([0-9\*]+)\].r {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2494) Hash of None is different cross machines in CPython
[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065006#comment-14065006 ] Matthew Farrellee commented on SPARK-2494: -- [~davies] will you provide an example that demonstrates the issue? Hash of None is different cross machines in CPython --- Key: SPARK-2494 URL: https://issues.apache.org/jira/browse/SPARK-2494 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.0.1 Environment: CPython 2.x Reporter: Davies Liu Priority: Blocker Labels: pyspark, shuffle Fix For: 1.0.0, 1.0.1 Original Estimate: 24h Remaining Estimate: 24h The hash of None, and also of tuples containing None, differs across machines, so the result will be wrong if None appears in the key of partitionBy(). It should use a portable hash function as the default partition function, one that generates the same hash for all the builtin immutable types, especially tuple. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065025#comment-14065025 ] Ken Carlile commented on SPARK-2282: A little more info: Nodes are running Scientific Linux 6.3 (Linux 2.6.32-279.el6.x86_64 #1 SMP Thu Jun 21 07:08:44 CDT 2012 x86_64 x86_64 x86_64 GNU/Linux). Spark is run against Python 2.7.6, Java 1.7.0.25, and Scala 2.10.3.

spark-env.sh:
{code}
#!/usr/bin/env bash
ulimit -n 65535
export SCALA_HOME=/usr/local/scala-2.10.3
export SPARK_WORKER_DIR=/scratch/spark/work
export JAVA_HOME=/usr/local/jdk1.7.0_25
export SPARK_LOG_DIR=~/.spark/logs/$JOB_ID/
export SPARK_EXECUTOR_MEMORY=100g
export SPARK_DRIVER_MEMORY=100g
export SPARK_WORKER_MEMORY=100g
export SPARK_LOCAL_DIRS=/scratch/spark/tmp
export PYSPARK_PYTHON=/usr/local/python-2.7.6/bin/python
export SPARK_SLAVES=/scratch/spark/tmp/slaves
{code}

spark-defaults.conf:
{code}
spark.akka.timeout=300
spark.storage.blockManagerHeartBeatMs=3
spark.akka.retry.wait=30
spark.akka.frameSize=1
{code}

PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a Linux server): echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse and echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle; or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
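The in-Spark alternative mentioned in the issue body is to set SO_REUSEADDR where the accumulator socket is created; a hedged sketch with the plain java.net API follows (the real code lives in PythonAccumulatorParam, and the host/port here are placeholders):
{code}
import java.net.{InetSocketAddress, Socket}

def connectToAccumulatorServer(host: String, port: Int): Socket = {
  val socket = new Socket()
  // Allow the local address to be reused while old connections sit in
  // TIME_WAIT, so rapid task completion is less likely to exhaust ports.
  socket.setReuseAddress(true)
  socket.connect(new InetSocketAddress(host, port))
  socket
}
{code}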
[jira] [Commented] (SPARK-2557) createTaskScheduler should be consistent between local and local-n-failures
[ https://issues.apache.org/jira/browse/SPARK-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065029#comment-14065029 ] Ye Xianjin commented on SPARK-2557: --- Github pr: https://github.com/apache/spark/pull/1464 createTaskScheduler should be consistent between local and local-n-failures Key: SPARK-2557 URL: https://issues.apache.org/jira/browse/SPARK-2557 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Ye Xianjin Priority: Minor Labels: starter Original Estimate: 2h Remaining Estimate: 2h In SparkContext.createTaskScheduler, we can use {code}local[*]{code} to estimate the number of cores on the machine. I think we should also be able to use * in the local-n-failures mode. And according to the LOCAL_N_REGEX pattern matching code, I believe the regular expression LOCAL_N_REGEX is wrong. LOCAL_N_REGEX should be {code} local\[([0-9]+|\*)\].r {code} rather than {code} local\[([0-9\*]+)\].r {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2523) Potential Bugs if SerDe is not the identical among partitions and table
[ https://issues.apache.org/jira/browse/SPARK-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-2523: Target Version/s: 1.1.0 Potential Bugs if SerDe is not the identical among partitions and table --- Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao In HiveTableScan.scala, ObjectInspector was created for all of the partition based records, which probably causes ClassCastException if the object inspector is not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...
[ https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065032#comment-14065032 ] Matthew Farrellee commented on SPARK-2256: -- [~angel2014] i've tried this using a local file and line lengths from 1 to 64K (by powers of 2) and have not been able to reproduce this. how frequently does this fail? are you still seeing this issue on the tip of master? pyspark: RDD.take doesn't work ... sometimes ... -- Key: SPARK-2256 URL: https://issues.apache.org/jira/browse/SPARK-2256 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: local file/remote HDFS Reporter: Ángel Álvarez Labels: RDD, pyspark, take If I try to take some lines from a file, sometimes it doesn't work Code: myfile = sc.textFile(A_ko) print myfile.take(10) Stacktrace: 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19 Traceback (most recent call last): File mytest.py, line 19, in module print myfile.take(10) File spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py, line 868, in take iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator() File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py, line 537, in __call__ File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py, line 300, in get_return_value Test data: START TEST DATA A A A
[jira] [Commented] (SPARK-2523) Potential Bugs if SerDe is not the identical among partitions and table
[ https://issues.apache.org/jira/browse/SPARK-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065035#comment-14065035 ] Yin Huai commented on SPARK-2523: - I see. Although we are using the right SerDe to deserialize a row, we are using the wrong ObjectInspector to extract fields (in attributeFunctions)... Also, creating Rows in TableReader makes sense. Will finish my review soon. Potential Bugs if SerDe is not the identical among partitions and table --- Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao In HiveTableScan.scala, ObjectInspector was created for all of the partition based records, which probably causes ClassCastException if the object inspector is not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2021) External hashing in PySpark
[ https://issues.apache.org/jira/browse/SPARK-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065040#comment-14065040 ] Matthew Farrellee commented on SPARK-2021: -- [~matei][~prashant_] what do you mean by external hashing? External hashing in PySpark --- Key: SPARK-2021 URL: https://issues.apache.org/jira/browse/SPARK-2021 Project: Spark Issue Type: Bug Components: PySpark Reporter: Matei Zaharia Assignee: Prashant Sharma -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1670) PySpark Fails to Create SparkContext Due To Debugging Options in conf/java-opts
[ https://issues.apache.org/jira/browse/SPARK-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065047#comment-14065047 ] Matthew Farrellee commented on SPARK-1670: -- SPARK-2313 is the root cause of this. a workaround for this would be complex because the extra text on stdout is coming from the same jvm that should produce the py4j port. PySpark Fails to Create SparkContext Due To Debugging Options in conf/java-opts --- Key: SPARK-1670 URL: https://issues.apache.org/jira/browse/SPARK-1670 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: pats-air:spark pat$ IPYTHON=1 bin/pyspark Python 2.7.5 (default, Aug 25 2013, 00:04:04) ... IPython 1.1.0 ... Spark version 1.0.0-SNAPSHOT Using Python version 2.7.5 (default, Aug 25 2013 00:04:04) Reporter: Pat McDonough When JVM debugging options are in conf/java-opts, it causes pyspark to fail when creating the SparkContext. The java-opts file looks like the following: {code}-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 {code} Here's the error: {code}--- ValueErrorTraceback (most recent call last) /Library/Python/2.7/site-packages/IPython/utils/py3compat.pyc in execfile(fname, *where) 202 else: 203 filename = fname -- 204 __builtin__.execfile(filename, *where) /Users/pat/Projects/spark/python/pyspark/shell.py in module() 41 SparkContext.setSystemProperty(spark.executor.uri, os.environ[SPARK_EXECUTOR_URI]) 42 --- 43 sc = SparkContext(os.environ.get(MASTER, local[*]), PySparkShell, pyFiles=add_files) 44 45 print(Welcome to /Users/pat/Projects/spark/python/pyspark/context.pyc in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway) 92 tempNamedTuple = namedtuple(Callsite, function file linenum) 93 self._callsite = tempNamedTuple(function=None, file=None, linenum=None) --- 94 SparkContext._ensure_initialized(self, gateway=gateway) 95 96 self.environment = environment or {} /Users/pat/Projects/spark/python/pyspark/context.pyc in _ensure_initialized(cls, instance, gateway) 172 with SparkContext._lock: 173 if not SparkContext._gateway: -- 174 SparkContext._gateway = gateway or launch_gateway() 175 SparkContext._jvm = SparkContext._gateway.jvm 176 SparkContext._writeToFile = SparkContext._jvm.PythonRDD.writeToFile /Users/pat/Projects/spark/python/pyspark/java_gateway.pyc in launch_gateway() 44 proc = Popen(command, stdout=PIPE, stdin=PIPE) 45 # Determine which ephemeral port the server started on: --- 46 port = int(proc.stdout.readline()) 47 # Create a thread to echo output from the GatewayServer, which is required 48 # for Java log output to show up: ValueError: invalid literal for int() with base 10: 'Listening for transport dt_socket at address: 5005\n' {code} Note that when you use JVM debugging, the very first line of output (e.g. when running spark-shell) looks like this: {code}Listening for transport dt_socket at address: 5005{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
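To make the failure mode concrete: launch_gateway() treats the very first stdout line from the launched JVM as the py4j port, so the JDWP banner breaks the int() parse. Below is a minimal sketch of one conceivable mitigation (skip non-numeric lines until a port appears); it is illustrative only, not the project's fix, and as the comment above notes, other JVM output could still interleave, which is why a real workaround is harder than this.
{code}
# Hypothetical sketch: skip non-numeric stdout lines (e.g. the JDWP
# "Listening for transport dt_socket ..." banner) until a port appears.
def read_gateway_port(stdout, max_lines=10):
    for _ in range(max_lines):
        line = stdout.readline().strip()
        try:
            return int(line)          # the py4j GatewayServer port
        except ValueError:
            continue                  # ignore debugger/JVM banners
    raise RuntimeError("could not find py4j port in JVM output")
{code}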
[jira] [Updated] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...
[ https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ángel Álvarez updated SPARK-2256: - Attachment: A_test.zip I've tried with different files and sizes ... but I can't figure out the reason why it doesn't work ... If I try with the files downloaded from https://github.com/richardbishop/PerformanceTestData ... everything works OK. pyspark: RDD.take doesn't work ... sometimes ... -- Key: SPARK-2256 URL: https://issues.apache.org/jira/browse/SPARK-2256 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: local file/remote HDFS Reporter: Ángel Álvarez Labels: RDD, pyspark, take Attachments: A_test.zip If I try to take some lines from a file, sometimes it doesn't work Code: myfile = sc.textFile(A_ko) print myfile.take(10) Stacktrace: 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19 Traceback (most recent call last): File mytest.py, line 19, in module print myfile.take(10) File spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py, line 868, in take iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator() File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py, line 537, in __call__ File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py, line 300, in get_return_value Test data: START TEST DATA A A A
[jira] [Commented] (SPARK-1662) PySpark fails if python class is used as a data container
[ https://issues.apache.org/jira/browse/SPARK-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065059#comment-14065059 ] Matthew Farrellee commented on SPARK-1662: -- [~nrchandan] and [~pwendell] - i recommend you close this as not a bug. it's not pyspark's fault that the user-defined class is not able to be pickled. you can change the Point class in the example to make it pickleable and the example program will work. see https://docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled original gist for posterity -
{code}
import pyspark

class Point(object):
    '''this class being used as container'''
    pass

def to_point_obj(point_as_dict):
    '''convert a dict representation of a point to Point object'''
    p = Point()
    p.x = point_as_dict['x']
    p.y = point_as_dict['y']
    return p

def add_two_points(point_obj1, point_obj2):
    print type(point_obj1), type(point_obj2)
    point_obj1.x += point_obj2.x
    point_obj1.y += point_obj2.y
    return point_obj1

def zero_point():
    p = Point()
    p.x = p.y = 0
    return p

sc = pyspark.SparkContext('local', 'test_app')
a = sc.parallelize([{'x':1, 'y':1}, {'x':2, 'y':2}, {'x':3, 'y':3}])
b = a.map(to_point_obj)  # convert to an RDD of Point objects
c = b.fold(zero_point(), add_two_points)
{code}
PySpark fails if python class is used as a data container - Key: SPARK-1662 URL: https://issues.apache.org/jira/browse/SPARK-1662 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: Ubuntu 14, Python 2.7.6 Reporter: Chandan Kumar Priority: Minor PySpark fails if RDD operations are performed on data encapsulated in Python objects (rare use case where plain python objects are used as data containers instead of regular dict or tuples). I have written a small piece of code to reproduce the bug: https://gist.github.com/nrchandan/11394440 -- This message was sent by Atlassian JIRA (v6.2#6252)
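One way to make the example picklable, following the pickle docs linked above: keep the container class in an importable module and ship it to the workers, so the default pickler can reconstruct instances there. The file and module names below are illustrative assumptions, not part of the original gist.
{code}
# point_container.py -- hypothetical module shipped to the workers
class Point(object):
    """Plain container; picklable because it lives in an importable module."""
    def __init__(self, x=0, y=0):
        self.x = x
        self.y = y

# driver script
import pyspark
from point_container import Point

sc = pyspark.SparkContext('local', 'test_app', pyFiles=['point_container.py'])
a = sc.parallelize([{'x': 1, 'y': 1}, {'x': 2, 'y': 2}, {'x': 3, 'y': 3}])
b = a.map(lambda d: Point(d['x'], d['y']))
c = b.fold(Point(), lambda p1, p2: Point(p1.x + p2.x, p1.y + p2.y))
print c.x, c.y   # 6 6
{code}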
[jira] [Commented] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...
[ https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065066#comment-14065066 ] Matthew Farrellee commented on SPARK-2256: -- are you using a local master, mesos, yarn? for me - {code} ./dist/bin/pyspark ... [repeat this a bunch, w/ while True] sc.textFile(A_ko).take(10) sc.textFile(A_ko).take(50) ... {code} and i cannot reproduce pyspark: RDD.take doesn't work ... sometimes ... -- Key: SPARK-2256 URL: https://issues.apache.org/jira/browse/SPARK-2256 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: local file/remote HDFS Reporter: Ángel Álvarez Labels: RDD, pyspark, take Attachments: A_test.zip If I try to take some lines from a file, sometimes it doesn't work Code: myfile = sc.textFile(A_ko) print myfile.take(10) Stacktrace: 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19 Traceback (most recent call last): File mytest.py, line 19, in module print myfile.take(10) File spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py, line 868, in take iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator() File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py, line 537, in __call__ File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py, line 300, in get_return_value Test data: START TEST DATA A A A
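For anyone else trying to reproduce, a concrete version of the loop described above; A_ko refers to the attached test data, and the path is an assumption.
{code}
# Repeatedly exercise take() on the attached test file to look for
# the intermittent failure reported in this ticket.
from pyspark import SparkContext

sc = SparkContext("local", "take-repro")
n = 0
while True:
    lines = sc.textFile("A_ko").take(10)   # fails intermittently per the report
    n += 1
    print n, len(lines)
{code}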
[jira] [Created] (SPARK-2558) Mention --queue argument in YARN documentation
Matei Zaharia created SPARK-2558: Summary: Mention --queue argument in YARN documentation Key: SPARK-2558 URL: https://issues.apache.org/jira/browse/SPARK-2558 Project: Spark Issue Type: Documentation Components: YARN Reporter: Matei Zaharia Priority: Trivial The docs about it went away when we updated the page to spark-submit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2558) Mention --queue argument in YARN documentation
[ https://issues.apache.org/jira/browse/SPARK-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2558: - Labels: Starter (was: ) Mention --queue argument in YARN documentation --- Key: SPARK-2558 URL: https://issues.apache.org/jira/browse/SPARK-2558 Project: Spark Issue Type: Documentation Components: YARN Reporter: Matei Zaharia Priority: Trivial Labels: Starter The docs about it went away when we updated the page to spark-submit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065121#comment-14065121 ] Aaron Davidson commented on SPARK-2282: --- This problem does look identical. I think I gave you the wrong netstat command, as -l only show listening sockets. Try with -a instead to see all open connections to confirm this, but the rest of your symptoms align perfectly. I did a little Googling around for your specific kernel version, and it turns out [someone else|http://lists.openwall.net/netdev/2011/07/13/39] has had success with tcp_tw_recycle on 2.6.32. Could you try to make absolutely sure that the sysctl is taking effect? Perhaps you can try adding net.ipv4.tcp_tw_recycle = 1 to /etc/sysctl.conf and then running a sysctl -p before restarting pyspark. PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a linux server); echo 1 /proc/sys/net/ipv4/tcp_tw_reuse echo 1 /proc/sys/net/ipv4/tcp_tw_recycle or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2083) Allow local task to retry after failure.
[ https://issues.apache.org/jira/browse/SPARK-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065143#comment-14065143 ] Bill Havanki commented on SPARK-2083: - Pull request available: https://github.com/apache/spark/pull/1465 (Please feel free to assign this ticket to me - I don't have that permission.) Allow local task to retry after failure. Key: SPARK-2083 URL: https://issues.apache.org/jira/browse/SPARK-2083 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.0.0 Reporter: Peng Cheng Priority: Trivial Labels: easyfix Original Estimate: 1h Remaining Estimate: 1h If a job is submitted to run locally using masterURL = local[X], Spark will not retry a failed task regardless of your spark.task.maxFailures setting. This design facilitates debugging and QA of Spark applications where all tasks are expected to succeed and yield results. Unfortunately, such a setting prevents a local job from finishing if any of its tasks cannot guarantee a result (e.g. one visiting an external resource/API), and retrying inside the task is less favoured (e.g. in production the task needs to be executed on a different machine). Users can still set masterURL = local[X,Y] to override this (where Y is the local maxFailures), but it is not documented and hard to manage. A quick fix would be to add a new configuration property, spark.local.maxFailures, with a default value of 1, so users know exactly what to change when reading the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252)
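For reference, a small sketch of the undocumented override mentioned above. The proposed spark.local.maxFailures property does not exist yet; only the local[X,Y] master string is shown.
{code}
from pyspark import SparkContext

# "local[4,3]": 4 worker threads, with task retries enabled in local mode
# (3 being the local maxFailures described in the ticket above).
sc = SparkContext("local[4,3]", "local-retry-demo")
{code}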
[jira] [Created] (SPARK-2559) Add A Link to Download the Application Events Log for Offline Analysis
Pat McDonough created SPARK-2559: Summary: Add A Link to Download the Application Events Log for Offline Analysis Key: SPARK-2559 URL: https://issues.apache.org/jira/browse/SPARK-2559 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Pat McDonough To analyze application issues offline (e.g. on another machine while supporting an end user), provide end users a link to download an archive of the application event logs. The archive can then be opened via an offline History Server. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...
[ https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065230#comment-14065230 ] Ángel Álvarez commented on SPARK-2256: -- I've tried using local and master spark in standalone mode. pyspark: RDD.take doesn't work ... sometimes ... -- Key: SPARK-2256 URL: https://issues.apache.org/jira/browse/SPARK-2256 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: local file/remote HDFS Reporter: Ángel Álvarez Labels: RDD, pyspark, take Attachments: A_test.zip If I try to take some lines from a file, sometimes it doesn't work Code: myfile = sc.textFile(A_ko) print myfile.take(10) Stacktrace: 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19 Traceback (most recent call last): File mytest.py, line 19, in module print myfile.take(10) File spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py, line 868, in take iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator() File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py, line 537, in __call__ File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py, line 300, in get_return_value Test data: START TEST DATA A A A
[jira] [Commented] (SPARK-2494) Hash of None is different cross machines in CPython
[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065240#comment-14065240 ] Davies Liu commented on SPARK-2494: --- This bug only happens in cluster mode, so it cannot be reproduced in unit tests. In cluster mode (workers on different machines), it will happen: rdd = sc.parallelize([(None, 1), (None, 2)], 2) rdd.groupByKey(2).collect() ((None, [1]), (None, [2])) The same key `None` will be put into different partitions and cannot be aggregated together. Hash of None is different cross machines in CPython --- Key: SPARK-2494 URL: https://issues.apache.org/jira/browse/SPARK-2494 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.0.1 Environment: CPython 2.x Reporter: Davies Liu Priority: Blocker Labels: pyspark, shuffle Fix For: 1.0.0, 1.0.1 Original Estimate: 24h Remaining Estimate: 24h The hash of None, also tuple with None in it, is different cross machines, so the result will be wrong if None appear in the key of partitionBy(). It should use an portable hash function as the default partition function, which generate same hash for all the builtin immutable types, especially tuple. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...
[ https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065249#comment-14065249 ] Matthew Farrellee commented on SPARK-2256: -- maybe there's an issue in the platform? i'm on - {code} $ head -n1 /etc/issue Fedora release 20 (Heisenbug) $ python --version Python 2.7.5 $ java -version openjdk version 1.8.0_05 OpenJDK Runtime Environment (build 1.8.0_05-b13) OpenJDK 64-Bit Server VM (build 25.5-b02, mixed mode) {code} pyspark: RDD.take doesn't work ... sometimes ... -- Key: SPARK-2256 URL: https://issues.apache.org/jira/browse/SPARK-2256 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: local file/remote HDFS Reporter: Ángel Álvarez Labels: RDD, pyspark, take Attachments: A_test.zip If I try to take some lines from a file, sometimes it doesn't work Code: myfile = sc.textFile(A_ko) print myfile.take(10) Stacktrace: 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19 Traceback (most recent call last): File mytest.py, line 19, in module print myfile.take(10) File spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py, line 868, in take iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator() File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py, line 537, in __call__ File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py, line 300, in get_return_value Test data: START TEST DATA A A A
[jira] [Commented] (SPARK-2316) StorageStatusListener should avoid O(blocks) operations
[ https://issues.apache.org/jira/browse/SPARK-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065256#comment-14065256 ] Shivaram Venkataraman commented on SPARK-2316: -- I'd just like to add that in cases where we have many thousands of blocks, this stack trace occupies one core constantly on the Master and is probably one of the reasons why the WebUI stops functioning after a certain point. StorageStatusListener should avoid O(blocks) operations --- Key: SPARK-2316 URL: https://issues.apache.org/jira/browse/SPARK-2316 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Andrew Or In the case where jobs are frequently causing dropped blocks the storage status listener can bottleneck. This is slow for a few reasons, one being that we use Scala collection operations, the other being that we operations that are O(number of blocks). I think using a few indices here could make this much faster. {code} at java.lang.Integer.valueOf(Integer.java:642) at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70) at org.apache.spark.storage.StorageUtils$$anonfun$9.apply(StorageUtils.scala:82) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:328) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327) at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327) at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105) at org.apache.spark.storage.StorageUtils$.rddInfoFromStorageStatus(StorageUtils.scala:82) at org.apache.spark.ui.storage.StorageListener.updateRDDInfo(StorageTab.scala:56) at org.apache.spark.ui.storage.StorageListener.onTaskEnd(StorageTab.scala:67) - locked 0xa27ebe30 (a org.apache.spark.ui.storage.StorageListener) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
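To illustrate the suggested fix in a language-neutral way (a sketch of the indexing idea only, not the Scala listener code): instead of re-grouping every block by RDD id on each task-end event, which is O(blocks), keep a per-RDD aggregate that is updated incrementally as individual block statuses change.
{code}
from collections import defaultdict

# O(blocks) per event: rebuild the per-RDD totals from scratch each time.
def rebuild_rdd_totals(blocks):
    totals = defaultdict(int)
    for (rdd_id, _block_idx), mem_size in blocks.items():
        totals[rdd_id] += mem_size
    return totals

# O(1) per event: maintain the totals as an index, updated per block change.
class RddStorageIndex(object):
    def __init__(self):
        self.blocks = {}                 # (rdd_id, block_idx) -> memory size
        self.totals = defaultdict(int)   # rdd_id -> aggregated memory size

    def update_block(self, block_id, mem_size):
        rdd_id = block_id[0]
        self.totals[rdd_id] += mem_size - self.blocks.get(block_id, 0)
        if mem_size == 0:
            self.blocks.pop(block_id, None)   # block was dropped
        else:
            self.blocks[block_id] = mem_size
{code}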
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065306#comment-14065306 ] Aaron Davidson commented on SPARK-2282: --- This problem is kinda silly because we're accumulating these updates from a single thread in the DAGScheduler, so we should only really have one socket open at a time, but it's very short lived. We could just reuse the connection with a relatively minor refactor of accumulators.py and PythonAccumulatorParam. PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a linux server); echo 1 /proc/sys/net/ipv4/tcp_tw_reuse echo 1 /proc/sys/net/ipv4/tcp_tw_recycle or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
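A rough sketch of the connection-reuse idea floated above, purely illustrative and not the actual accumulators.py change: open the socket to the accumulator server once and reuse it for every update, instead of one socket per task completion (which is what piles up TIME_WAIT connections).
{code}
import socket

class AccumulatorUpdateSender(object):
    """Keeps one long-lived connection to a (hypothetical) accumulator
    server instead of opening a fresh socket per task completion."""
    def __init__(self, host, port):
        self._addr = (host, port)
        self._sock = None

    def _connection(self):
        if self._sock is None:
            self._sock = socket.create_connection(self._addr)
        return self._sock

    def send(self, payload):
        conn = self._connection()
        conn.sendall(payload)
        return conn.recv(1)   # wait for the server's 1-byte ack
{code}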
[jira] [Updated] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2447: - Assignee: Ted Malaska Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. My first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher throughput. Need to answer the following questions first:
- Can it be written in Java or should it be written in Scala?
- What is the best way to add the HBase dependency? (will review how Flume does this as the first option)
- What is the best way to do testing? (will review how Flume does this as the first option)
- How to support Python? (this may be a separate JIRA; unknown at this time)
Goals:
- Simple to use
- Stable
- Supports high load
- Documented (may be in a separate JIRA; need to ask Tdas)
- Supports Java, Scala, and hopefully Python
- Supports Streaming and normal Spark
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2494) Hash of None is different cross machines in CPython
[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065317#comment-14065317 ] Davies Liu commented on SPARK-2494: --- The tip of master already handles the hash of None, but it cannot handle the hash of a tuple with None in it. Here is the updated test case, sorry for that: rdd = sc.parallelize([((None, 1), 1)] * 100, 100) assert len(rdd.groupByKey(10).collect()) == 1 Hash of None is different cross machines in CPython --- Key: SPARK-2494 URL: https://issues.apache.org/jira/browse/SPARK-2494 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.0.1 Environment: CPython 2.x Reporter: Davies Liu Priority: Blocker Labels: pyspark, shuffle Fix For: 1.0.0, 1.0.1 Original Estimate: 24h Remaining Estimate: 24h The hash of None, also tuple with None in it, is different cross machines, so the result will be wrong if None appear in the key of partitionBy(). It should use an portable hash function as the default partition function, which generate same hash for all the builtin immutable types, especially tuple. -- This message was sent by Atlassian JIRA (v6.2#6252)
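A minimal sketch of the kind of portable hash the description asks for: handle None and tuples explicitly so the same key lands in the same partition on every worker. It assumes that hashing of the remaining builtin types is itself consistent across the cluster (true for CPython 2.x without hash randomization); the names are illustrative.
{code}
def portable_hash(x):
    """Deterministic across CPython workers for None and tuples
    (including tuples that contain None)."""
    if x is None:
        return 0
    if isinstance(x, tuple):
        h = 0x345678
        for item in x:
            h = (h * 1000003) ^ portable_hash(item)
        return h
    return hash(x)

# e.g. force all (None, 1) keys into one partition regardless of worker:
# rdd.partitionBy(10, partitionFunc=portable_hash)
{code}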
[jira] [Updated] (SPARK-2528) spark-ec2 security group permissions are too open
[ https://issues.apache.org/jira/browse/SPARK-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-2528: Description: {{spark-ec2}} configures EC2 security groups with ports [open to the world | https://github.com/apache/spark/blob/9c73822a08848a0cde545282d3eb1c3f1a4c2a82/ec2/spark_ec2.py#L280]. This is an unnecessary security risk, even for a short-lived cluster. Wherever possible, it would be better if, when launching a new cluster, {{spark-ec2}} detects the host's external IP address (e.g. via {{icanhazip.com}}) and grants access specifically to that IP address. was: {{spark-ec2}} configures EC2 security groups with ports [open to the world | https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L280]. This is an unnecessary security risk, even for a short-lived cluster. Wherever possible, it would be better if, when launching a new cluster, {{spark-ec2}} detects the host's external IP address (e.g. via {{icanhazip.com}}) and grants access specifically to that IP address. spark-ec2 security group permissions are too open - Key: SPARK-2528 URL: https://issues.apache.org/jira/browse/SPARK-2528 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.0 Reporter: Nicholas Chammas Priority: Minor {{spark-ec2}} configures EC2 security groups with ports [open to the world | https://github.com/apache/spark/blob/9c73822a08848a0cde545282d3eb1c3f1a4c2a82/ec2/spark_ec2.py#L280]. This is an unnecessary security risk, even for a short-lived cluster. Wherever possible, it would be better if, when launching a new cluster, {{spark-ec2}} detects the host's external IP address (e.g. via {{icanhazip.com}}) and grants access specifically to that IP address. -- This message was sent by Atlassian JIRA (v6.2#6252)
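A rough sketch of the behaviour proposed in the description, using boto (which spark_ec2.py is built on); the group name, region, and port are illustrative assumptions, not values from the actual script.
{code}
import urllib2
import boto.ec2

# Detect the launching host's external IP and open the port only to it,
# instead of authorizing 0.0.0.0/0.
my_ip = urllib2.urlopen("http://icanhazip.com").read().strip()
conn = boto.ec2.connect_to_region("us-east-1")
conn.authorize_security_group(group_name="spark-cluster-master",
                              ip_protocol="tcp",
                              from_port=8080, to_port=8080,
                              cidr_ip="%s/32" % my_ip)
{code}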
[jira] [Commented] (SPARK-2501) Handle stage re-submissions properly in the UI
[ https://issues.apache.org/jira/browse/SPARK-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065353#comment-14065353 ] Masayoshi TSUZUKI commented on SPARK-2501: -- Yes, this ticket covers it. I think that problem (like 1010/1000) is caused because we only use stageId (not stageId + attemptId) as the key in JobProgressListener. Handle stage re-submissions properly in the UI -- Key: SPARK-2501 URL: https://issues.apache.org/jira/browse/SPARK-2501 Project: Spark Issue Type: Bug Components: Web UI Reporter: Patrick Wendell Assignee: Masayoshi TSUZUKI Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
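In other words (a toy sketch of the keying problem, not the JobProgressListener code): keyed only by stageId, a resubmitted attempt overwrites or double-counts the original bookkeeping, which is how counters like 1010/1000 appear; keying by (stageId, attemptId) keeps the attempts separate.
{code}
# Keyed by stageId alone, attempt 1 clobbers attempt 0's bookkeeping:
stage_data = {}
stage_data[3] = {"attempt": 0, "tasks_done": 1000}
stage_data[3] = {"attempt": 1, "tasks_done": 10}     # resubmission overwrites

# Keyed by (stageId, attemptId), both attempts are tracked correctly:
stage_data = {}
stage_data[(3, 0)] = {"tasks_done": 1000}
stage_data[(3, 1)] = {"tasks_done": 10}
{code}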
[jira] [Created] (SPARK-2560) Create Spark SQL syntax reference
Nicholas Chammas created SPARK-2560: --- Summary: Create Spark SQL syntax reference Key: SPARK-2560 URL: https://issues.apache.org/jira/browse/SPARK-2560 Project: Spark Issue Type: Documentation Components: SQL Reporter: Nicholas Chammas -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2560) Create Spark SQL syntax reference
[ https://issues.apache.org/jira/browse/SPARK-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-2560: Description: Does Spark SQL support {{LEN()}}? How about {{LIMIT}}? And what about {{MY FAVOURITE SYNTAX}}? Right now there is no reference page to document this. [Hive has one.| https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select] Spark SQL should have one, too. Create Spark SQL syntax reference - Key: SPARK-2560 URL: https://issues.apache.org/jira/browse/SPARK-2560 Project: Spark Issue Type: Documentation Components: SQL Reporter: Nicholas Chammas Does Spark SQL support {{LEN()}}? How about {{LIMIT}}? And what about {{MY FAVOURITE SYNTAX}}? Right now there is no reference page to document this. [Hive has one.| https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select] Spark SQL should have one, too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2542) Exit Code Class should be renamed and placed package properly
[ https://issues.apache.org/jira/browse/SPARK-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065528#comment-14065528 ] Kousuke Saruta commented on SPARK-2542: --- PR: https://github.com/apache/spark/pull/1467 Exit Code Class should be renamed and placed package properly - Key: SPARK-2542 URL: https://issues.apache.org/jira/browse/SPARK-2542 Project: Spark Issue Type: Bug Reporter: Kousuke Saruta org.apache.spark.executor.ExecutorExitCode defines a set of exit codes. The class name suggests the exit codes belong to the Executor, but the exit codes defined in the class can be used by components other than the Executor (e.g. the Driver). Actually, DiskBlockManager uses ExecutorExitCode.DISK_STORE_FAILED_TO_CREATE_DIR, and DiskBlockManager can be used by the Driver. We should rename the class and move it to a new package. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065549#comment-14065549 ] Ken Carlile commented on SPARK-2282: Awesome. I was afraid we were trying to chase down something else here. Glad to hear that it's a known issue and that you've got a good idea how to fix it. Thanks for the quick response! --Ken PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a linux server); echo 1 /proc/sys/net/ipv4/tcp_tw_reuse echo 1 /proc/sys/net/ipv4/tcp_tw_recycle or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1215) Clustering: Index out of bounds error
[ https://issues.apache.org/jira/browse/SPARK-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1215. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1468 [https://github.com/apache/spark/pull/1468] Clustering: Index out of bounds error - Key: SPARK-1215 URL: https://issues.apache.org/jira/browse/SPARK-1215 Project: Spark Issue Type: Bug Components: MLlib Reporter: dewshick Assignee: Joseph K. Bradley Fix For: 1.1.0 Attachments: test.csv code: import org.apache.spark.mllib.clustering._ val test = sc.makeRDD(Array(4,4,4,4,4).map(e = Array(e.toDouble))) val kmeans = new KMeans().setK(4) kmeans.run(test) evals with java.lang.ArrayIndexOutOfBoundsException error: 14/01/17 12:35:54 INFO scheduler.DAGScheduler: Stage 25 (collectAsMap at KMeans.scala:243) finished in 0.047 s 14/01/17 12:35:54 INFO spark.SparkContext: Job finished: collectAsMap at KMeans.scala:243, took 16.389537116 s Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.simontuffs.onejar.Boot.run(Boot.java:340) at com.simontuffs.onejar.Boot.main(Boot.java:166) Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:47) at org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:247) at org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) at scala.collection.immutable.Range.foreach(Range.scala:81) at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) at scala.collection.immutable.Range.map(Range.scala:46) at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:244) at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:124) at Clustering$$anonfun$1.apply$mcDI$sp(Clustering.scala:21) at Clustering$$anonfun$1.apply(Clustering.scala:19) at Clustering$$anonfun$1.apply(Clustering.scala:19) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) at scala.collection.immutable.Range.foreach(Range.scala:78) at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) at scala.collection.immutable.Range.map(Range.scala:46) at Clustering$.main(Clustering.scala:19) at Clustering.main(Clustering.scala) ... 6 more -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2470) Fix PEP 8 violations
[ https://issues.apache.org/jira/browse/SPARK-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065655#comment-14065655 ] Reynold Xin commented on SPARK-2470: That PR only covers a small fraction of the changes required. Fix PEP 8 violations Key: SPARK-2470 URL: https://issues.apache.org/jira/browse/SPARK-2470 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Reynold Xin Assignee: Prashant Sharma Let's fix all our pep8 violations so we can turn the pep8 checker on in continuous integration. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2494) Hash of None is different cross machines in CPython
[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065671#comment-14065671 ] Matthew Farrellee commented on SPARK-2494: -- thank you. i've confirmed this: {code} rdd.groupByKey(10).collect() [((None, 1), pyspark.resultiterable.ResultIterable object at 0x19d4410), ((None, 1), pyspark.resultiterable.ResultIterable object at 0x19d4310), ((None, 1), pyspark.resultiterable.ResultIterable object at 0x19d7290)] {code} i have 3 workers in my cluster Hash of None is different cross machines in CPython --- Key: SPARK-2494 URL: https://issues.apache.org/jira/browse/SPARK-2494 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.0.1 Environment: CPython 2.x Reporter: Davies Liu Priority: Blocker Labels: pyspark, shuffle Fix For: 1.0.0, 1.0.1 Original Estimate: 24h Remaining Estimate: 24h The hash of None, also tuple with None in it, is different cross machines, so the result will be wrong if None appear in the key of partitionBy(). It should use an portable hash function as the default partition function, which generate same hash for all the builtin immutable types, especially tuple. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2562) Add Date datatype support to Spark SQL
Zongheng Yang created SPARK-2562: Summary: Add Date datatype support to Spark SQL Key: SPARK-2562 URL: https://issues.apache.org/jira/browse/SPARK-2562 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1 Reporter: Zongheng Yang Priority: Minor Spark SQL currently supports Timestamp, but not Date. Hive introduced support for Date in [HIVE-4055|https://issues.apache.org/jira/browse/HIVE-4055], where the underlying representation is {{java.sql.Date}}. (Thanks to user Rindra Ramamonjison for reporting this.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1458) Expose sc.version in PySpark
[ https://issues.apache.org/jira/browse/SPARK-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065693#comment-14065693 ] Nicholas Chammas commented on SPARK-1458: - Perhaps that could also be some kind of unit test: Check that certain hand-inputted values are identical across Scala and Python, like the shell banner and {{sc.version}}. Expose sc.version in PySpark Key: SPARK-1458 URL: https://issues.apache.org/jira/browse/SPARK-1458 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core Affects Versions: 0.9.0 Reporter: Nicholas Chammas Priority: Minor As discussed [here|http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html], I think it would be nice if there was a way to programmatically determine what version of Spark you are running. The potential use cases are not that important, but they include: # Branching your code based on what version of Spark is running. # Checking your version without having to quit and restart the Spark shell. Right now in PySpark, I believe the only way to determine your version is by firing up the Spark shell and looking at the startup banner. -- This message was sent by Atlassian JIRA (v6.2#6252)
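A sketch of what such a check might look like: a Python-side `version` property is exactly the feature this ticket requests, so both the property and the `sc._jsc.version()` accessor below are assumptions about how the JVM-side value might be read, shown only to make the unit-test idea concrete.
{code}
# Hypothetical test: the hand-maintained Python-side version string
# (e.g. what the PySpark shell banner prints) must match what the JVM reports.
PYSPARK_BANNER_VERSION = "1.1.0-SNAPSHOT"   # hand-inputted, illustrative

def test_versions_agree(sc):
    # sc._jsc.version() is assumed here as the JVM-side accessor.
    assert PYSPARK_BANNER_VERSION == sc._jsc.version()
{code}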
[jira] [Updated] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store
[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-2365: -- Attachment: 2014-07-07-IndexedRDD-design-review.pdf Slides explaining the motivation, design, and performance of IndexedRDD. Add IndexedRDD, an efficient updatable key-value store -- Key: SPARK-2365 URL: https://issues.apache.org/jira/browse/SPARK-2365 Project: Spark Issue Type: New Feature Components: GraphX, Spark Core Reporter: Ankur Dave Assignee: Ankur Dave Attachments: 2014-07-07-IndexedRDD-design-review.pdf RDDs currently provide a bulk-updatable, iterator-based interface. This imposes minimal requirements on the storage layer, which only needs to support sequential access, enabling on-disk and serialized storage. However, many applications would benefit from a richer interface. Efficient support for point lookups would enable serving data out of RDDs, but it currently requires iterating over an entire partition to find the desired element. Point updates similarly require copying an entire iterator. Joins are also expensive, requiring a shuffle and local hash joins. To address these problems, we propose IndexedRDD, an efficient key-value store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key uniqueness and pre-indexing the entries for efficient joins and point lookups, updates, and deletions. It would be implemented by (1) hash-partitioning the entries by key, (2) maintaining a hash index within each partition, and (3) using purely functional (immutable and efficiently updatable) data structures to enable efficient modifications and deletions. GraphX would be the first user of IndexedRDD, since it currently implements a limited form of this functionality in VertexRDD. We envision a variety of other uses for IndexedRDD, including streaming updates to RDDs, direct serving from RDDs, and as an execution strategy for Spark SQL. -- This message was sent by Atlassian JIRA (v6.2#6252)
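A toy, single-process model of the three ingredients listed in the description — hash-partitioning by key, a hash index per partition, and non-destructive updates — just to make the proposal concrete; it is not the IndexedRDD API.
{code}
class ToyIndexedStore(object):
    """Local stand-in for the IndexedRDD idea: route each key to a
    partition by hash, keep a dict index per partition."""
    def __init__(self, num_partitions=4):
        self.partitions = [dict() for _ in range(num_partitions)]

    def _partition(self, key):
        return hash(key) % len(self.partitions)

    def get(self, key):                      # point lookup touches one partition
        return self.partitions[self._partition(key)].get(key)

    def put(self, key, value):               # non-destructive update
        i = self._partition(key)
        updated = dict(self.partitions[i])   # a real impl would use a persistent
        updated[key] = value                 # data structure, not a full copy
        self.partitions[i] = updated

store = ToyIndexedStore()
store.put(42, "answer")
print store.get(42)   # answer
{code}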
[jira] [Comment Edited] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store
[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065694#comment-14065694 ] Ankur Dave edited comment on SPARK-2365 at 7/17/14 10:31 PM: - Added slides explaining the motivation, design, and performance of IndexedRDD. was (Author: ankurd): Slides explaining the motivation, design, and performance of IndexedRDD. Add IndexedRDD, an efficient updatable key-value store -- Key: SPARK-2365 URL: https://issues.apache.org/jira/browse/SPARK-2365 Project: Spark Issue Type: New Feature Components: GraphX, Spark Core Reporter: Ankur Dave Assignee: Ankur Dave Attachments: 2014-07-07-IndexedRDD-design-review.pdf RDDs currently provide a bulk-updatable, iterator-based interface. This imposes minimal requirements on the storage layer, which only needs to support sequential access, enabling on-disk and serialized storage. However, many applications would benefit from a richer interface. Efficient support for point lookups would enable serving data out of RDDs, but it currently requires iterating over an entire partition to find the desired element. Point updates similarly require copying an entire iterator. Joins are also expensive, requiring a shuffle and local hash joins. To address these problems, we propose IndexedRDD, an efficient key-value store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key uniqueness and pre-indexing the entries for efficient joins and point lookups, updates, and deletions. It would be implemented by (1) hash-partitioning the entries by key, (2) maintaining a hash index within each partition, and (3) using purely functional (immutable and efficiently updatable) data structures to enable efficient modifications and deletions. GraphX would be the first user of IndexedRDD, since it currently implements a limited form of this functionality in VertexRDD. We envision a variety of other uses for IndexedRDD, including streaming updates to RDDs, direct serving from RDDs, and as an execution strategy for Spark SQL. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-872) Should revive offer after tasks finish in Mesos fine-grained mode
[ https://issues.apache.org/jira/browse/SPARK-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065703#comment-14065703 ] Timothy Chen commented on SPARK-872: I'm not quite understanding your statement that the Mesos master will call resourceOffer until 4 cores are free. Can you elaborate on what that means? Should revive offer after tasks finish in Mesos fine-grained mode -- Key: SPARK-872 URL: https://issues.apache.org/jira/browse/SPARK-872 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 0.8.0 Reporter: xiajunluan When running Spark on the latest Mesos release, I notice that Spark in Mesos fine-grained mode cannot schedule tasks effectively. For example, if a slave has 4 CPU cores, the Mesos master will call Spark's resourceOffer function until all 4 CPU cores are free. But in my view, as in standalone scheduler mode, if one task finishes and one CPU core becomes free, the Mesos master should call Spark's resourceOffer to allocate that resource to tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2454) Separate driver spark home from executor spark home
[ https://issues.apache.org/jira/browse/SPARK-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2454: - Description: The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like `spark.executor.home` and `spark.driver.home` rather than re-using SPARK_HOME everywhere. was: The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like `spark.executor.home` rather than re-using SPARK_HOME everywhere. Separate driver spark home from executor spark home --- Key: SPARK-2454 URL: https://issues.apache.org/jira/browse/SPARK-2454 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like `spark.executor.home` and `spark.driver.home` rather than re-using SPARK_HOME everywhere. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1702) Mesos executor won't start because of a ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065706#comment-14065706 ] Timothy Chen commented on SPARK-1702: - The PR is merged and closed already, is this still an issue? Mesos executor won't start because of a ClassNotFoundException -- Key: SPARK-1702 URL: https://issues.apache.org/jira/browse/SPARK-1702 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Bouke van der Bijl Labels: executors, mesos, spark Some discussion here: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html Fix here (which is probably not the right fix): https://github.com/apache/spark/pull/620 This was broken in v0.9.0, was fixed in v0.9.1 and is now broken again. Error in Mesos executor stderr: WARNING: Logging before InitGoogleLogging() is written to STDERR I0502 17:31:42.672224 14688 exec.cpp:131] Version: 0.18.0 I0502 17:31:42.674959 14707 exec.cpp:205] Executor registered on slave 20140501-182306-16842879-5050-10155-0 14/05/02 17:31:42 INFO MesosExecutorBackend: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/05/02 17:31:42 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20140501-182306-16842879-5050-10155-0 14/05/02 17:31:43 INFO SecurityManager: Changing view acls to: vagrant 14/05/02 17:31:43 INFO SecurityManager: SecurityManager, is authentication enabled: false are ui acls enabled: false users with view permissions: Set(vagrant) 14/05/02 17:31:43 INFO Slf4jLogger: Slf4jLogger started 14/05/02 17:31:43 INFO Remoting: Starting remoting 14/05/02 17:31:43 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@localhost:50843] 14/05/02 17:31:43 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@localhost:50843] java.lang.ClassNotFoundException: org/apache/spark/serializer/JavaSerializer at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:165) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:176) at org.apache.spark.executor.Executor.init(Executor.scala:106) at org.apache.spark.executor.MesosExecutorBackend.registered(MesosExecutorBackend.scala:56) Exception in thread Thread-0 I0502 17:31:43.710039 14707 exec.cpp:412] Deactivating the executor libprocess The problem is that it can't find the class. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1764) EOF reached before Python server acknowledged
[ https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065709#comment-14065709 ] Timothy Chen commented on SPARK-1764: - I'm not sure how this is related to Mesos, is this reproable using YARN or standalone? EOF reached before Python server acknowledged - Key: SPARK-1764 URL: https://issues.apache.org/jira/browse/SPARK-1764 Project: Spark Issue Type: Bug Components: Mesos, PySpark Affects Versions: 1.0.0 Reporter: Bouke van der Bijl Priority: Blocker Labels: mesos, pyspark I'm getting EOF reached before Python server acknowledged while using PySpark on Mesos. The error manifests itself in multiple ways. One is: 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error EOF reached before Python server acknowledged; shutting down SparkContext And the other has a full stacktrace: 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server acknowledged org.apache.spark.SparkException: EOF reached before Python server acknowledged at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416) at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387) at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.Accumulators$.add(Accumulators.scala:277) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This error causes the SparkContext to shutdown. I have not been able to reliably reproduce this bug, it seems to happen randomly, but if you run enough tasks on a SparkContext it'll hapen eventually -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2563) Make number of connection retries configurable
Shivaram Venkataraman created SPARK-2563: Summary: Make number of connection retries configurable Key: SPARK-2563 URL: https://issues.apache.org/jira/browse/SPARK-2563 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shivaram Venkataraman Priority: Minor In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. We should make the number of retries before failing configurable to handle these cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
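A generic sketch of what "configurable retries" amounts to; the property name below is made up for illustration only, and the real setting is whatever name the eventual patch introduces.
{code}
import time

def connect_with_retries(connect, conf):
    # "spark.shuffle.connect.maxRetries" is an illustrative name only.
    max_retries = int(conf.get("spark.shuffle.connect.maxRetries", "3"))
    for attempt in range(1, max_retries + 1):
        try:
            return connect()
        except IOError:
            if attempt == max_retries:
                raise
            time.sleep(attempt)   # simple backoff before the next attempt
{code}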
[jira] [Commented] (SPARK-2563) Make number of connection retries configurable
[ https://issues.apache.org/jira/browse/SPARK-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065735#comment-14065735 ] Shivaram Venkataraman commented on SPARK-2563: -- https://github.com/apache/spark/pull/1471 Make number of connection retries configurable -- Key: SPARK-2563 URL: https://issues.apache.org/jira/browse/SPARK-2563 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shivaram Venkataraman Priority: Minor In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. We should make the number of retries before failing configurable to handle these cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2491) When an OOM is thrown,the executor does not stop properly.
[ https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065757#comment-14065757 ] Kousuke Saruta commented on SPARK-2491: --- Hi [~gq] I found the issue related to you reported. https://issues.apache.org/jira/browse/SPARK-1667 I know 3 situation executor does not stop properly at least. 1) At the time executor cannot write files to spark-local-* 2) At the time executor cannot fetch locally from spark-local-* 3) At the time executor cannot fetch from remote because remote executor cannot read from spark-local-* I think if those case occurred, executor should shutdown. I'm trying to solve this issue at https://github.com/apache/spark/pull/1383 When an OOM is thrown,the executor does not stop properly. -- Key: SPARK-2491 URL: https://issues.apache.org/jira/browse/SPARK-2491 Project: Spark Issue Type: Bug Reporter: Guoqiang Li The executor log: {code} # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=kill %p # Executing /bin/sh -c kill 44942... 14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM 14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Connection manager future execution context-6,5,main] java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94) at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176) at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63) at org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close() java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243) at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at 
org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) - 14/07/15 10:38:30 INFO Executor: Running task ID 969 14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally 14/07/15 10:38:30 INFO HadoopRDD: Input split: hdfs://10dian72.domain.test:8020/input/lbs/recommend/toona/rating/20140712/part-7:0+68016537 14/07/15 10:38:30 ERROR Executor: Exception in task ID 969 java.io.FileNotFoundException:
[jira] [Resolved] (SPARK-2534) Avoid pulling in the entire RDD or PairRDDFunctions in various operators
[ https://issues.apache.org/jira/browse/SPARK-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2534. Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 Avoid pulling in the entire RDD or PairRDDFunctions in various operators Key: SPARK-2534 URL: https://issues.apache.org/jira/browse/SPARK-2534 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.1.0, 1.0.2 The way groupByKey is written actually pulls the entire PairRDDFunctions into the 3 closures, sometimes resulting in gigantic task sizes: {code} def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = { // groupByKey shouldn't use map side combine because map side combine does not // reduce the amount of data shuffled and requires all map side data be inserted // into a hash table, leading to more objects in the old gen. def createCombiner(v: V) = ArrayBuffer(v) def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v def mergeCombiners(c1: ArrayBuffer[V], c2: ArrayBuffer[V]) = c1 ++ c2 val bufs = combineByKey[ArrayBuffer[V]]( createCombiner _, mergeValue _, mergeCombiners _, partitioner, mapSideCombine=false) bufs.mapValues(_.toIterable) } {code} Changing the functions from def to val would solve it. -- This message was sent by Atlassian JIRA (v6.2#6252)
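For context, here is a minimal sketch (not the actual Spark code) of why the def-based version captures the enclosing instance under Scala 2.10, while a val-based function literal does not:
{code}
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for PairRDDFunctions; `ship` represents handing a
// closure to combineByKey, which serializes it into each task.
class PairFunctionsSketch[V](val bigField: Array[Byte]) extends Serializable {

  def groupWithDefs(): Unit = {
    // A nested def is lifted to a method on the enclosing class, so the
    // eta-expansion `createCombiner _` closes over `this` and drags the whole
    // instance (including bigField) into the serialized task.
    def createCombiner(v: V) = ArrayBuffer(v)
    ship(createCombiner _)
  }

  def groupWithVals(): Unit = {
    // A val holds a standalone function object with no reference to `this`,
    // so only the small function value is serialized.
    val createCombiner = (v: V) => ArrayBuffer(v)
    ship(createCombiner)
  }

  private def ship(f: V => ArrayBuffer[V]): Unit = ()
}
{code}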
[jira] [Created] (SPARK-2564) ShuffleReadMetrics.totalBlocksFetched is redundant
Sandy Ryza created SPARK-2564: - Summary: ShuffleReadMetrics.totalBlocksFetched is redundant Key: SPARK-2564 URL: https://issues.apache.org/jira/browse/SPARK-2564 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza We already track remoteBlocksFetched and localBlocksFetched -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2565) Update ShuffleReadMetrics as blocks are fetched
Sandy Ryza created SPARK-2565: - Summary: Update ShuffleReadMetrics as blocks are fetched Key: SPARK-2565 URL: https://issues.apache.org/jira/browse/SPARK-2565 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza Updating ShuffleReadMetrics as a task progresses will allow reporting incremental progress after SPARK-2099. -- This message was sent by Atlassian JIRA (v6.2#6252)
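A minimal sketch of the direction, using a simplified stand-in for ShuffleReadMetrics: instead of assigning totals once when the fetch finishes, bump the counters as each block arrives so a heartbeat can report partial progress.
{code}
// Simplified stand-in; the field names mirror the real ShuffleReadMetrics.
class IncrementalShuffleReadMetrics {
  var remoteBlocksFetched: Int = 0
  var remoteBytesRead: Long = 0L

  // Called by the fetcher for every block as it arrives, rather than once at the end.
  def onBlockFetched(bytes: Long): Unit = {
    remoteBlocksFetched += 1
    remoteBytesRead += bytes
  }
}
{code}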
[jira] [Created] (SPARK-2566) Update ShuffleWriteMetrics as data is written
Sandy Ryza created SPARK-2566: - Summary: Update ShuffleWriteMetrics as data is written Key: SPARK-2566 URL: https://issues.apache.org/jira/browse/SPARK-2566 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza This will allow reporting incremental progress once we have SPARK-2099. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2564) ShuffleReadMetrics.totalBlocksFetched is redundant
[ https://issues.apache.org/jira/browse/SPARK-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065826#comment-14065826 ] Sandy Ryza commented on SPARK-2564: --- https://github.com/apache/spark/pull/1474 ShuffleReadMetrics.totalBlocksFetched is redundant -- Key: SPARK-2564 URL: https://issues.apache.org/jira/browse/SPARK-2564 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza We already track remoteBlocksFetched and localBlocksFetched -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2567) Resubmitted stage sometimes remains as active stage in the web UI
Masayoshi TSUZUKI created SPARK-2567: Summary: Resubmitted stage sometimes remains as active stage in the web UI Key: SPARK-2567 URL: https://issues.apache.org/jira/browse/SPARK-2567 Project: Spark Issue Type: Bug Reporter: Masayoshi TSUZUKI When a stage is resubmitted (because of an executor being lost, for example), sometimes more than one resubmitted task appears in the web UI and one stage remains active even after the job has finished. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2567) Resubmitted stage sometimes remains as active stage in the web UI
[ https://issues.apache.org/jira/browse/SPARK-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masayoshi TSUZUKI updated SPARK-2567: - Attachment: SPARK-2567.png Resubmitted stage sometimes remains as active stage in the web UI - Key: SPARK-2567 URL: https://issues.apache.org/jira/browse/SPARK-2567 Project: Spark Issue Type: Bug Reporter: Masayoshi TSUZUKI Attachments: SPARK-2567.png When a stage is resubmitted (because of an executor being lost, for example), sometimes more than one resubmitted task appears in the web UI and one stage remains active even after the job has finished. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2568) RangePartitioner should go through the data only once
Reynold Xin created SPARK-2568: -- Summary: RangePartitioner should go through the data only once Key: SPARK-2568 URL: https://issues.apache.org/jira/browse/SPARK-2568 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Reynold Xin As of Spark 1.0, RangePartitioner goes through data twice: once to compute the count and once to do sampling. As a result, to do sortByKey, Spark goes through data 3 times (once to count, once to sample, and once to sort). RangePartitioner should go through data only once (remove the count step). -- This message was sent by Atlassian JIRA (v6.2#6252)
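One way to do that (a sketch only, with illustrative names, not the eventual Spark implementation) is to reservoir-sample each partition while counting it in the same pass, so the sample and the count come out of a single scan:
{code}
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object SinglePassSampler {
  // Returns up to k uniformly sampled elements plus the exact element count,
  // from a single traversal of the iterator (Algorithm R reservoir sampling).
  def reservoirSampleAndCount[T](iter: Iterator[T], k: Int, seed: Long = 42L): (Seq[T], Long) = {
    val rand = new Random(seed)
    val reservoir = new ArrayBuffer[T](k)
    var n = 0L
    iter.foreach { item =>
      if (n < k) {
        reservoir += item
      } else {
        // Keep the new item with probability k / (n + 1).
        val j = (rand.nextDouble() * (n + 1)).toLong
        if (j < k) reservoir(j.toInt) = item
      }
      n += 1
    }
    (reservoir.toSeq, n)
  }
}
{code}
Running this once per partition yields both the per-partition counts and the samples RangePartitioner needs to pick range bounds, without a separate count() job.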
[jira] [Resolved] (SPARK-2299) Consolidate various stageIdTo* hash maps
[ https://issues.apache.org/jira/browse/SPARK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2299. Resolution: Fixed Fix Version/s: 1.1.0 Consolidate various stageIdTo* hash maps Key: SPARK-2299 URL: https://issues.apache.org/jira/browse/SPARK-2299 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.1.0 In JobProgressListener: {code} val stageIdToTime = HashMap[Int, Long]() val stageIdToShuffleRead = HashMap[Int, Long]() val stageIdToShuffleWrite = HashMap[Int, Long]() val stageIdToMemoryBytesSpilled = HashMap[Int, Long]() val stageIdToDiskBytesSpilled = HashMap[Int, Long]() val stageIdToTasksActive = HashMap[Int, HashMap[Long, TaskInfo]]() val stageIdToTasksComplete = HashMap[Int, Int]() val stageIdToTasksFailed = HashMap[Int, Int]() val stageIdToTaskData = HashMap[Int, HashMap[Long, TaskUIData]]() val stageIdToExecutorSummaries = HashMap[Int, HashMap[String, ExecutorSummary]]() val stageIdToPool = HashMap[Int, String]() val stageIdToDescription = HashMap[Int, String]() {code} We should consolidate them to reduce memory usage and be less error prone. -- This message was sent by Atlassian JIRA (v6.2#6252)
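A sketch of the consolidation (the names are illustrative, not the actual refactoring): keep one mutable record per stage in a single map, so all per-stage state is added and evicted in one place.
{code}
import scala.collection.mutable.HashMap

// One record per stage instead of a dozen parallel stageIdTo* maps.
class StageUIData {
  var executorRunTime: Long = 0L
  var shuffleReadBytes: Long = 0L
  var shuffleWriteBytes: Long = 0L
  var memoryBytesSpilled: Long = 0L
  var diskBytesSpilled: Long = 0L
  var numCompletedTasks: Int = 0
  var numFailedTasks: Int = 0
  var poolName: String = "default"
  var description: Option[String] = None
}

class JobProgressStateSketch {
  // Single lookup: a stage is inserted and cleaned up in exactly one place.
  val stageIdToData = new HashMap[Int, StageUIData]()

  def onStageSubmitted(stageId: Int): StageUIData =
    stageIdToData.getOrElseUpdate(stageId, new StageUIData)
}
{code}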
[jira] [Created] (SPARK-2569) Customized UDFs in hive not running with Spark SQL
jacky hung created SPARK-2569: - Summary: Customized UDFs in hive not running with Spark SQL Key: SPARK-2569 URL: https://issues.apache.org/jira/browse/SPARK-2569 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux or mac, hive 0.9.0 and hive 0.13.0 with hadoop 1.0.4, scala 2.10.3, spark 1.0.0 Reporter: jacky hung Start spark-shell and initialize (create the hiveContext, import ._ etc., and make sure the jar containing the UDFs is on the classpath). hql(CREATE TEMPORARY FUNCTION t_ts AS 'udf.Timestamp') is successful. Then I tried hql(select t_ts(time) from data_common where limit 1).collect().foreach(println), which failed with a NullPointerException. We had a discussion about it on the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/run-sparksql-hiveudf-error-throw-NPE-td.html#a9006 java.lang.NullPointerException org.apache.spark.sql.hive.HiveFunctionFactory$class.getFunctionClass(hiveUdfs.scala:117) org.apache.spark.sql.hive.HiveUdf.getFunctionClass(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveFunctionFactory$class.createFunction(hiveUdfs.scala:119) org.apache.spark.sql.hive.HiveUdf.createFunction(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveUdf.function$lzycompute(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveUdf.function(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveSimpleUdf.method$lzycompute(hiveUdfs.scala:181) org.apache.spark.sql.hive.HiveSimpleUdf.method(hiveUdfs.scala:180) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers$lzycompute(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.eval(hiveUdfs.scala:220) org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:64) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:160) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:153) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) org.apache.spark.rdd.RDD.iterator(RDD.scala:228) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) -- This message was sent by Atlassian JIRA (v6.2#6252)
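Cleaned up, the reproduction from the description looks roughly like the following. The UDF class and table names are the reporter's; the WHERE clause in the original command appears truncated, so it is omitted here.
{code}
// In spark-shell (Spark 1.0.x), with the jar containing the UDF on the classpath.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

hql("CREATE TEMPORARY FUNCTION t_ts AS 'udf.Timestamp'")   // succeeds

hql("SELECT t_ts(time) FROM data_common LIMIT 1")
  .collect()
  .foreach(println)                                         // throws the NullPointerException above
{code}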
[jira] [Created] (SPARK-2570) ClassCastException from HiveFromSpark(examples)
Cheng Hao created SPARK-2570: Summary: ClassCastException from HiveFromSpark(examples) Key: SPARK-2570 URL: https://issues.apache.org/jira/browse/SPARK-2570 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor The exception is thrown when running the HiveFromSpark example: Exception in thread main java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:145) at org.apache.spark.examples.sql.hive.HiveFromSpark$.main(HiveFromSpark.scala:45) at org.apache.spark.examples.sql.hive.HiveFromSpark.main(HiveFromSpark.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2570) ClassCastException from HiveFromSpark(examples)
[ https://issues.apache.org/jira/browse/SPARK-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065905#comment-14065905 ] Cheng Hao commented on SPARK-2570: -- https://github.com/apache/spark/pull/1475 ClassCastException from HiveFromSpark(examples) --- Key: SPARK-2570 URL: https://issues.apache.org/jira/browse/SPARK-2570 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor The exception is thrown when running the HiveFromSpark example: Exception in thread main java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:145) at org.apache.spark.examples.sql.hive.HiveFromSpark$.main(HiveFromSpark.scala:45) at org.apache.spark.examples.sql.hive.HiveFromSpark.main(HiveFromSpark.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.2#6252)
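The cast failure itself is the usual boxed-primitive mismatch rather than anything Hive-specific. A self-contained illustration (not the example's actual code; the value 238L is arbitrary):
{code}
import scala.util.Try

// The value comes back boxed as java.lang.Long (e.g. a Hive bigint); an
// Int-typed read forces the Long -> Integer unboxing seen in the stack trace.
val values: Array[Any] = Array(238L)
val bad  = Try(values(0).asInstanceOf[Int])   // Failure(ClassCastException), via BoxesRunTime.unboxToInt
val good = values(0).asInstanceOf[Long]       // succeeds: reads the value with its actual boxed type
{code}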
[jira] [Created] (SPARK-2571) Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies
Kay Ousterhout created SPARK-2571: - Summary: Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies Key: SPARK-2571 URL: https://issues.apache.org/jira/browse/SPARK-2571 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.1, 0.9.3 Reporter: Kay Ousterhout Assignee: Kay Ousterhout In BlockStoreShuffleFetcher, we set the shuffle metrics for a task to include information about data fetched from one BlockFetcherIterator. When tasks have multiple shuffle dependencies (e.g., a stage that joins two datasets together), the metrics will get set based on data fetched from the last BlockFetcherIterator to complete, rather than the sum of all data fetched from all BlockFetcherIterators. This can lead to dramatically underreporting the shuffle read bytes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2571) Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies
[ https://issues.apache.org/jira/browse/SPARK-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-2571: -- Description: In BlockStoreShuffleFetcher, we set the shuffle metrics for a task to include information about data fetched from one BlockFetcherIterator. When tasks have multiple shuffle dependencies (e.g., a stage that joins two datasets together), the metrics will get set based on data fetched from the last BlockFetcherIterator to complete, rather than the sum of all data fetched from all BlockFetcherIterators. This can lead to dramatically underreporting the shuffle read bytes. Thanks [~andrewor14] and [~rxin] for helping to diagnose this issue. was:In BlockStoreShuffleFetcher, we set the shuffle metrics for a task to include information about data fetched from one BlockFetcherIterator. When tasks have multiple shuffle dependencies (e.g., a stage that joins two datasets together), the metrics will get set based on data fetched from the last BlockFetcherIterator to complete, rather than the sum of all data fetched from all BlockFetcherIterators. This can lead to dramatically underreporting the shuffle read bytes. Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies - Key: SPARK-2571 URL: https://issues.apache.org/jira/browse/SPARK-2571 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.1, 0.9.3 Reporter: Kay Ousterhout Assignee: Kay Ousterhout In BlockStoreShuffleFetcher, we set the shuffle metrics for a task to include information about data fetched from one BlockFetcherIterator. When tasks have multiple shuffle dependencies (e.g., a stage that joins two datasets together), the metrics will get set based on data fetched from the last BlockFetcherIterator to complete, rather than the sum of all data fetched from all BlockFetcherIterators. This can lead to dramatically underreporting the shuffle read bytes. Thanks [~andrewor14] and [~rxin] for helping to diagnose this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
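A minimal sketch of the fix direction (illustrative names, not the actual patch): have each BlockFetcherIterator merge its numbers into the task's metrics instead of overwriting them, so a task with several shuffle dependencies reports the sum.
{code}
// Simplified stand-in for a task's shuffle read metrics.
class TaskShuffleReadMetricsSketch {
  var remoteBlocksFetched: Int = 0
  var remoteBytesRead: Long = 0L
  var fetchWaitTime: Long = 0L

  // Called once per fetcher iterator (per shuffle dependency) when it finishes.
  def merge(blocksFetched: Int, bytesRead: Long, waitTime: Long): Unit = {
    remoteBlocksFetched += blocksFetched   // accumulate rather than assign
    remoteBytesRead += bytesRead
    fetchWaitTime += waitTime
  }
}
{code}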
[jira] [Commented] (SPARK-1458) Expose sc.version in PySpark
[ https://issues.apache.org/jira/browse/SPARK-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065929#comment-14065929 ] Patrick Wendell commented on SPARK-1458: Isn't it possible to just have the python function call the Java/Scala one? That way there is no new version number to be updated anywhere. Expose sc.version in PySpark Key: SPARK-1458 URL: https://issues.apache.org/jira/browse/SPARK-1458 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core Affects Versions: 0.9.0 Reporter: Nicholas Chammas Priority: Minor As discussed [here|http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html], I think it would be nice if there was a way to programmatically determine what version of Spark you are running. The potential use cases are not that important, but they include: # Branching your code based on what version of Spark is running. # Checking your version without having to quit and restart the Spark shell. Right now in PySpark, I believe the only way to determine your version is by firing up the Spark shell and looking at the startup banner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2411) Standalone Master - direct users to turn on event logs
[ https://issues.apache.org/jira/browse/SPARK-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2411: --- Assignee: Andrew Or Standalone Master - direct users to turn on event logs -- Key: SPARK-2411 URL: https://issues.apache.org/jira/browse/SPARK-2411 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.1.0 Attachments: Application history load error.png, Application history not found.png, Event logging not enabled.png Right now if the user attempts to click on a finished application's UI, it simply refreshes. This is simply because the event logs are not there, in which case we set the href=. We could provide more information by pointing them to configure spark.eventLog.enabled if they click on the empty link. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2411) Standalone Master - direct users to turn on event logs
[ https://issues.apache.org/jira/browse/SPARK-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2411: --- Fix Version/s: 1.1.0 Standalone Master - direct users to turn on event logs -- Key: SPARK-2411 URL: https://issues.apache.org/jira/browse/SPARK-2411 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 Attachments: Application history load error.png, Application history not found.png, Event logging not enabled.png Right now if the user attempts to click on a finished application's UI, it simply refreshes. This is simply because the event logs are not there, in which case we set the href=. We could provide more information by pointing them to configure spark.eventLog.enabled if they click on the empty link. -- This message was sent by Atlassian JIRA (v6.2#6252)
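For reference, the configuration such a message would point users at looks like this; the log directory below is only an example path.
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("example")
  .set("spark.eventLog.enabled", "true")              // write event logs for finished applications
  .set("spark.eventLog.dir", "hdfs:///spark-events")  // must be readable by the standalone Master

val sc = new SparkContext(conf)
{code}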
[jira] [Updated] (SPARK-2543) Resizable serialization buffers for kryo
[ https://issues.apache.org/jira/browse/SPARK-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2543: --- Assignee: Koert Kuipers Resizable serialization buffers for kryo Key: SPARK-2543 URL: https://issues.apache.org/jira/browse/SPARK-2543 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: koert kuipers Assignee: Koert Kuipers Priority: Minor Kryo supports resizing serialization output buffers with the maxBufferSize parameter of KryoOutput. I suggest we expose this through the config spark.kryoserializer.buffer.max.mb. For the pull request, see: https://github.com/apache/spark/pull/735 -- This message was sent by Atlassian JIRA (v6.2#6252)
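A sketch of how the proposed setting would sit alongside the existing Kryo configuration; the values are only examples.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "2")         // initial buffer size (existing setting)
  .set("spark.kryoserializer.buffer.max.mb", "512")   // proposed ceiling the buffer may grow to
{code}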