[jira] [Commented] (SPARK-5845) Time to cleanup intermediate shuffle files not included in shuffle write time
[ https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335667#comment-14335667 ] Patrick Wendell commented on SPARK-5845: [~kayousterhout] did you mean the time required to delete spill files used during aggregation? For the shuffle files themselves, they are deleted asynchronously, as [~ilganeli] has mentioned. Time to cleanup intermediate shuffle files not included in shuffle write time - Key: SPARK-5845 URL: https://issues.apache.org/jira/browse/SPARK-5845 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.3.0, 1.2.1 Reporter: Kay Ousterhout Assignee: Ilya Ganelin Priority: Minor When the disk is contended, I've observed cases when it takes as long as 7 seconds to clean up all of the intermediate spill files for a shuffle (when using the sort-based shuffle, but bypassing merging because there are <= 200 shuffle partitions). This happens even when the shuffle data is not huge (152MB written from one of the tasks where I observed this). This is effectively part of the shuffle write time (because it's a necessary side effect of writing data to disk), so it should be added to the shuffle write time to facilitate debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
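The accounting the issue proposes, that cleanup is a necessary side effect of writing and should count toward the same metric, can be sketched outside Spark. A rough Python stand-in; the function and metric names are made up for illustration and are not Spark's instrumentation:

```python
import os
import tempfile
import time

def write_and_cleanup_with_metrics(partitions):
    """Write each partition to a spill file, then delete them all,
    counting the deletion time toward 'shuffle write time' as the
    issue proposes. Purely illustrative, not Spark's code."""
    write_time_ns = 0
    spill_files = []

    for i, data in enumerate(partitions):
        start = time.perf_counter_ns()
        fd, path = tempfile.mkstemp(suffix=f".spill{i}")
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        write_time_ns += time.perf_counter_ns() - start
        spill_files.append(path)

    # Cleanup is a necessary side effect of writing, so time it too.
    start = time.perf_counter_ns()
    for path in spill_files:
        os.remove(path)
    write_time_ns += time.perf_counter_ns() - start
    return write_time_ns
```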
[jira] [Updated] (SPARK-3851) Support for reading parquet files with different but compatible schema
[ https://issues.apache.org/jira/browse/SPARK-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3851: --- Fix Version/s: 1.3.0 Support for reading parquet files with different but compatible schema -- Key: SPARK-3851 URL: https://issues.apache.org/jira/browse/SPARK-3851 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.3.0 Right now it is required that all of the parquet files have the same schema. It would be nice to support some safe subset of cases where the schemas of the files are different. For example: - Adding and removing nullable columns. - Widening types (a column that is of both Int and Long type)
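The safe subset described here can be sketched as a small schema-merge routine: union the field names, treat a column present in only one file as nullable, and widen an Int/Long conflict to Long. A hedged Python sketch, where the type names and the widening table are illustrative, not Spark SQL's actual classes or rules:

```python
# Hypothetical widening rules: an (int, long) conflict resolves to long.
WIDENINGS = {frozenset(["int", "long"]): "long"}

def merge_schemas(a, b):
    """Merge two schemas given as dicts of column name -> type string.
    Columns present in only one schema are kept (treated as nullable)."""
    merged = {}
    for name in list(a) + [n for n in b if n not in a]:
        ta, tb = a.get(name), b.get(name)
        if ta is None or tb is None:
            # Added/removed nullable column: keep whichever side has it.
            merged[name] = ta or tb
        elif ta == tb:
            merged[name] = ta
        else:
            widened = WIDENINGS.get(frozenset([ta, tb]))
            if widened is None:
                raise TypeError(f"incompatible types for {name}: {ta} vs {tb}")
            merged[name] = widened
    return merged
```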
Re: Can you add Big Industries to the Powered by Spark page?
I've added it, thanks! On Fri, Feb 20, 2015 at 12:22 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, Could you please add Big Industries to the Powered by Spark page at https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark ? Company Name: Big Industries URL: http://www.bigindustries.be/ Spark Components: Spark Streaming Use Case: Big Content Platform Summary: The Big Content Platform is a business-to-business content asset management service providing a searchable, aggregated source of live news feeds, public domain media and archives of content. The platform is founded on Apache Hadoop, uses the HDFS filesystem, Apache Spark, Titan Distributed Graph Database, HBase, and Solr. Additionally, the platform leverages public datasets like Freebase, DBpedia, Wiktionary, and Geonames to support semantic text enrichment. Kind regards, Emre Sevinç http://www.bigindustries.be/ - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
[jira] [Resolved] (SPARK-5904) DataFrame methods with varargs do not work in Java
[ https://issues.apache.org/jira/browse/SPARK-5904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5904. Resolution: Fixed Fix Version/s: 1.3.0 I think rxin just forgot to close this. It was merged several days ago. DataFrame methods with varargs do not work in Java -- Key: SPARK-5904 URL: https://issues.apache.org/jira/browse/SPARK-5904 Project: Spark Issue Type: Sub-task Components: Java API, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Reynold Xin Priority: Blocker Labels: DataFrame Fix For: 1.3.0 DataFrame methods with varargs fail when called from Java due to a bug in Scala. This can be produced by, e.g., modifying the end of the example ml.JavaSimpleParamsExample in the master branch: {code} DataFrame results = model2.transform(test); results.printSchema(); // works results.collect(); // works results.filter("label > 0.0").count(); // works for (Row r: results.select("features", "label", "myProbability", "prediction").collect()) { // fails on select System.out.println("(" + r.get(0) + ", " + r.get(1) + ") -> prob=" + r.get(2) + ", prediction=" + r.get(3)); } {code} I have also tried groupBy and found that failed too.
The error looks like this: {code} Exception in thread "main" java.lang.AbstractMethodError: org.apache.spark.sql.DataFrameImpl.groupBy(Ljava/lang/String;[Ljava/lang/String;)Lorg/apache/spark/sql/GroupedData; at org.apache.spark.examples.ml.JavaSimpleParamsExample.main(JavaSimpleParamsExample.java:108) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} The error appears to be from this Scala bug with using varargs in an abstract method: [https://issues.scala-lang.org/browse/SI-9013] My current plan is to move the implementations of the methods with varargs from DataFrameImpl to DataFrame. However, this may cause issues with IncomputableColumn---feedback?? Thanks to [~joshrosen] for figuring the bug and fix out!
[jira] [Commented] (SPARK-5463) Fix Parquet filter push-down
[ https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333608#comment-14333608 ] Patrick Wendell commented on SPARK-5463: Bumping to critical. Per our offline discussion last week we probably won't hold the release for this. Fix Parquet filter push-down Key: SPARK-5463 URL: https://issues.apache.org/jira/browse/SPARK-5463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1, 1.2.2 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical
[jira] [Updated] (SPARK-5463) Fix Parquet filter push-down
[ https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5463: --- Priority: Critical (was: Blocker) Fix Parquet filter push-down Key: SPARK-5463 URL: https://issues.apache.org/jira/browse/SPARK-5463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1, 1.2.2 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical
[jira] [Updated] (SPARK-3650) Triangle Count handles reverse edges incorrectly
[ https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3650: --- Priority: Critical (was: Blocker) Triangle Count handles reverse edges incorrectly Key: SPARK-3650 URL: https://issues.apache.org/jira/browse/SPARK-3650 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0, 1.2.0 Reporter: Joseph E. Gonzalez Priority: Critical The triangle count implementation assumes that edges are aligned in a canonical direction. As stated in the documentation: bq. Note that the input graph should have its edges in canonical direction (i.e. the `sourceId` less than `destId`) However the TriangleCount algorithm does not verify that this condition holds and indeed even the unit tests exploit this functionality: {code:scala} val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ Array(0L -> -1L, -1L -> -2L, -2L -> 0L) val rawEdges = sc.parallelize(triangles, 2) val graph = Graph.fromEdgeTuples(rawEdges, true).cache() val triangleCount = graph.triangleCount() val verts = triangleCount.vertices verts.collect().foreach { case (vid, count) => if (vid == 0) { assert(count === 4) // <-- Should be 2 } else { assert(count === 2) // <-- Should be 1 } } {code}
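The missing canonicalization step is cheap to add before counting. A rough Python sketch of per-vertex triangle counting with explicit canonicalization (this is not the GraphX implementation), using the same test graph as the snippet above:

```python
from collections import defaultdict

def triangle_counts(edges):
    """Count triangles per vertex, first canonicalizing edge direction
    (the check TriangleCount itself skips): undirected, deduplicated,
    self-loops dropped."""
    canon = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    nbrs = defaultdict(set)
    for u, v in canon:
        nbrs[u].add(v)
        nbrs[v].add(u)
    counts = defaultdict(int)
    for u, v in canon:
        # Every common neighbor w of an edge (u, v) closes a triangle.
        for w in nbrs[u] & nbrs[v]:
            counts[u] += 1
            counts[v] += 1
            counts[w] += 1
    # Each triangle is discovered once per edge (3 times), bumping each
    # of its corners 3 times, so divide by 3.
    return {v: c // 3 for v, c in counts.items()}
```

With the two triangles from the unit test (0-1-2 and 0,-1,-2), vertex 0 correctly gets 2 and every other vertex gets 1, even if reverse-direction duplicates are present in the input.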
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
So actually, the list of blockers on JIRA is a bit outdated. These days I won't cut RC1 unless there are no known issues that I'm aware of that would actually block the release (that's what the snapshot ones are for). I'm going to clean those up and push others to do so also. The main issues I'm aware of that came about post RC1 are: 1. Python submission broken on YARN 2. The license issue in MLlib [now fixed]. 3. Varargs broken for Java Dataframes [now fixed] Re: Corey - yeah, as it stands now I try to wait if there are things that look like implicit -1 votes. On Mon, Feb 23, 2015 at 6:13 AM, Corey Nolet cjno...@gmail.com wrote: Thanks Sean. I glossed over the comment about SPARK-5669. On Mon, Feb 23, 2015 at 9:05 AM, Sean Owen so...@cloudera.com wrote: Yes my understanding from Patrick's comment is that this RC will not be released, but, to keep testing. There's an implicit -1 out of the gates there, I believe, and so the vote won't pass, so perhaps that's why there weren't further binding votes. I'm sure that will be formalized shortly. FWIW here are 10 issues still listed as blockers for 1.3.0: SPARK-5910 DataFrame.selectExpr("col as newName") does not work SPARK-5904 SPARK-5166 DataFrame methods with varargs do not work in Java SPARK-5873 Can't see partially analyzed plans SPARK-5546 Improve path to Kafka assembly when trying Kafka Python API SPARK-5517 SPARK-5166 Add input types for Java UDFs SPARK-5463 Fix Parquet filter push-down SPARK-5310 SPARK-5166 Update SQL programming guide for 1.3 SPARK-5183 SPARK-5180 Document data source API SPARK-3650 Triangle Count handles reverse edges incorrectly SPARK-3511 Create a RELEASE-NOTES.txt file in the repo On Mon, Feb 23, 2015 at 1:55 PM, Corey Nolet cjno...@gmail.com wrote: This vote was supposed to close on Saturday but it looks like no PMCs voted (other than the implicit vote from Patrick). Was there a discussion offline to cut an RC2? Was the vote extended?
- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
It's only been reported on this thread by Tom, so far. On Mon, Feb 23, 2015 at 10:29 AM, Marcelo Vanzin van...@cloudera.com wrote: Hey Patrick, Do you have a link to the bug related to Python and Yarn? I looked at the blockers in Jira but couldn't find it. On Mon, Feb 23, 2015 at 10:18 AM, Patrick Wendell pwend...@gmail.com wrote: So actually, the list of blockers on JIRA is a bit outdated. These days I won't cut RC1 unless there are no known issues that I'm aware of that would actually block the release (that's what the snapshot ones are for). I'm going to clean those up and push others to do so also. The main issues I'm aware of that came about post RC1 are: 1. Python submission broken on YARN 2. The license issue in MLlib [now fixed]. 3. Varargs broken for Java Dataframes [now fixed] Re: Corey - yeah, as it stands now I try to wait if there are things that look like implicit -1 votes. -- Marcelo
Re: FW: Submitting jobs to Spark EC2 cluster remotely
What happens if you submit from the master node itself on ec2 (in client mode), does that work? What about in cluster mode? It would be helpful if you could print the full launch command for the executor that is failing. That might show that spark.driver.host is being set strangely. IIRC we print the launch command before starting the executor. Overall the standalone cluster mode is not as well tested across environments with asymmetric connectivity. I didn't actually realize that akka (which the submission uses) can handle this scenario. But it does seem like the job is submitted, it's just not starting correctly. - Patrick On Mon, Feb 23, 2015 at 1:13 AM, Oleg Shirokikh o...@solver.com wrote: Patrick, I haven't changed the configs much. I just executed ec2-script to create 1 master, 2 slaves cluster. Then I try to submit the jobs from remote machine leaving all defaults configured by Spark scripts as default. I've tried to change configs as suggested in other mailing-list and stack overflow threads (such as setting spark.driver.host, etc...), removed (hopefully) all security/firewall restrictions from AWS, etc. but it didn't help. I think that what you are saying is exactly the issue: on my master node UI at the bottom I can see the list of Completed Drivers all with ERROR state... Thanks, Oleg -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Monday, February 23, 2015 12:59 AM To: Oleg Shirokikh Cc: user@spark.apache.org Subject: Re: Submitting jobs to Spark EC2 cluster remotely Can you list other configs that you are setting? It looks like the executor can't communicate back to the driver. I'm actually not sure it's a good idea to set spark.driver.host here, you want to let spark set that automatically. - Patrick On Mon, Feb 23, 2015 at 12:48 AM, Oleg Shirokikh o...@solver.com wrote: Dear Patrick, Thanks a lot for your quick response.
Indeed, following your advice I've uploaded the jar onto S3 and FileNotFoundException is gone now and job is submitted in cluster deploy mode. However, now both (client and cluster) fail with the following errors in executors (they keep exiting/killing executors as I see in UI): 15/02/23 08:42:46 ERROR security.UserGroupInformation: PriviledgedActionException as:oleg cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Full log is: 15/02/23 01:59:11 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 15/02/23 01:59:12 INFO spark.SecurityManager: Changing view acls to: root,oleg 15/02/23 01:59:12 INFO spark.SecurityManager: Changing modify acls to: root,oleg 15/02/23 01:59:12 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, oleg); users with modify permissions: Set(root, oleg) 15/02/23 01:59:12 INFO slf4j.Slf4jLogger: Slf4jLogger started 15/02/23 01:59:12 INFO Remoting: Starting remoting 15/02/23 01:59:13 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@ip-172-31-33-194.us-west-2.compute.internal:39379] 15/02/23 01:59:13 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 39379.
15/02/23 01:59:43 ERROR security.UserGroupInformation: PriviledgedActionException as:oleg cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) ... 4 more Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107
[jira] [Updated] (SPARK-5920) Use a BufferedInputStream to read local shuffle data
[ https://issues.apache.org/jira/browse/SPARK-5920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5920: --- Priority: Critical (was: Major) Use a BufferedInputStream to read local shuffle data Key: SPARK-5920 URL: https://issues.apache.org/jira/browse/SPARK-5920 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.3.0, 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Critical When reading local shuffle data, Spark doesn't currently buffer the local reads into larger chunks, which can lead to terrible disk performance if many tasks are concurrently reading local data from the same disk. We should use a BufferedInputStream to mitigate this problem; we can lazily create the input stream to avoid allocating a bunch of in-memory buffers at the same time for tasks that read shuffle data from a large number of local blocks.
[jira] [Updated] (SPARK-5920) Use a BufferedInputStream to read local shuffle data
[ https://issues.apache.org/jira/browse/SPARK-5920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5920: --- Priority: Blocker (was: Critical) Use a BufferedInputStream to read local shuffle data Key: SPARK-5920 URL: https://issues.apache.org/jira/browse/SPARK-5920 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.3.0, 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Blocker When reading local shuffle data, Spark doesn't currently buffer the local reads into larger chunks, which can lead to terrible disk performance if many tasks are concurrently reading local data from the same disk. We should use a BufferedInputStream to mitigate this problem; we can lazily create the input stream to avoid allocating a bunch of in-memory buffers at the same time for tasks that read shuffle data from a large number of local blocks.
[jira] [Commented] (SPARK-5920) Use a BufferedInputStream to read local shuffle data
[ https://issues.apache.org/jira/browse/SPARK-5920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14331986#comment-14331986 ] Patrick Wendell commented on SPARK-5920: We should definitely do this. Use a BufferedInputStream to read local shuffle data Key: SPARK-5920 URL: https://issues.apache.org/jira/browse/SPARK-5920 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.3.0, 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Blocker When reading local shuffle data, Spark doesn't currently buffer the local reads into larger chunks, which can lead to terrible disk performance if many tasks are concurrently reading local data from the same disk. We should use a BufferedInputStream to mitigate this problem; we can lazily create the input stream to avoid allocating a bunch of in-memory buffers at the same time for tasks that read shuffle data from a large number of local blocks.
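The lazy-creation idea in the issue description can be sketched as a thin wrapper that defers opening the buffered stream until the first read, so a task holding many block readers does not allocate all of their buffers up front. A Python analogue of the Java BufferedInputStream approach (the class name and buffer size are illustrative, not Spark's code):

```python
import io

class LazyBufferedBlockReader:
    """Wrap a local block file in a buffered stream, created lazily on
    first read. Illustrative sketch, not Spark's implementation."""

    def __init__(self, path, buffer_size=64 * 1024):
        self._path = path
        self._buffer_size = buffer_size
        self._stream = None  # no buffer allocated until first read

    def _ensure_open(self):
        if self._stream is None:
            raw = open(self._path, "rb", buffering=0)  # unbuffered file
            self._stream = io.BufferedReader(raw, self._buffer_size)
        return self._stream

    def read(self, n=-1):
        return self._ensure_open().read(n)

    def close(self):
        if self._stream is not None:
            self._stream.close()
```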
[jira] [Commented] (SPARK-5916) $SPARK_HOME/bin/beeline conflicts with $HIVE_HOME/bin/beeline
[ https://issues.apache.org/jira/browse/SPARK-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332001#comment-14332001 ] Patrick Wendell commented on SPARK-5916: The naming conflict is unfortunate. However, scripts like this are a public API in Spark, so I don't think we can remove this API randomly in a minor release given our versioning policies. For instance, people write scripts that invoke beeline to submit queries, etc, and it would break those. One thing we could potentially do is rename this to spark-beeline (or even to something else) and update references accordingly. And then leave ./bin/beeline as a backwards-compatible alias. And if individual distributors want to order their path so that the beeline hits the Hive version first, they can do that. However, that approach is also confusing because we'd have beeline and spark-beeline both there. Ping also [~marmbrus] and [~matei] regarding how we approach changes to scripts like this. In the past I don't think we've ever changed one of these in a minor release. $SPARK_HOME/bin/beeline conflicts with $HIVE_HOME/bin/beeline - Key: SPARK-5916 URL: https://issues.apache.org/jira/browse/SPARK-5916 Project: Spark Issue Type: Improvement Components: SQL Reporter: Carl Steinbach Priority: Minor Hive provides a JDBC CLI named beeline. Spark currently depends on beeline, but provides its own beeline wrapper script for launching it. This results in a conflict when both $HIVE_HOME/bin and $SPARK_HOME/bin appear on a user's PATH. In order to eliminate the potential for conflict I propose changing the name of Spark's beeline wrapper script to sparkline.
[jira] [Resolved] (SPARK-5887) Class not found exception com.datastax.spark.connector.rdd.partitioner.CassandraPartition
[ https://issues.apache.org/jira/browse/SPARK-5887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5887. Resolution: Invalid The Datastax connector is not part of the Apache Spark distribution, it's maintained by Datastax directly. So please reach out to them for support. Thanks! Class not found exception com.datastax.spark.connector.rdd.partitioner.CassandraPartition -- Key: SPARK-5887 URL: https://issues.apache.org/jira/browse/SPARK-5887 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Spark 1.2.1 Spark Cassandra Connector 1.2.0 Alpha2 Reporter: Vijay Pawnarkar I am getting the following class not found exception when using Spark 1.2.1 with spark-cassandra-connector_2.10-1.2.0-alpha2. When the job is submitted to Spark, it successfully adds the required connector JAR file to the Worker's classpath. Corresponding log entries are included in the following description. From the log statements and looking at the Spark 1.2.1 codebase, it looks like the JAR gets added to the urlClassLoader via Executor.scala's updateDependencies method. However, when it is time to execute the task, it's not able to resolve the class name.
[task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, 127.0.0.1): java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- LOG indicating JAR files were added to worker classpath. 
15/02/17 16:56:48 INFO Executor: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar with timestamp 1424210185005 15/02/17 16:56:48 INFO Utils: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar to C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c-90677d12ff99\spark-35f5ed4b-041d-40d8-8854-b243787de188\fetchFileTemp4665176275367448514.tmp 15/02/17 16:56:48 DEBUG Utils: fetchFile not using security 15/02/17 16:56:48 INFO Utils: Copying C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c-90677d12ff99\spark-35f5ed4b-041d-40d8-8854-b243787de188\16215993091424210185005_cache to C:\localapps\spark-1.2.1-bin-hadoop2.4\work\app-20150217165625-0006\0\.\spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar 15/02/17 16:56:48 INFO Executor: Adding file:/C:/localapps/spark-1.2.1-bin-hadoop2.4/work/app-20150217165625-0006/0/./spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar to class loader 15/02/17 16:56:50 INFO Executor: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector_2.10-1.2.0-alpha2.jar with timestamp 1424210185012 15/02/17 16:56:50 INFO Utils: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector_2.10-1.2.0-alpha2.jar to C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c
[jira] [Updated] (SPARK-5863) Performance regression in Spark SQL/Parquet due to ScalaReflection.convertRowToScala
[ https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5863: --- Priority: Critical (was: Major) Performance regression in Spark SQL/Parquet due to ScalaReflection.convertRowToScala Key: SPARK-5863 URL: https://issues.apache.org/jira/browse/SPARK-5863 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1 Reporter: Cristian Priority: Critical Was doing some perf testing on reading parquet files and noticed that moving from Spark 1.1 to 1.2 the performance is 3x worse. In the profiler the culprit showed up as being in ScalaReflection.convertRowToScala. Particularly this zip is the issue: {code} r.toSeq.zip(schema.fields.map(_.dataType)) {code} I see there's a comment on that currently that this is slow but it wasn't fixed. This actually produces a 3x degradation in parquet read performance, at least in my test case. Edit: the map is part of the issue as well. This whole code block is in a tight loop and allocates a new ListBuffer that needs to grow for each transformation. A possible solution is to change to using seq.view which would allocate iterators instead.
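The fix the reporter suggests amounts to hoisting the per-column work out of the row loop, so the zip/map allocations happen once per schema rather than once per row. A Python stand-in for the Scala code (the converter table and function names are hypothetical, purely to illustrate the hoisting):

```python
def make_row_converter(column_types):
    """Build the per-column converters once per schema, instead of
    re-deriving them (via zip/map) inside the per-row loop."""
    # Hypothetical type -> converter mapping for illustration.
    converters = [int if t == "int" else str for t in column_types]

    def convert(row):
        return tuple(conv(v) for conv, v in zip(converters, row))

    return convert

def convert_partition(rows, column_types):
    convert = make_row_converter(column_types)  # built once, not per row
    return [convert(r) for r in rows]
```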
Merging code into branch 1.3
Hey Committers, Now that Spark 1.3 rc1 is cut, please restrict branch-1.3 merges to the following: 1. Fixes for issues blocking the 1.3 release (i.e. 1.2.X regressions) 2. Documentation and tests. 3. Fixes for non-blocker issues that are surgical, low-risk, and/or outside of the core. If there is a lower priority bug fix (a non-blocker) that requires nontrivial code changes, do not merge it into 1.3. If something seems borderline, feel free to reach out to me and we can work through it together. This is what we've done for the last few releases to make sure rc's become progressively more stable, and it is important towards helping us cut timely releases. Thanks! - Patrick
[VOTE] Release Apache Spark 1.3.0 (RC1)
Please vote on releasing the following candidate as Apache Spark version 1.3.0! The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.3.0-rc1/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1069/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/ Please vote on releasing this package as Apache Spark 1.3.0! The vote is open until Saturday, February 21, at 08:03 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.3.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == How can I help test this release? == If you are a Spark user, you can help us test this release by taking a Spark 1.2 workload and running on this release candidate, then reporting any regressions. == What justifies a -1 vote for this release? == This vote is happening towards the end of the 1.3 QA period, so -1 votes should only occur for significant regressions from 1.2.1. Bugs already present in 1.2.X, minor regressions, or bugs related to new features will not block this release. - Patrick
[jira] [Resolved] (SPARK-5864) support .jar as python package
[ https://issues.apache.org/jira/browse/SPARK-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5864. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Davies Liu support .jar as python package -- Key: SPARK-5864 URL: https://issues.apache.org/jira/browse/SPARK-5864 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.3.0 Support .jar file as python package (same as .zip or .egg)
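The reason a .jar can work like a .zip or .egg as a Python package is that a jar is an ordinary zip archive, and CPython's zipimport handles zip archives on sys.path regardless of file extension. A small self-contained sketch (the archive name and module contents are made up):

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny "jar" containing one Python module.
jar_path = os.path.join(tempfile.mkdtemp(), "mylib.jar")
with zipfile.ZipFile(jar_path, "w") as jar:
    jar.writestr("mylib.py", "ANSWER = 42\n")

# A jar is just a zip archive, so zipimport can import from it
# once it is on sys.path.
sys.path.insert(0, jar_path)
import mylib

print(mylib.ANSWER)  # -> 42
```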
[jira] [Resolved] (SPARK-5850) Remove experimental label for Scala 2.11 and FlumePollingStream
[ https://issues.apache.org/jira/browse/SPARK-5850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5850. Resolution: Fixed Fix Version/s: 1.3.0 Remove experimental label for Scala 2.11 and FlumePollingStream --- Key: SPARK-5850 URL: https://issues.apache.org/jira/browse/SPARK-5850 Project: Spark Issue Type: Bug Components: Spark Core, Streaming Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.3.0 These things have been out for at least one release and can be promoted.
[jira] [Resolved] (SPARK-5856) In Maven build script, launch Zinc with more memory
[ https://issues.apache.org/jira/browse/SPARK-5856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5856. Resolution: Fixed Fix Version/s: 1.3.0 In Maven build script, launch Zinc with more memory --- Key: SPARK-5856 URL: https://issues.apache.org/jira/browse/SPARK-5856 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.3.0 I've seen out of memory exceptions when trying to run many parallel builds against the same Zinc server during packaging. We should use the same increased memory settings we use for Maven itself.
[jira] [Updated] (SPARK-4579) Scheduling Delay appears negative
[ https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4579: --- Assignee: Andrew Or Scheduling Delay appears negative - Key: SPARK-4579 URL: https://issues.apache.org/jira/browse/SPARK-4579 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Arun Ahuja Assignee: Andrew Or Priority: Critical !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png!
[jira] [Commented] (SPARK-4579) Scheduling Delay appears negative
[ https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325622#comment-14325622 ] Patrick Wendell commented on SPARK-4579: [~andrewor14] Can you take a look at this one? Scheduling Delay appears negative - Key: SPARK-4579 URL: https://issues.apache.org/jira/browse/SPARK-4579 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Arun Ahuja Assignee: Andrew Or Priority: Critical !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png!
[jira] [Updated] (SPARK-4579) Scheduling Delay appears negative
[ https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4579: --- Labels: (was: starter) Scheduling Delay appears negative - Key: SPARK-4579 URL: https://issues.apache.org/jira/browse/SPARK-4579 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Arun Ahuja Priority: Critical !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png!
[jira] [Updated] (SPARK-4579) Scheduling Delay appears negative
[ https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4579: --- Priority: Critical (was: Minor) Scheduling Delay appears negative - Key: SPARK-4579 URL: https://issues.apache.org/jira/browse/SPARK-4579 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Arun Ahuja Priority: Critical Labels: starter !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png!
[jira] [Updated] (SPARK-4579) Scheduling Delay appears negative
[ https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4579: --- Labels: starter (was: ) Scheduling Delay appears negative - Key: SPARK-4579 URL: https://issues.apache.org/jira/browse/SPARK-4579 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Arun Ahuja Priority: Critical Labels: starter !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png!
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
UISeleniumSuite: *** RUN ABORTED *** java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal ... This is a newer test suite. There is something flaky about it and we should definitely fix it, but IMO it's not a blocker. Patrick, this link gives a 404: https://people.apache.org/keys/committer/pwendell.asc Works for me. Maybe it's some ephemeral issue? Finally, I realized I failed to get the fix for https://issues.apache.org/jira/browse/SPARK-5669 correct, and that has to be correct for the release. I'll patch that up straight away, sorry. I believe the result of the intended fix is still as I described in SPARK-5669, so there is no bad news there. A local test seems to confirm it and I'm waiting on Jenkins. If it's all good I'll merge that fix. So that much will need a new release; I apologize. Thanks for finding this. I'm going to leave this open for continued testing...
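For context on the failure above: a `NoClassDefFoundError` for `org/w3c/dom/ElementTraversal` in a Selenium/HtmlUnit-based suite typically means the `xml-apis` jar, which provides that interface, is missing from the test classpath. A hedged sketch of the kind of Maven dependency addition that resolves it (the version shown is illustrative, not taken from Spark's build):

```xml
<!-- Illustrative only: supplies org.w3c.dom.ElementTraversal for HtmlUnit tests -->
<dependency>
  <groupId>xml-apis</groupId>
  <artifactId>xml-apis</artifactId>
  <version>1.4.01</version>
  <scope>test</scope>
</dependency>
```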
Re: [Performance] Possible regression in rdd.take()?
I believe the heuristic governing the way that take() decides to fetch partitions changed between these versions. It could be that in certain cases the new heuristic is worse, but it might be good to just look at the source code and see, for your number of elements taken and number of partitions, if there was any effective change in how aggressively Spark fetched partitions. This was quite a while ago, but I think the change was made because in many cases the newer code works more efficiently. - Patrick On Wed, Feb 18, 2015 at 4:47 PM, Matt Cheah mch...@palantir.com wrote: Hi everyone, Between Spark 1.0.2 and Spark 1.1.1, I have noticed that rdd.take() consistently has a slower execution time on the later release. I was wondering if anyone else has had similar observations. I have two setups where this reproduces. The first is a local test. I launched a Spark cluster with 4 worker JVMs on my Mac, and launched a Spark-Shell. I retrieved the text file and immediately called rdd.take(N) on it, where N varied. The RDD is a plaintext CSV, 4GB in size, split over 8 files, which ends up having 128 partitions, and a total of 8000 rows. The numbers I discovered between Spark 1.0.2 and Spark 1.1.1 are, with all numbers being in seconds:
1 items
Spark 1.0.2: 0.069281, 0.012261, 0.011083
Spark 1.1.1: 0.11577, 0.097636, 0.11321
4 items
Spark 1.0.2: 0.023751, 0.069365, 0.023603
Spark 1.1.1: 0.224287, 0.229651, 0.158431
10 items
Spark 1.0.2: 0.047019, 0.049056, 0.042568
Spark 1.1.1: 0.353277, 0.288965, 0.281751
40 items
Spark 1.0.2: 0.216048, 0.198049, 0.796037
Spark 1.1.1: 1.865622, 2.224424, 2.037672
This small test suite indicates a consistently reproducible performance regression. I also notice this on a larger scale test. The cluster used is on EC2: ec2 instance type: m2.4xlarge 10 slaves, 1 master ephemeral storage 70 cores, 50 GB/box In this case, I have a 100GB dataset split into 78 files totaling 350 million items, and I take the first 50,000 items from the RDD. 
In this case, I have tested this on different formats of the raw data. With plaintext files:
Spark 1.0.2: 0.422s, 0.363s, 0.382s
Spark 1.1.1: 4.54s, 1.28s, 1.221s, 1.13s
With snappy-compressed Avro files:
Spark 1.0.2: 0.73s, 0.395s, 0.426s
Spark 1.1.1: 4.618s, 1.81s, 1.158s, 1.333s
Again demonstrating a reproducible performance regression. I was wondering if anyone else observed this regression, and if so, if anyone would have any idea what could possibly have caused it between Spark 1.0.2 and Spark 1.1.1? Thanks, -Matt Cheah
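The scale-up behavior Patrick describes can be sketched in a few lines. This is an illustrative simulation of a take(n)-style heuristic (scan one partition first, then grow the number of partitions scanned by a factor each round), not Spark's actual implementation; the names and the default factor of 4 are assumptions for the sake of the example:

```scala
object TakeHeuristicSketch {
  // Simulates a take(n)-style scan over partitions of known sizes.
  // Returns (itemsTaken, partitionsScanned).
  def simulateTake(partitionSizes: Seq[Int], n: Int, scaleUpFactor: Int = 4): (Int, Int) = {
    var taken = 0
    var scanned = 0
    var numToScan = 1 // start conservatively with a single partition
    while (taken < n && scanned < partitionSizes.length) {
      val batch = partitionSizes.slice(scanned, scanned + numToScan)
      taken += math.min(batch.sum, n - taken) // never take more than requested
      scanned += batch.length
      numToScan *= scaleUpFactor // fetch more partitions in the next round
    }
    (taken, scanned)
  }

  def main(args: Array[String]): Unit = {
    // 128 partitions of ~62 rows each, roughly the 8000-row local test above
    val parts = Seq.fill(128)(62)
    println(simulateTake(parts, 10))   // a tiny take scans only the first partition
    println(simulateTake(parts, 1000)) // a larger take needs several scale-up rounds
  }
}
```

The point of the sketch is that small changes to the starting batch size or scale-up factor change how many partitions (and hence how much I/O) a small take(N) touches, which is the kind of shift that could explain the timings reported above.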
[jira] [Resolved] (SPARK-5811) Documentation for --packages and --repositories on Spark Shell
[ https://issues.apache.org/jira/browse/SPARK-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5811. Resolution: Fixed Assignee: Burak Yavuz Documentation for --packages and --repositories on Spark Shell -- Key: SPARK-5811 URL: https://issues.apache.org/jira/browse/SPARK-5811 Project: Spark Issue Type: Documentation Components: Deploy, Spark Shell Affects Versions: 1.3.0 Reporter: Burak Yavuz Assignee: Burak Yavuz Priority: Critical Fix For: 1.3.0 Documentation for the new support for dependency management with Maven coordinates via --packages and --repositories
[jira] [Updated] (SPARK-4454) Race condition in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4454: --- Labels: backport-needed (was: ) Race condition in DAGScheduler -- Key: SPARK-4454 URL: https://issues.apache.org/jira/browse/SPARK-4454 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.1.0 Reporter: Rafal Kwasny Assignee: Josh Rosen Priority: Critical Labels: backport-needed Fix For: 1.3.0 It seems to be a race condition in DAGScheduler that manifests on jobs with high concurrency: {noformat} Exception in thread main java.util.NoSuchElementException: key not found: 35 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275) at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:937
[jira] [Reopened] (SPARK-4454) Race condition in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-4454: Actually, re-opening this since we need to back port it. Race condition in DAGScheduler -- Key: SPARK-4454 URL: https://issues.apache.org/jira/browse/SPARK-4454 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.1.0 Reporter: Rafal Kwasny Assignee: Josh Rosen Priority: Critical Fix For: 1.3.0 It seems to be a race condition in DAGScheduler that manifests on jobs with high concurrency: {noformat} Exception in thread main java.util.NoSuchElementException: key not found: 35 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275) at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:937) at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs
[jira] [Updated] (SPARK-4454) Race condition in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4454: --- Target Version/s: 1.3.0, 1.2.2 (was: 1.3.0) Race condition in DAGScheduler -- Key: SPARK-4454 URL: https://issues.apache.org/jira/browse/SPARK-4454 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.1.0 Reporter: Rafal Kwasny Assignee: Josh Rosen Priority: Critical Labels: backport-needed Fix For: 1.3.0 It seems to be a race condition in DAGScheduler that manifests on jobs with high concurrency: {noformat} Exception in thread main java.util.NoSuchElementException: key not found: 35 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275) at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:937
Re: Replacing Jetty with TomCat
Hey Niranda, It seems to me like a lot of effort to support multiple libraries inside of Spark like this, so I'm not sure that's a great solution. If you are building an application that embeds Spark, is it not possible for you to continue to use Jetty for Spark's internal servers and use Tomcat for your own servers? I would guess that many complex applications end up embedding multiple server libraries in various places (Spark itself has different transport mechanisms, etc.) - Patrick On Tue, Feb 17, 2015 at 7:14 PM, Niranda Perera niranda.per...@gmail.com wrote: Hi Sean, The main issue we have is running two web servers in a single product; we think it would not be an elegant solution. Could you please point me to the main areas where the Jetty server is tightly coupled, or extension points where I could plug in Tomcat instead of Jetty? If successful I could contribute it to the Spark project. :-) cheers On Mon, Feb 16, 2015 at 4:51 PM, Sean Owen so...@cloudera.com wrote: There's no particular reason you have to remove the embedded Jetty server, right? It doesn't prevent you from using it inside another app that happens to run in Tomcat. You won't be able to switch it out without rewriting a fair bit of code, no, but you don't need to. On Mon, Feb 16, 2015 at 5:08 AM, Niranda Perera niranda.per...@gmail.com wrote: Hi, We are thinking of integrating the Spark server inside a product. Our current product uses Tomcat as its webserver. Is it possible to switch the Jetty webserver in Spark to Tomcat off-the-shelf? Cheers -- Niranda -- Niranda
[jira] [Resolved] (SPARK-4454) Race condition in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4454. Resolution: Fixed Fix Version/s: 1.3.0 We can't be 100% sure this is fixed because it was not a reproducible issue. However, Josh has committed a patch that I think should make it hard to have race conditions around the cache location data structure. Race condition in DAGScheduler -- Key: SPARK-4454 URL: https://issues.apache.org/jira/browse/SPARK-4454 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.1.0 Reporter: Rafal Kwasny Assignee: Josh Rosen Priority: Critical Fix For: 1.3.0 It seems to be a race condition in DAGScheduler that manifests on jobs with high concurrency: {noformat} Exception in thread main java.util.NoSuchElementException: key not found: 35 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304
[jira] [Commented] (SPARK-4454) Race condition in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324747#comment-14324747 ] Patrick Wendell commented on SPARK-4454: [~srowen] yeah I meant the particular PR was bad, not that the issue does not exist. Race condition in DAGScheduler -- Key: SPARK-4454 URL: https://issues.apache.org/jira/browse/SPARK-4454 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.1.0 Reporter: Rafal Kwasny Assignee: Josh Rosen Priority: Minor There seems to be a race condition in DAGScheduler that manifests on jobs with high concurrency: {noformat}
Exception in thread "main" java.util.NoSuchElementException: key not found: 35
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
 at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
 at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275)
 at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:937
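The trace bottoms out in mutable `HashMap.apply` inside `DAGScheduler.getCacheLocs`, i.e. a lookup of a key that a concurrent job submission has not inserted yet. A minimal Java sketch of that failure mode and the obvious guard follows; the class and method names are hypothetical stand-ins, not Spark's actual Scala code:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Toy model of the race: getCacheLocs reads a shared map that another thread
// may be populating at the same time.
class CacheLocs {
    private final Map<Integer, List<String>> cacheLocs = new HashMap<>();

    // Unsafe analog of Scala's HashMap.apply(): assumes the key is present,
    // and throws exactly the exception seen in the report when it is not.
    List<String> getUnsafe(int rddId) {
        List<String> locs = cacheLocs.get(rddId);
        if (locs == null) {
            throw new NoSuchElementException("key not found: " + rddId);
        }
        return locs;
    }

    // Safer variant: one lock covers both the lookup and the initialization,
    // so a concurrent first-time lookup can never observe a missing key.
    List<String> getSafe(int rddId) {
        synchronized (cacheLocs) {
            return cacheLocs.computeIfAbsent(rddId, id -> Collections.emptyList());
        }
    }
}
```

The unsafe variant reproduces the `NoSuchElementException` from the report; the safe variant initializes missing entries under a single lock.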
[jira] [Updated] (SPARK-4454) Race condition in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4454: --- Priority: Critical (was: Minor) Race condition in DAGScheduler -- Key: SPARK-4454 URL: https://issues.apache.org/jira/browse/SPARK-4454 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.1.0 Reporter: Rafal Kwasny Assignee: Josh Rosen Priority: Critical
[jira] [Updated] (SPARK-4454) Race condition in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4454: --- Target Version/s: 1.3.0 Race condition in DAGScheduler -- Key: SPARK-4454 URL: https://issues.apache.org/jira/browse/SPARK-4454 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.1.0 Reporter: Rafal Kwasny Assignee: Josh Rosen Priority: Critical
[jira] [Commented] (SPARK-5864) support .jar as python package
[ https://issues.apache.org/jira/browse/SPARK-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324787#comment-14324787 ] Patrick Wendell commented on SPARK-5864: I merged Davies' PR, but per Burak's comment I think there is additional work needed. support .jar as python package -- Key: SPARK-5864 URL: https://issues.apache.org/jira/browse/SPARK-5864 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Support .jar files as python packages (same as .zip or .egg) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5778) Throw if nonexistent spark.metrics.conf file is provided
[ https://issues.apache.org/jira/browse/SPARK-5778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5778. Resolution: Fixed Fix Version/s: 1.3.0 Throw if nonexistent spark.metrics.conf file is provided -- Key: SPARK-5778 URL: https://issues.apache.org/jira/browse/SPARK-5778 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Ryan Williams Assignee: Ryan Williams Priority: Minor Fix For: 1.3.0 Spark looks for a {{MetricsSystem}} configuration file when the {{spark.metrics.conf}} parameter is set, [defaulting to the path {{metrics.properties}} when it's not set|https://github.com/apache/spark/blob/466b1f671b21f575d28f9c103f51765790914fe3/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L52-L55]. In the event of a failure to find or parse the file, [the exception is caught and an error is logged|https://github.com/apache/spark/blob/466b1f671b21f575d28f9c103f51765790914fe3/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L61]. This seems like reasonable behavior in the general case where the user has not specified a {{spark.metrics.conf}} file. However, I've been bitten several times by having specified a file that all or some executors did not have present (I typo'd the path, or forgot to add an additional {{--files}} flag to make my local metrics config file get shipped to all executors), the error was swallowed and I was confused about why I'd captured no metrics from a job that appeared to have run successfully. I'd like to change the behavior to actually throw if the user has specified a configuration file that doesn't exist. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
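The proposed fail-fast behavior can be sketched in a few lines. The loader below is a hypothetical illustration, not Spark's actual `MetricsConfig`: only an explicitly configured path that is missing should throw, while the implicit default path may be absent silently:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Optional;
import java.util.Properties;

// Sketch of the requested behavior: throw when a user-specified metrics file
// does not exist, instead of logging the error and continuing without metrics.
class MetricsConfigLoader {
    static final String DEFAULT_PATH = "metrics.properties";

    static Properties load(Optional<String> configuredPath) throws IOException {
        Properties props = new Properties();
        String path = configuredPath.orElse(DEFAULT_PATH);
        File file = new File(path);
        if (file.isFile()) {
            try (FileInputStream in = new FileInputStream(file)) {
                props.load(in);
            }
        } else if (configuredPath.isPresent()) {
            // Explicitly configured but absent: fail loudly on every executor.
            throw new FileNotFoundException("spark.metrics.conf file not found: " + path);
        }
        // Implicit default path missing: silently return empty config.
        return props;
    }
}
```

This keeps the lenient default-path behavior while surfacing typo'd or un-shipped config files immediately.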
[jira] [Updated] (SPARK-5778) Throw if nonexistent spark.metrics.conf file is provided
[ https://issues.apache.org/jira/browse/SPARK-5778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5778: --- Assignee: Ryan Williams Throw if nonexistent spark.metrics.conf file is provided -- Key: SPARK-5778 URL: https://issues.apache.org/jira/browse/SPARK-5778 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Ryan Williams Assignee: Ryan Williams Priority: Minor Spark looks for a {{MetricsSystem}} configuration file when the {{spark.metrics.conf}} parameter is set, [defaulting to the path {{metrics.properties}} when it's not set|https://github.com/apache/spark/blob/466b1f671b21f575d28f9c103f51765790914fe3/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L52-L55]. In the event of a failure to find or parse the file, [the exception is caught and an error is logged|https://github.com/apache/spark/blob/466b1f671b21f575d28f9c103f51765790914fe3/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L61]. This seems like reasonable behavior in the general case where the user has not specified a {{spark.metrics.conf}} file. However, I've been bitten several times by having specified a file that all or some executors did not have present (I typo'd the path, or forgot to add an additional {{--files}} flag to make my local metrics config file get shipped to all executors), the error was swallowed and I was confused about why I'd captured no metrics from a job that appeared to have run successfully. I'd like to change the behavior to actually throw if the user has specified a configuration file that doesn't exist. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5848) ConsoleProgressBar timer thread leaks SparkContext
[ https://issues.apache.org/jira/browse/SPARK-5848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5848: --- Component/s: (was: Web UI) Spark Shell ConsoleProgressBar timer thread leaks SparkContext -- Key: SPARK-5848 URL: https://issues.apache.org/jira/browse/SPARK-5848 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.1 Reporter: Matt Whelan ConsoleProgressBar starts a timer. This creates a thread (which is a garbage collection root) with a reference to the ConsoleProgressBar instance, which holds a reference to the SparkContext. That timer is never canceled, so SparkContexts are leaked. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
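The leak and its fix can be sketched in a few lines. The class below is a hypothetical stand-in, not the actual ConsoleProgressBar: the Timer's thread is a garbage-collection root, and its task holds a reference to the enclosing object, so everything that object references (such as a SparkContext) stays reachable until the timer is cancelled:

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal model of the leak: the TimerTask captures `this`, so the Timer
// thread keeps this object (and anything it references) alive.
class ProgressBar {
    final AtomicInteger ticks = new AtomicInteger();
    private final Timer timer = new Timer("refresh-progress", true);

    ProgressBar() {
        timer.schedule(new TimerTask() {
            @Override public void run() { ticks.incrementAndGet(); }
        }, 0, 50);
    }

    // The fix: expose a stop() hook that cancels the timer, and call it when
    // the owning context shuts down (e.g. from SparkContext.stop()).
    void stop() { timer.cancel(); }
}
```

After `stop()`, no further task executions are scheduled and the task (with its back-reference) becomes collectable.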
[jira] [Updated] (SPARK-5846) Spark SQL does not correctly set job description and scheduler pool
[ https://issues.apache.org/jira/browse/SPARK-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5846: --- Priority: Critical (was: Major) Spark SQL does not correctly set job description and scheduler pool --- Key: SPARK-5846 URL: https://issues.apache.org/jira/browse/SPARK-5846 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Critical Spark SQL currently sets the scheduler pool and job description AFTER jobs run (see https://github.com/apache/spark/blob/master/sql/hive-thriftserver/v0.13.1/src/main/scala/org/apache/spark/sql/hive/thriftserver/Shim13.scala#L168 -- which happens after calling hiveContext.sql). As a result, the description for a SQL job ends up being the SQL query corresponding to the previous job. The description should be set before the job is run so that it is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
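The off-by-one labeling can be shown with a toy model (hypothetical names, not Spark's actual API): the scheduler captures the job description when the job starts, so setting it after the query runs labels each job with the previous query's text:

```java
// Toy scheduler: the description in effect when runJob() is called is the one
// attached to that job, mirroring how Spark reads local properties at job start.
class Scheduler {
    private static String description = "";
    static void setJobDescription(String d) { description = d; }
    static String runJob() { return description; } // captured at job start
}

class SqlRunner {
    // Current (buggy) order: run the query, then set the description.
    static String buggySql(String query) {
        String label = Scheduler.runJob();     // job already labeled...
        Scheduler.setJobDescription(query);    // ...so this arrives too late
        return label;
    }

    // Proposed order: set the description, then run the query.
    static String fixedSql(String query) {
        Scheduler.setJobDescription(query);
        return Scheduler.runJob();
    }
}
```

With the buggy order, the second query's job carries the first query's description, which is exactly the symptom described above.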
[jira] [Updated] (SPARK-5850) Remove experimental label for Scala 2.11 and FlumePollingStream
[ https://issues.apache.org/jira/browse/SPARK-5850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5850: --- Summary: Remove experimental label for Scala 2.11 and FlumePollingStream (was: Clean up experimental label for Scala 2.11 and FlumePollingStream) Remove experimental label for Scala 2.11 and FlumePollingStream --- Key: SPARK-5850 URL: https://issues.apache.org/jira/browse/SPARK-5850 Project: Spark Issue Type: Bug Components: Spark Core, Streaming Reporter: Patrick Wendell Assignee: Patrick Wendell These things have been out for at least one release and can be promoted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5850) Clean up experimental label for Scala 2.11 and FlumePollingStream
Patrick Wendell created SPARK-5850: -- Summary: Clean up experimental label for Scala 2.11 and FlumePollingStream Key: SPARK-5850 URL: https://issues.apache.org/jira/browse/SPARK-5850 Project: Spark Issue Type: Bug Components: Spark Core, Streaming Reporter: Patrick Wendell Assignee: Patrick Wendell These things have been out for at least one release and can be promoted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5856) In Maven build script, launch Zinc with more memory
Patrick Wendell created SPARK-5856: -- Summary: In Maven build script, launch Zinc with more memory Key: SPARK-5856 URL: https://issues.apache.org/jira/browse/SPARK-5856 Project: Spark Issue Type: Improvement Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker I've seen out of memory exceptions when trying to run many parallel builds against the same Zinc server during packaging. We should use the same increased memory settings we use for Maven itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5081: --- Priority: Critical (was: Major) Shuffle write increases --- Key: SPARK-5081 URL: https://issues.apache.org/jira/browse/SPARK-5081 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Kevin Jung Priority: Critical The size of the shuffle write shown in the Spark web UI is much larger when I execute the same Spark job with the same input data on Spark 1.2 than on Spark 1.1. At the sortBy stage, the shuffle write is 98.1MB in Spark 1.1 but 146.9MB in Spark 1.2. I set the spark.shuffle.manager option to hash because its default value changed, but Spark 1.2 still writes more shuffle output than Spark 1.1. This increases disk I/O overhead significantly as the input file gets bigger and causes jobs to take more time to complete. For example, with about 100GB of input, the shuffle write is 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
spark 1.1
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|9|saveAsTextFile| |1169.4KB| |
|12|combineByKey| |1265.4KB|1275.0KB|
|6|sortByKey| |1276.5KB| |
|8|mapPartitions| |91.0MB|1383.1KB|
|4|apply| |89.4MB| |
|5|sortBy|155.6MB| |98.1MB|
|3|sortBy|155.6MB| | |
|1|collect| |2.1MB| |
|2|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
spark 1.2
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|12|saveAsTextFile| |1170.2KB| |
|11|combineByKey| |1264.5KB|1275.0KB|
|8|sortByKey| |1273.6KB| |
|7|mapPartitions| |134.5MB|1383.1KB|
|5|zipWithIndex| |132.5MB| |
|4|sortBy|155.6MB| |146.9MB|
|3|sortBy|155.6MB| | |
|2|collect| |2.0MB| |
|1|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5850) Remove experimental label for Scala 2.11 and FlumePollingStream
[ https://issues.apache.org/jira/browse/SPARK-5850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5850: --- Priority: Blocker (was: Major) Remove experimental label for Scala 2.11 and FlumePollingStream --- Key: SPARK-5850 URL: https://issues.apache.org/jira/browse/SPARK-5850 Project: Spark Issue Type: Bug Components: Spark Core, Streaming Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker These things have been out for at least one release and can be promoted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323629#comment-14323629 ] Patrick Wendell commented on SPARK-5081: Hey [~cb_betz], can you verify a few things? It would be good to make sure you revert all configuration changes from 1.2.0. Specifically, set spark.shuffle.blockTransferService to nio and set spark.shuffle.manager to hash. Also, verify in the UI that they are set correctly. If all this is set, can you give us the change in the size of the aggregate shuffle output between the two releases? Shuffle write increases --- Key: SPARK-5081 URL: https://issues.apache.org/jira/browse/SPARK-5081 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Kevin Jung The size of the shuffle write shown in the Spark web UI is much larger when I execute the same Spark job with the same input data on Spark 1.2 than on Spark 1.1. At the sortBy stage, the shuffle write is 98.1MB in Spark 1.1 but 146.9MB in Spark 1.2. I set the spark.shuffle.manager option to hash because its default value changed, but Spark 1.2 still writes more shuffle output than Spark 1.1. This increases disk I/O overhead significantly as the input file gets bigger and causes jobs to take more time to complete. For example, with about 100GB of input, the shuffle write is 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
spark 1.1
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|9|saveAsTextFile| |1169.4KB| |
|12|combineByKey| |1265.4KB|1275.0KB|
|6|sortByKey| |1276.5KB| |
|8|mapPartitions| |91.0MB|1383.1KB|
|4|apply| |89.4MB| |
|5|sortBy|155.6MB| |98.1MB|
|3|sortBy|155.6MB| | |
|1|collect| |2.1MB| |
|2|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
spark 1.2
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|12|saveAsTextFile| |1170.2KB| |
|11|combineByKey| |1264.5KB|1275.0KB|
|8|sortByKey| |1273.6KB| |
|7|mapPartitions| |134.5MB|1383.1KB|
|5|zipWithIndex| |132.5MB| |
|4|sortBy|155.6MB| |146.9MB|
|3|sortBy|155.6MB| | |
|2|collect| |2.0MB| |
|1|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
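For reference, the two settings Patrick asks to verify, collected in one place. This is a sketch only: a plain Map stands in for SparkConf so it runs without Spark on the classpath; in practice the values go into spark-defaults.conf, `--conf` flags, or `SparkConf.set`:

```java
import java.util.Map;

// The configuration that restores Spark 1.1 shuffle behavior under Spark 1.2,
// for an apples-to-apples comparison of shuffle write sizes.
class Spark11Defaults {
    static final Map<String, String> CONF = Map.of(
        "spark.shuffle.blockTransferService", "nio",  // 1.2 default is netty
        "spark.shuffle.manager", "hash"               // 1.2 default is sort
    );
}
```

Both values should then be visible on the Environment tab of the web UI, which is the verification step requested in the comment.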
[jira] [Commented] (SPARK-5745) Allow to use custom TaskMetrics implementation
[ https://issues.apache.org/jira/browse/SPARK-5745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14322091#comment-14322091 ] Patrick Wendell commented on SPARK-5745: Hey [~jlewandowski] - TaskMetrics are a mostly internal concept. In fact, there isn't really any nice framework for aggregation internally. We instead have a bunch of manual aggregation in various places. The primary user-facing API we have for aggregated counters is accumulators. Are there features lacking from accumulators that make it difficult for you to use them for your use case? Allow to use custom TaskMetrics implementation -- Key: SPARK-5745 URL: https://issues.apache.org/jira/browse/SPARK-5745 Project: Spark Issue Type: Wish Components: Spark Core Reporter: Jacek Lewandowski There can be various RDDs implemented, and {{TaskMetrics}} provides a great API for collecting metrics and aggregating them. However, some RDDs may want to register custom metrics (for example, the number of rows read), and the current implementation doesn't allow for this. I suppose this can be changed without modifying the whole interface - a factory could be used to create the initial {{TaskMetrics}} object, and the default factory could be overridden by the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
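The accumulator pattern Patrick points to, illustrated generically. This is plain Java, not Spark's Accumulator API: each task adds to a local counter, the driver merges the results, which covers custom per-task counters such as "rows read" without extending TaskMetrics:

```java
// Generic add-locally, merge-at-driver counter: the essence of what Spark's
// accumulators provide for user-defined per-task metrics.
class LongAccumulator {
    private long total;
    void add(long v) { total += v; }                       // called in a task
    void merge(LongAccumulator other) { total += other.value(); } // at driver
    long value() { return total; }
}
```

An RDD that wants to report rows read would call `add` per record and let the driver read the merged `value` after the job, rather than installing a custom TaskMetrics implementation.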
[jira] [Updated] (SPARK-5826) JavaStreamingContext.fileStream cause Configuration NotSerializableException
[ https://issues.apache.org/jira/browse/SPARK-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5826: --- Priority: Critical (was: Minor) JavaStreamingContext.fileStream cause Configuration NotSerializableException Key: SPARK-5826 URL: https://issues.apache.org/jira/browse/SPARK-5826 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Littlestar Priority: Critical Attachments: TestStream.java org.apache.spark.streaming.api.java.JavaStreamingContext.fileStream(String directory, Class<LongWritable> kClass, Class<Text> vClass, Class<TextInputFormat> fClass, Function<Path, Boolean> filter, boolean newFilesOnly, Configuration conf) I use JavaStreamingContext.fileStream on 1.3.0/master with a Configuration, but it throws a strange exception.
java.io.NotSerializableException: org.apache.hadoop.conf.Configuration
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440)
 at org.apache.spark.streaming.DStreamGraph$$anonfun$writeObject$1.apply$mcV$sp(DStreamGraph.scala:177)
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1075)
 at org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:172)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:184)
 at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:278)
 at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:169)
 at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:78)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1.aroundReceive(JobGenerator.scala:76)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
 at akka.dispatch.Mailbox.run(Mailbox.scala:220)
 at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec
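The usual workaround pattern for a non-serializable member such as Hadoop's Configuration is to keep it in a transient field and rebuild it from serializable state during deserialization (Spark's own SerializableWritable takes a similar approach for Hadoop types). The sketch below uses a stand-in `Settings` class, not Hadoop's Configuration, so it runs without Hadoop on the classpath:

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.hadoop.conf.Configuration: deliberately NOT Serializable.
class Settings {
    final Map<String, String> entries;
    Settings(Map<String, String> entries) { this.entries = entries; }
}

// Wrapper: the non-serializable object lives in a transient field; only its
// serializable state crosses the wire, and the object is rebuilt on read.
class SerializableSettings implements Serializable {
    private transient Settings settings;

    SerializableSettings(Settings settings) { this.settings = settings; }
    Settings value() { return settings; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeObject(new HashMap<>(settings.entries)); // serializable state only
    }

    @SuppressWarnings("unchecked")
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        settings = new Settings((Map<String, String>) in.readObject()); // rebuild
    }
}
```

Serializing the wrapper succeeds where serializing the raw object throws `NotSerializableException`, which mirrors the checkpoint failure in the trace.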
[jira] [Commented] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader
[ https://issues.apache.org/jira/browse/SPARK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321269#comment-14321269 ] Patrick Wendell commented on SPARK-5770: If the request is to support hot reloading of classes in a live JVM (I can't tell if it is), it seems fairly dangerous to do that. I'm not sure how that works if you have live objects of a specific class and then the class definition changes under them. Anyway, as Sean and Marcelo pointed out, it would be helpful to hear what the intended use case is! Use addJar() to upload a new jar file to executor, it can't be added to classloader --- Key: SPARK-5770 URL: https://issues.apache.org/jira/browse/SPARK-5770 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula First use addJar() to upload a jar to the executor, then change the jar content and upload it again. We can see that the jar file on the local disk has been updated, but the classloader still loads the old one. The executor log has no error or exception to point this out. I use spark-shell to test it, with spark.files.overwrite set to true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5801) Shuffle creates too many nested directories
[ https://issues.apache.org/jira/browse/SPARK-5801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5801: --- Priority: Critical (was: Major) Shuffle creates too many nested directories --- Key: SPARK-5801 URL: https://issues.apache.org/jira/browse/SPARK-5801 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.1 Reporter: Kay Ousterhout Priority: Critical When running Spark on EC2, there are 4 nested shuffle directories before the hashed directory names, for example: /mnt/spark/spark-5824d912-25af-4187-bc6a-29ae42cd78e5/spark-675133f0-b2c8-44a1-8775-5e394674609b/spark-69c1ea15-4e7f-454a-9f57-19763c7bdd17/spark-b036335c-60fa-48ab-a346-f1b420af2027/0c My understanding is that this should look like: /mnt/spark/spark-5824d912-25af-4187-bc6a-29ae42cd78e5/0c This happened when I was using the sort-based shuffle (all default configurations for Spark on EC2). This is not a correctness problem (the shuffle still works fine). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
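One plausible way to end up with the 4-deep nesting reported above (an illustrative guess, not the confirmed root cause) is that each layer creates its per-application directory under the previous layer's directory instead of under the shared root. A toy sketch, where `local_spark_dir` is a hypothetical helper:

```python
import os
import tempfile
import uuid


def local_spark_dir(root):
    """Hypothetical helper: create a per-application spark-<uuid>
    directory under whatever root the caller passes in."""
    d = os.path.join(root, "spark-" + str(uuid.uuid4()))
    os.makedirs(d)
    return d


# Expected behavior: every layer calls local_spark_dir(root), sharing one
# root, yielding /mnt/spark/spark-<uuid>/0c. The buggy pattern sketched
# here instead treats each layer's output as the next layer's root, so
# four layers produce the 4-deep nesting from the report.
root = tempfile.mkdtemp()
d = root
for _ in range(4):
    d = local_spark_dir(d)           # layer N nests inside layer N-1

hashed = os.path.join(d, "0c")       # hashed subdirectory, as in "/0c"
```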
[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK
[ https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321267#comment-14321267 ] Patrick Wendell commented on SPARK-5813: Hey [~florianverhein]. Just wondering, are there specific features of Oracle's JRE you are interested in? These days, Oracle's JRE and OpenJDK are basically equivalent. In the history of Spark, I don't think I've ever seen us have a bug that was specific to OpenJDK and not also present in the Oracle JDK. Given how much easier it is to install OpenJDK, I'm not sure it's worth the extra packaging annoyance to add Oracle Java. Just curious if you have a specific reason to want Oracle. Spark-ec2: Switch to OracleJDK -- Key: SPARK-5813 URL: https://issues.apache.org/jira/browse/SPARK-5813 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor Currently using OpenJDK; however, it is generally recommended to use the Oracle JDK, esp. for Hadoop deployments, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5801) Shuffle creates too many nested directories
[ https://issues.apache.org/jira/browse/SPARK-5801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5801: --- Component/s: Spark Core Shuffle creates too many nested directories --- Key: SPARK-5801 URL: https://issues.apache.org/jira/browse/SPARK-5801 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.1 Reporter: Kay Ousterhout Priority: Critical When running Spark on EC2, there are 4 nested shuffle directories before the hashed directory names, for example: /mnt/spark/spark-5824d912-25af-4187-bc6a-29ae42cd78e5/spark-675133f0-b2c8-44a1-8775-5e394674609b/spark-69c1ea15-4e7f-454a-9f57-19763c7bdd17/spark-b036335c-60fa-48ab-a346-f1b420af2027/0c My understanding is that this should look like: /mnt/spark/spark-5824d912-25af-4187-bc6a-29ae42cd78e5/0c This happened when I was using the sort-based shuffle (all default configurations for Spark on EC2). This is not a correctness problem (the shuffle still works fine). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
[ https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5731: --- Priority: Blocker (was: Major) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Blocker Labels: flaky-test {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds. Last failure message: 300 did not equal 48 didn't get all messages. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at 
org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38) at 
org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfterAll$$super$run(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at org.scalatest.BeforeAndAfterAll$class.run
[jira] [Commented] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
[ https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320739#comment-14320739 ] Patrick Wendell commented on SPARK-5731: [~c...@koeninger.org] [~tdas] FYI we've disabled this test because it's caused a huge productivity loss to ongoing development with frequent failures. Please try to get this test into good shape ASAP - otherwise this code will be untested. Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Blocker Labels: flaky-test {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds. Last failure message: 300 did not equal 48 didn't get all messages. 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38) at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241
[jira] [Updated] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
[ https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5731: --- Labels: flaky-test (was: ) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Blocker Labels: flaky-test {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds. Last failure message: 300 did not equal 48 didn't get all messages. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at 
org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38) at 
org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfterAll$$super$run(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at org.scalatest.BeforeAndAfterAll$class.run
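The failure message above ("The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds") comes from ScalaTest's `Eventually.eventually`, which retries an assertion block until it passes or a patience timeout expires. A minimal Python sketch of that retry pattern; parameter names and defaults are illustrative, not ScalaTest's:

```python
import time


def eventually(assertion, timeout=2.0, interval=0.05):
    """Minimal sketch of the ScalaTest Eventually pattern: keep calling
    `assertion` until it stops raising AssertionError or `timeout`
    elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return assertion()
        except AssertionError:
            if time.monotonic() >= deadline:
                # ScalaTest reports the attempt count and the last
                # failure message; here we simply re-raise it.
                raise
            time.sleep(interval)
```

A test built this way is only as reliable as its timeout: if the system under test can legitimately take longer than the patience window (as this Kafka suite apparently can on loaded Jenkins workers), the test turns flaky.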
[jira] [Resolved] (SPARK-5679) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method
[ https://issues.apache.org/jira/browse/SPARK-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5679. Resolution: Fixed Fix Version/s: 1.2.2 1.3.0 Assignee: Josh Rosen (was: Kostas Sakellis) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method -- Key: SPARK-5679 URL: https://issues.apache.org/jira/browse/SPARK-5679 Project: Spark Issue Type: Bug Components: Spark Core, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Josh Rosen Labels: flaky-test Fix For: 1.3.0, 1.2.2 Please audit these and see if there are any assumptions with respect to File IO that might not hold in all cases. I'm happy to help if you can't find anything. These both failed in the same run: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-SBT/38/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink {code} org.apache.spark.metrics.InputOutputMetricsSuite.input metrics with mixed read method Failing for the past 13 builds (Since Failed#26 ) Took 48 sec. 
Error Message 2030 did not equal 6496 Stacktrace sbt.ForkMain$ForkError: 2030 did not equal 6496 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply$mcV$sp(InputOutputMetricsSuite.scala:135) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfter$$super$runTest(InputOutputMetricsSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.metrics.InputOutputMetricsSuite.runTest(InputOutputMetricsSuite.scala:46) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfterAll$$super$run(InputOutputMetricsSuite.scala:46) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at org.scalatest.BeforeAndAfterAll
Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets
The map will start with a capacity of 64, but will grow to accommodate new data. Are you using the groupBy operator in Spark or are you using Spark SQL's group by? This usually happens if you are grouping or aggregating in a way that doesn't sufficiently condense the data created from each input partition. - Patrick On Wed, Feb 11, 2015 at 9:37 PM, fightf...@163.com fightf...@163.com wrote: Hi, We still have no adequate solution for this issue. Any available analytical rules or hints would be appreciated. Thanks, Sun. fightf...@163.com From: fightf...@163.com Date: 2015-02-09 11:56 To: user; dev Subject: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets Hi, The problem still exists. Would any experts take a look at this? Thanks, Sun. fightf...@163.com From: fightf...@163.com Date: 2015-02-06 17:54 To: user; dev Subject: Sort Shuffle performance issues about using AppendOnlyMap for large data sets Hi all, Recently we hit performance issues when using Spark 1.2.0 to read data from HBase and do some summary work. Our scenario is: read large data sets from HBase (maybe 100G+ files), form an hbaseRDD, transform it to a SchemaRDD, then group by and aggregate the data into much smaller summary data sets, and load the results into HBase (Phoenix). Our main issue: aggregating the large data sets into summary data sets takes too long (1 hour+), which is far worse performance than expected. We got the dump file attached and a stacktrace from jstack like the following: From the stacktrace and dump file we can see that processing large data sets causes frequent AppendOnlyMap growth, leading to a huge map entry size. We checked the source code of org.apache.spark.util.collection.AppendOnlyMap and found that the map is initialized with a capacity of 64. That seems too small for our use case. So the question is: has anyone encountered such issues before? How were they resolved?
I cannot find any JIRA issues for such problems; if someone has seen one, please kindly let us know. More specifically: is there any way for the user to configure the map capacity in Spark? If so, please tell us how to achieve that. Best Thanks, Sun. Thread 22432: (state = IN_JAVA) - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, line=224 (Compiled frame; information may be imprecise) - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() @bci=1, line=38 (Interpreted frame) - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, line=198 (Compiled frame) - org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=201, line=145 (Compiled frame) - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=3, line=32 (Compiled frame) - org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator) @bci=141, line=205 (Compiled frame) - org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator) @bci=74, line=58 (Interpreted frame) - org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=169, line=68 (Interpreted frame) - org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=2, line=41 (Interpreted frame) - org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted frame) - org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame) Thread 22431: (state = IN_JAVA) - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, line=224 (Compiled frame; 
information may be imprecise) - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() @bci=1, line=38 (Interpreted frame) - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, line=198 (Compiled frame) - org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=201, line=145 (Compiled frame) - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=3, line=32 (Compiled frame) - org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator) @bci=141, line=205 (Compiled frame) - org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator) @bci=74, line=58 (Interpreted frame) - org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
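Sun's question about the capacity-64 initialization can be made concrete with a toy sketch of the growth behavior (Python standing in for the Scala class; the starting capacity matches the description above, the load factor and everything else are illustrative). Starting small costs little for small aggregations, but a partition that produces many distinct keys pays for repeated doublings, each of which rehashes every live entry, which is exactly the growTable() work dominating the thread dumps above:

```python
class AppendOnlyMap:
    """Toy sketch of org.apache.spark.util.collection.AppendOnlyMap's
    growth behavior. Keys and values live in one flat array
    [k0, v0, k1, v1, ...]; the table starts at capacity 64 and doubles
    whenever the load factor is exceeded."""

    LOAD_FACTOR = 0.7  # illustrative threshold, not Spark's exact value

    def __init__(self, initial_capacity=64):
        self.capacity = initial_capacity
        self.data = [None] * (2 * self.capacity)
        self.size = 0
        self.grow_count = 0   # number of rehashes, for illustration

    def change_value(self, key, update):
        """Insert or update: new value = update(had_value, old_value),
        mirroring the changeValue(key, Function2) API in the dump."""
        pos = hash(key) % self.capacity
        delta = 1
        while True:
            cur = self.data[2 * pos]
            if cur is None:
                self.data[2 * pos] = key
                self.data[2 * pos + 1] = update(False, None)
                self.size += 1
                if self.size > self.LOAD_FACTOR * self.capacity:
                    self._grow_table()
                return
            if cur == key:
                self.data[2 * pos + 1] = update(True, self.data[2 * pos + 1])
                return
            pos = (pos + delta) % self.capacity   # triangular probing
            delta += 1

    def _grow_table(self):
        # The cost the jstack output is showing: every growth allocates
        # a table twice as large and rehashes all live entries.
        old, old_cap = self.data, self.capacity
        self.capacity *= 2
        self.data = [None] * (2 * self.capacity)
        self.size = 0
        self.grow_count += 1
        for i in range(old_cap):
            k = old[2 * i]
            if k is not None:
                v = old[2 * i + 1]
                self.change_value(k, lambda had, _old, v=v: v)
```

This also illustrates Patrick's point: an aggregation with a map-side combine (reduceByKey-style) keeps one entry per distinct key, so the map stays small, while a groupBy that does not condense the data forces the map through many doublings.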
Re: driver fail-over in Spark streaming 1.2.0
It will create and connect to new executors. The executors are mostly stateless, so the program can resume with new executors. On Wed, Feb 11, 2015 at 11:24 PM, lin kurtt@gmail.com wrote: Hi all, In Spark Streaming 1.2.0, when the driver fails and a new driver starts with the most recently checkpointed data, will the former executors connect to the new driver, or will the new driver start its own set of new executors? In which piece of code is that done? Any reply will be appreciated :) regards, lin - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: How to track issues that must wait for Spark 2.x in JIRA?
Yeah, my preference is also a more open-ended 2+ version for issues that are clearly desirable but blocked by compatibility concerns. What I would really want to avoid is major feature proposals sitting around in our JIRA tagged under some 2.X version. IMO JIRA isn't the place for thoughts about very-long-term things. When we get these, I'd be inclined to close them as Won't Fix or Later. On Thu, Feb 12, 2015 at 12:47 AM, Reynold Xin r...@databricks.com wrote: It seems to me having a version that is 2+ is good for that? Once we move to 2.0, we can retag those that are not going to be fixed in 2.0 as 2.0.1 or 2.1.0. On Thu, Feb 12, 2015 at 12:42 AM, Sean Owen so...@cloudera.com wrote: Patrick and I were chatting about how to handle several issues which clearly need a fix, and are easy, but can't be implemented until the next major release like Spark 2.x, since they would change APIs. Examples: https://issues.apache.org/jira/browse/SPARK-3266 https://issues.apache.org/jira/browse/SPARK-3369 https://issues.apache.org/jira/browse/SPARK-4819 We could simply make a version 2.0.0 in JIRA. Although straightforward, it might imply that release planning has begun for 2.0.0. The version could be called 2+ for now to better indicate its status. There is also a Later JIRA resolution. Although resolving the above seems a little wrong, it might be reasonable if we're sure to revisit Later, well, at some well-defined later. The three issues above risk getting lost in the shuffle. We also wondered whether using Later is good or bad. It takes items off the radar that aren't going to be acted on anytime soon -- and there are lots of those right now. But it might send a message that these will be revisited, when they are even less likely to be if resolved. Any opinions? 
- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[ANNOUNCE] Spark 1.3.0 Snapshot 1
Hey All, I've posted Spark 1.3.0 snapshot 1. At this point the 1.3 branch is ready for community testing and we are strictly merging fixes and documentation across all components. The release files, including signatures, digests, etc., can be found at: http://people.apache.org/~pwendell/spark-1.3.0-snapshot1/ The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1068/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.3.0-snapshot1-docs/ Please report any issues with the release to this thread and/or to our project JIRA. Thanks! - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[jira] [Updated] (SPARK-5606) Support plus sign in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5606: --- Assignee: Yadong Qi Support plus sign in HiveContext Key: SPARK-5606 URL: https://issues.apache.org/jira/browse/SPARK-5606 Project: Spark Issue Type: Bug Components: SQL Reporter: Yadong Qi Assignee: Yadong Qi Fix For: 1.3.0 Currently Spark only supports ```SELECT -key FROM DECIMAL_UDF;``` in HiveContext. This patch adds support for ```SELECT +key FROM DECIMAL_UDF;``` as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: feeding DataFrames into predictive algorithms
I think there is a minor error here in that the first example needs a tail after the seq: df.map { row => (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double])) }.toDataFrame("label", "features") On Wed, Feb 11, 2015 at 7:46 PM, Michael Armbrust mich...@databricks.com wrote: It sounds like you probably want to do a standard Spark map that results in a tuple with the structure you are looking for. You can then just assign names to turn it back into a DataFrame. Assuming the first column is your label and the rest are features, you can do something like this: val df = sc.parallelize( (1.0, 2.3, 2.4) :: (1.2, 3.4, 1.2) :: (1.2, 2.3, 1.2) :: Nil).toDataFrame("a", "b", "c") df.map { row => (row.getDouble(0), row.toSeq.map(_.asInstanceOf[Double])) }.toDataFrame("label", "features") df: org.apache.spark.sql.DataFrame = [label: double, features: array<double>] If you'd prefer to stick closer to SQL you can define a UDF: val createArray = udf((a: Double, b: Double) => Seq(a, b)) df.select('a as 'label, createArray('b, 'c) as 'features) df: org.apache.spark.sql.DataFrame = [label: double, features: array<double>] We'll add createArray as a first-class member of the DSL. Michael On Wed, Feb 11, 2015 at 6:37 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hey All, I've been playing around with the new DataFrame and ML pipelines APIs and am having trouble accomplishing what seems like it should be a fairly basic task. I have a DataFrame where each column is a Double. I'd like to turn this into a DataFrame with a features column and a label column that I can feed into a regression. So far all the paths I've gone down have led me to internal APIs or convoluted casting in and out of RDD[Row] and DataFrame. Is there a simple way of accomplishing this? any assistance (lookin' at you Xiangrui) much appreciated, Sandy - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
[jira] [Updated] (SPARK-5656) NegativeArraySizeException in EigenValueDecomposition.symmetricEigs for large n and/or large k
[ https://issues.apache.org/jira/browse/SPARK-5656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5656: --- Assignee: Mark Bittmann NegativeArraySizeException in EigenValueDecomposition.symmetricEigs for large n and/or large k -- Key: SPARK-5656 URL: https://issues.apache.org/jira/browse/SPARK-5656 Project: Spark Issue Type: Bug Components: MLlib Reporter: Mark Bittmann Assignee: Mark Bittmann Priority: Minor Fix For: 1.4.0 Large values of n or k in EigenValueDecomposition.symmetricEigs will fail with a NegativeArraySizeException. Specifically, this occurs when 2*n*k > Integer.MAX_VALUE. These values are currently unchecked and allow for the array to be initialized to a value greater than Integer.MAX_VALUE. I have written the below 'require' to fail this condition gracefully. I will submit a pull request. require(ncv * n.toLong < Integer.MAX_VALUE, "Product of 2*k*n must be smaller than " + s"Integer.MAX_VALUE. Found required eigenvalues k = $k and matrix dimension n = $n") Here is the exception that occurs from computeSVD with large k and/or n: Exception in thread "main" java.lang.NegativeArraySizeException at org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:85) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:258) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:190) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
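The overflow that the proposed require() guards against can be reproduced with plain JVM integer arithmetic; a self-contained sketch with hypothetical values of n and k (any pair with 2*k*n greater than Integer.MAX_VALUE behaves the same):

```scala
// Sketch of the Int overflow behind the NegativeArraySizeException.
// n and k are hypothetical values, chosen so that 2*k*n exceeds Integer.MAX_VALUE.
val n = 5000000                 // matrix dimension
val k = 300                     // requested eigenvalues
val ncv = math.min(2 * k, n)    // working-subspace size, following the 2*k*n wording above
val intProduct = ncv * n        // Int multiplication wraps around silently
assert(intProduct < 0)          // new Array[Double](intProduct) throws NegativeArraySizeException
val longProduct = ncv * n.toLong
assert(longProduct > Integer.MAX_VALUE) // promoting to Long first lets require() reject it gracefully
```

The fix is the standard one for JVM size checks: widen one operand to Long before multiplying, then compare against Integer.MAX_VALUE.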
[jira] [Updated] (SPARK-5366) check for mode of private key file
[ https://issues.apache.org/jira/browse/SPARK-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5366: --- Assignee: liu chang check for mode of private key file -- Key: SPARK-5366 URL: https://issues.apache.org/jira/browse/SPARK-5366 Project: Spark Issue Type: Improvement Components: EC2 Reporter: liu chang Assignee: liu chang Priority: Minor Fix For: 1.4.0 check the mode for the private key. User should set it to 600. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5611) Allow spark-ec2 repo to be specified in CLI of spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5611: --- Assignee: Florian Verhein Allow spark-ec2 repo to be specified in CLI of spark_ec2.py --- Key: SPARK-5611 URL: https://issues.apache.org/jira/browse/SPARK-5611 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Assignee: Florian Verhein Priority: Minor Fix For: 1.4.0 Allows repo and branch of the desired spark-ec2 (and by extension the ami-list) to be specified on the command line. Helps when trying out different branches or forks of spark-ec2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5648) support alter ... unset tblproperties(key)
[ https://issues.apache.org/jira/browse/SPARK-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5648: --- Assignee: DoingDone9 support alter ... unset tblproperties(key) --- Key: SPARK-5648 URL: https://issues.apache.org/jira/browse/SPARK-5648 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: DoingDone9 Assignee: DoingDone9 Fix For: 1.3.0 Make HiveContext support unset tblproperties(key), e.g.: alter view viewName unset tblproperties(k) alter table tableName unset tblproperties(k) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5568) Python API for the write support of the data source API
[ https://issues.apache.org/jira/browse/SPARK-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5568: --- Assignee: Yin Huai Python API for the write support of the data source API --- Key: SPARK-5568 URL: https://issues.apache.org/jira/browse/SPARK-5568 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5733) Error Link in Pagination of HistroyPage when showing Incomplete Applications
[ https://issues.apache.org/jira/browse/SPARK-5733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5733: --- Assignee: Liangliang Gu Error Link in Pagination of HistroyPage when showing Incomplete Applications - Key: SPARK-5733 URL: https://issues.apache.org/jira/browse/SPARK-5733 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.1 Reporter: Liangliang Gu Assignee: Liangliang Gu Priority: Minor Fix For: 1.3.0 The links in the pagination of HistoryPage are wrong when showing Incomplete Applications. If "2" is clicked on the page "http://history-server:18080/?page=1&showIncomplete=true", it will go to "http://history-server:18080/?page=2" instead of "http://history-server:18080/?page=2&showIncomplete=true". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5658) Finalize DDL and write support APIs
[ https://issues.apache.org/jira/browse/SPARK-5658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5658: --- Assignee: Yin Huai Finalize DDL and write support APIs --- Key: SPARK-5658 URL: https://issues.apache.org/jira/browse/SPARK-5658 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5704) createDataFrame replace applySchema/inferSchema
[ https://issues.apache.org/jira/browse/SPARK-5704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5704: --- Assignee: Davies Liu createDataFrame replace applySchema/inferSchema --- Key: SPARK-5704 URL: https://issues.apache.org/jira/browse/SPARK-5704 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5709) Add EXPLAIN support for DataFrame API for debugging purpose
[ https://issues.apache.org/jira/browse/SPARK-5709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5709: --- Assignee: Cheng Hao Add EXPLAIN support for DataFrame API for debugging purpose - Key: SPARK-5709 URL: https://issues.apache.org/jira/browse/SPARK-5709 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5683) Improve the json serialization for DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-5683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5683: --- Assignee: Cheng Hao Improve the json serialization for DataFrame API Key: SPARK-5683 URL: https://issues.apache.org/jira/browse/SPARK-5683 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5509) EqualTo operator doesn't handle binary type properly
[ https://issues.apache.org/jira/browse/SPARK-5509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5509: --- Assignee: Cheng Lian EqualTo operator doesn't handle binary type properly Key: SPARK-5509 URL: https://issues.apache.org/jira/browse/SPARK-5509 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.3.0, 1.2.1 Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.3.0 Binary type is mapped to {{Array\[Byte\]}}, which can't be compared with {{==}} directly. However, {{EqualTo.eval()}} uses plain {{==}} to compare values. Run the following {{spark-shell}} snippet with Spark 1.2.0 to reproduce this issue: {code}
import org.apache.spark.sql.SQLContext
import sc._
val sqlContext = new SQLContext(sc)
import sqlContext._
case class KV(key: Int, value: Array[Byte])
def toBinary(s: String): Array[Byte] = s.toString.getBytes("UTF-8")
registerFunction("toBinary", toBinary _)
parallelize(1 to 1024).map(i => KV(i, toBinary(i.toString))).registerTempTable("bin")
// OK
sql("select * from bin where value < toBinary('100')").collect()
// Oops, returns nothing
sql("select * from bin where value = toBinary('100')").collect()
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
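The root cause is ordinary JVM semantics rather than anything SQL-specific: `==` on `Array[Byte]` compares object identity, not contents. A standalone sketch:

```scala
// Two byte arrays with identical contents are still distinct objects on the JVM,
// so identity-style equality (what plain == gives for arrays) fails.
val a = "100".getBytes("UTF-8")
val b = "100".getBytes("UTF-8")
assert(a != b)                         // == / != on arrays is reference equality
assert(a sameElements b)               // element-wise comparison succeeds
assert(java.util.Arrays.equals(a, b))  // equivalent Java utility method
```

A fix along the lines the issue implies would special-case binary values to an element-wise comparison such as `java.util.Arrays.equals` instead of plain `==`.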
[jira] [Updated] (SPARK-5135) Add support for describe [extended] table to DDL in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5135: --- Assignee: Li Sheng Add support for describe [extended] table to DDL in SQLContext -- Key: SPARK-5135 URL: https://issues.apache.org/jira/browse/SPARK-5135 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.3.0 Reporter: Li Sheng Assignee: Li Sheng Priority: Minor Fix For: 1.3.0 Original Estimate: 72h Remaining Estimate: 72h Support the Describe Table command: describe [extended] tableName. This also supports external datasource tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5380) There will be an ArrayIndexOutOfBoundsException if the format of the source file is wrong
[ https://issues.apache.org/jira/browse/SPARK-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5380: --- Assignee: Leo_lh There will be an ArrayIndexOutOfBoundsException if the format of the source file is wrong - Key: SPARK-5380 URL: https://issues.apache.org/jira/browse/SPARK-5380 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Reporter: Leo_lh Assignee: Leo_lh Priority: Minor Fix For: 1.3.0, 1.4.0 When I build a graph with a file format error, there will be an ArrayIndexOutOfBoundsException -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5528) Support schema merging while reading Parquet files
[ https://issues.apache.org/jira/browse/SPARK-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5528: --- Assignee: Cheng Lian Support schema merging while reading Parquet files -- Key: SPARK-5528 URL: https://issues.apache.org/jira/browse/SPARK-5528 Project: Spark Issue Type: Improvement Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.3.0 Spark 1.2.0 and prior versions only read the Parquet schema from {{_metadata}} or a random Parquet part-file, and assume all part-files share exactly the same schema. In practice, it's common that users append new columns to an existing Parquet schema. Parquet has native schema merging support for such scenarios. Spark SQL should also support this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5640) org.apache.spark.sql.catalyst.ScalaReflection is not thread safe
[ https://issues.apache.org/jira/browse/SPARK-5640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5640: --- Assignee: Tobias Schlatter org.apache.spark.sql.catalyst.ScalaReflection is not thread safe Key: SPARK-5640 URL: https://issues.apache.org/jira/browse/SPARK-5640 Project: Spark Issue Type: Bug Reporter: Tobias Schlatter Assignee: Tobias Schlatter Fix For: 1.3.0 ScalaReflection uses the Scala reflection API but does not synchronize (for example in the {{schemaFor}} method). This leads to concurrency bugs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5619) Support 'show roles' in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-5619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5619: --- Assignee: Yadong Qi Support 'show roles' in HiveContext --- Key: SPARK-5619 URL: https://issues.apache.org/jira/browse/SPARK-5619 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yadong Qi Assignee: Yadong Qi Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5686) Support `show current roles`
[ https://issues.apache.org/jira/browse/SPARK-5686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5686: --- Assignee: Li Sheng Support `show current roles` Key: SPARK-5686 URL: https://issues.apache.org/jira/browse/SPARK-5686 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Li Sheng Assignee: Li Sheng Priority: Minor Fix For: 1.3.0 Original Estimate: 3h Remaining Estimate: 3h show current roles -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5668) spark_ec2.py region parameter could be either mandatory or its value displayed
[ https://issues.apache.org/jira/browse/SPARK-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5668: --- Assignee: Miguel Peralvo spark_ec2.py region parameter could be either mandatory or its value displayed -- Key: SPARK-5668 URL: https://issues.apache.org/jira/browse/SPARK-5668 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Miguel Peralvo Assignee: Miguel Peralvo Priority: Minor Labels: starter Fix For: 1.4.0 If the region parameter is not specified when invoking spark-ec2 (spark-ec2.py behind the scenes), it defaults to us-east-1. When the cluster doesn't belong to that region, after showing the "Searching for existing cluster Spark..." message, it causes an "ERROR: Could not find any existing cluster" exception because it doesn't find your cluster in the default region. As it doesn't tell you anything about the region, it can be a small headache for new users. In http://stackoverflow.com/questions/21171576/why-does-spark-ec2-fail-with-error-could-not-find-any-existing-cluster, Dmitriy Selivanov explains it. I propose that: 1. Either we make the search message a little more informative, with something like "Searching for existing cluster Spark in region " + opts.region. 2. Or we remove us-east-1 as the default and make the --region parameter mandatory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5716) Support TOK_CHARSETLITERAL in HiveQl
[ https://issues.apache.org/jira/browse/SPARK-5716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5716: --- Assignee: Adrian Wang Support TOK_CHARSETLITERAL in HiveQl Key: SPARK-5716 URL: https://issues.apache.org/jira/browse/SPARK-5716 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Fix For: 1.3.0 where value = _UTF8 0x12345678 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5667) Remove version from spark-ec2 example.
[ https://issues.apache.org/jira/browse/SPARK-5667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5667: --- Assignee: Miguel Peralvo Remove version from spark-ec2 example. -- Key: SPARK-5667 URL: https://issues.apache.org/jira/browse/SPARK-5667 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.2.0, 1.3.0, 1.2.1, 1.2.2 Reporter: Miguel Peralvo Assignee: Miguel Peralvo Priority: Trivial Labels: documentation Fix For: 1.3.0 Remove version from spark-ec2 example for spark-ec2/Launch Cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5595) In memory data cache should be invalidated after insert into/overwrite
[ https://issues.apache.org/jira/browse/SPARK-5595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5595: --- Assignee: Yin Huai In memory data cache should be invalidated after insert into/overwrite -- Key: SPARK-5595 URL: https://issues.apache.org/jira/browse/SPARK-5595 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5278) check ambiguous reference to fields in Spark SQL is incompleted
[ https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5278: --- Assignee: Wenchen Fan check ambiguous reference to fields in Spark SQL is incompleted --- Key: SPARK-5278 URL: https://issues.apache.org/jira/browse/SPARK-5278 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan Fix For: 1.3.0 In HiveContext, for a JSON string like {code}{"a": {"b": 1, "B": 2}}{code} the SQL `SELECT a.b from t` will report an error for an ambiguous reference to fields. But for a JSON string like {code}{"a": [{"b": 1, "B": 2}]}{code} the SQL `SELECT a[0].b from t` will pass and pick the first `b`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5324) Results of describe can't be queried
[ https://issues.apache.org/jira/browse/SPARK-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5324: --- Assignee: Li Sheng Results of describe can't be queried Key: SPARK-5324 URL: https://issues.apache.org/jira/browse/SPARK-5324 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Michael Armbrust Assignee: Li Sheng Fix For: 1.3.0 {code} sql(DESCRIBE TABLE test).registerTempTable(describeTest) sql(SELECT * FROM describeTest).collect() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5603) Preinsert casting and renaming rule is needed in the Analyzer
[ https://issues.apache.org/jira/browse/SPARK-5603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5603: --- Assignee: Yin Huai Preinsert casting and renaming rule is needed in the Analyzer - Key: SPARK-5603 URL: https://issues.apache.org/jira/browse/SPARK-5603 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.0 For an INSERT INTO/OVERWRITE statement, we should add necessary Cast and Alias to the output of the query. {code} CREATE TEMPORARY TABLE jsonTable (a int, b string) USING org.apache.spark.sql.json.DefaultSource OPTIONS ( path '...' ) INSERT OVERWRITE TABLE jsonTable SELECT a * 2, a * 4 FROM table {code} For a*2, we should create an Alias, so the InsertableRelation can know it is the column a. For a*4, it is actually the column b in jsonTable. We should first cast it to StringType and add an Alias b to it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5650) Optional 'FROM' clause in HiveQl
[ https://issues.apache.org/jira/browse/SPARK-5650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5650: --- Assignee: Liang-Chi Hsieh Optional 'FROM' clause in HiveQl Key: SPARK-5650 URL: https://issues.apache.org/jira/browse/SPARK-5650 Project: Spark Issue Type: Bug Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 In Hive, 'FROM' clause should be optional. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5679) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method
[ https://issues.apache.org/jira/browse/SPARK-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5679: --- Priority: Major (was: Blocker) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method -- Key: SPARK-5679 URL: https://issues.apache.org/jira/browse/SPARK-5679 Project: Spark Issue Type: Bug Components: Spark Core, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Kostas Sakellis Labels: flaky-test Please audit these and see if there are any assumptions with respect to File IO that might not hold in all cases. I'm happy to help if you can't find anything. These both failed in the same run: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-SBT/38/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink {code} org.apache.spark.metrics.InputOutputMetricsSuite.input metrics with mixed read method Failing for the past 13 builds (Since Failed#26 ) Took 48 sec. 
Error Message 2030 did not equal 6496 Stacktrace sbt.ForkMain$ForkError: 2030 did not equal 6496 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply$mcV$sp(InputOutputMetricsSuite.scala:135) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfter$$super$runTest(InputOutputMetricsSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.metrics.InputOutputMetricsSuite.runTest(InputOutputMetricsSuite.scala:46) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfterAll$$super$run(InputOutputMetricsSuite.scala:46) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfter
[jira] [Created] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
Patrick Wendell created SPARK-5731: -- Summary: Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming Reporter: Patrick Wendell Assignee: Tathagata Das {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds. Last failure message: 300 did not equal 48 didn't get all messages. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfterAll$$super$run(DirectKafkaStreamSuite.scala:38) at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.run(DirectKafkaStreamSuite.scala:38) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) at sbt.ForkMain$Run$2.call
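For context, the failing assertion sits inside ScalaTest's `Eventually.eventually`, which re-runs the block until it passes or the patience timeout expires — the "Attempted 110 times over 20.070287525 seconds" in the error is this retry loop giving up. A minimal sketch of that retry pattern (the counter here is illustrative, not from the Kafka suite):

```scala
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.{Millis, Seconds, Span}

// Retry the enclosed assertion for up to 20 seconds, re-checking every
// 100 ms. The Kafka suite uses the same pattern while draining messages.
var attempts = 0
eventually(timeout(Span(20, Seconds)), interval(Span(100, Millis))) {
  attempts += 1
  assert(attempts >= 5) // fails on early attempts, passes after retries
}
```

When the condition never becomes true within the timeout — here, the stream never delivering all expected messages — `eventually` rethrows the last assertion failure, which is what the stack trace above records.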
[jira] [Updated] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
[ https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5731: --- Affects Version/s: 1.3.0 Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds. Last failure message: 300 did not equal 48 didn't get all messages. {code}
[jira] [Updated] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
[ https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5731: --- Component/s: Tests Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds. Last failure message: 300 did not equal 48 didn't get all messages. {code}
[jira] [Updated] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5493: --- Assignee: Marcelo Vanzin Support proxy users under kerberos -- Key: SPARK-5493 URL: https://issues.apache.org/jira/browse/SPARK-5493 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Brock Noland Assignee: Marcelo Vanzin Fix For: 1.3.0 When using kerberos, services may want to use spark-submit to submit jobs as a separate user. For example a service like oozie might want to submit jobs as a client user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5493. Resolution: Fixed Fix Version/s: 1.3.0 Target Version/s: 1.3.0 Support proxy users under kerberos -- Key: SPARK-5493 URL: https://issues.apache.org/jira/browse/SPARK-5493 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Brock Noland Assignee: Marcelo Vanzin Fix For: 1.3.0 When using kerberos, services may want to use spark-submit to submit jobs as a separate user. For example a service like oozie might want to submit jobs as a client user.
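The feature builds on Hadoop's standard impersonation mechanism, in which a Kerberos-authenticated service user acts on behalf of a client user. A minimal sketch of that underlying mechanism (the "alice" user name is illustrative; the real user must be whitelisted via `hadoop.proxyuser.*` settings in core-site.xml, and this is not Spark's exact implementation):

```scala
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// The logged-in (kerberos-authenticated) service principal, e.g. oozie.
val realUser  = UserGroupInformation.getLoginUser
// A proxy UGI that acts as "alice" but authenticates as the real user.
val proxyUser = UserGroupInformation.createProxyUser("alice", realUser)

proxyUser.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // Work performed here (e.g. launching the application) runs as "alice".
  }
})
```

The user-facing side of this change is a proxy-user option on spark-submit, so a service such as Oozie can submit as itself while the job executes as the client user.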
Re: Powered by Spark: Concur
Thanks Paolo - I've fixed it. On Mon, Feb 9, 2015 at 11:10 PM, Paolo Platter paolo.plat...@agilelab.it wrote: Hi, I checked the powered by wiki too and Agile Labs should be Agile Lab. The link is wrong too, it should be www.agilelab.it. The description is correct. Thanks a lot Paolo Sent from my Windows Phone From: Denny Lee mailto:denny.g@gmail.com Sent: 10/02/2015 07:41 To: Matei Zaharia mailto:matei.zaha...@gmail.com Cc: dev@spark.apache.org mailto:dev@spark.apache.org Subject: Re: Powered by Spark: Concur Thanks Matei - much appreciated! On Mon Feb 09 2015 at 10:23:57 PM Matei Zaharia matei.zaha...@gmail.com wrote: Thanks Denny; added you. Matei On Feb 9, 2015, at 10:11 PM, Denny Lee denny.g@gmail.com wrote: Forgot to add Concur to the Powered by Spark wiki: Concur https://www.concur.com Spark SQL, MLLib Using Spark for travel and expenses analytics and personalization Thanks! Denny - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[jira] [Commented] (SPARK-5613) YarnClientSchedulerBackend fails to get application report when yarn restarts
[ https://issues.apache.org/jira/browse/SPARK-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14314821#comment-14314821 ] Patrick Wendell commented on SPARK-5613: I have cherry picked it into the 1.3 branch. YarnClientSchedulerBackend fails to get application report when yarn restarts - Key: SPARK-5613 URL: https://issues.apache.org/jira/browse/SPARK-5613 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Kashish Jain Assignee: Kashish Jain Priority: Minor Fix For: 1.3.0, 1.2.2 Original Estimate: 24h Remaining Estimate: 24h Steps to Reproduce 1) Run any spark job 2) Stop yarn while the spark job is running (an application id has been generated by now) 3) Restart yarn now 4) AsyncMonitorApplication thread fails due to ApplicationNotFoundException exception. This leads to termination of thread. Here is the StackTrace 15/02/05 05:22:37 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 15/02/05 05:22:38 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 15/02/05 05:22:39 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 15/02/05 05:22:40 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 5/02/05 05:22:40 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. 
Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) Exception in thread Yarn application state monitor org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1423113179043_0003' doesn't exist in RM. at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Unknown Source) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) at java.lang.reflect.Constructor.newInstance(Unknown Source) at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:166) at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at 
java.lang.reflect.Method.invoke(Unknown Source) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) at com.sun.proxy.$Proxy12.getApplicationReport(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:291) at org.apache.spark.deploy.yarn.Client.getApplicationReport(Client.scala:116) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:120) Caused by: org.apache.hadoop.ipc.RemoteException
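The crash happens because the monitor thread lets `ApplicationNotFoundException` propagate and kill itself when the restarted ResourceManager no longer knows the application id. A hedged sketch of the kind of handling the fix calls for (`fetchState` and `onLost` are hypothetical stand-ins, not the real `YarnClientSchedulerBackend` members):

```scala
import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException

// Illustrative monitor loop that survives a YARN restart: instead of the
// exception terminating the thread, "application not found" is treated as
// the application being gone, and the backend shuts down cleanly.
def monitor(fetchState: () => String, onLost: () => Unit): Unit = {
  try {
    var state = fetchState()            // may throw after an RM restart
    while (state != "FINISHED") {
      Thread.sleep(1000)
      state = fetchState()
    }
  } catch {
    case _: ApplicationNotFoundException =>
      onLost()  // restarted ResourceManager has no record of this app id
  }
}
```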
[jira] [Updated] (SPARK-4382) Add locations parameter to Twitter Stream
[ https://issues.apache.org/jira/browse/SPARK-4382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4382: --- Component/s: Streaming Add locations parameter to Twitter Stream - Key: SPARK-4382 URL: https://issues.apache.org/jira/browse/SPARK-4382 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Liang-Chi Hsieh When we request a Tweet stream, geo-location is one of the most important parameters. In addition to the track parameter, the locations parameter is widely used to ask for the Tweets falling within the requested bounding boxes. This PR adds the locations parameter to existing APIs.
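Under the hood this maps onto the twitter4j streaming filter, where `locations` takes pairs of (longitude, latitude) corners — the south-west corner first, then the north-east — with each pair of points defining one bounding box. A sketch of that underlying call (the coordinates, roughly the San Francisco area, are illustrative):

```scala
import twitter4j.FilterQuery

// Filter the stream by keyword and by a single bounding box:
// (-122.75, 36.8) is the SW corner, (-121.75, 37.8) the NE corner.
val query = new FilterQuery()
  .track("spark")
  .locations(Array(Array(-122.75, 36.8), Array(-121.75, 37.8)))
```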
[jira] [Created] (SPARK-5735) Replace uses of EasyMock with Mockito
Patrick Wendell created SPARK-5735: -- Summary: Replace uses of EasyMock with Mockito Key: SPARK-5735 URL: https://issues.apache.org/jira/browse/SPARK-5735 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell There are a few reasons we should drop EasyMock. First, we should have a single mocking framework in our tests in general to keep things consistent. Second, EasyMock has caused us some dependency pain in our tests due to objenesis. We aren't totally sure, but suspect such conflicts might be causing non-deterministic test failures.
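For reference, the Mockito stub-then-verify style the migrated tests would use looks roughly like this (the `BlockFetcher` trait is a hypothetical example, not taken from the Spark test suite):

```scala
import org.mockito.Mockito._

// A hypothetical collaborator to mock.
trait BlockFetcher {
  def fetch(blockId: String): Array[Byte]
}

// Stub the behavior, exercise it, then verify the interaction.
val fetcher = mock(classOf[BlockFetcher])
when(fetcher.fetch("block-1")).thenReturn(Array[Byte](1, 2, 3))

assert(fetcher.fetch("block-1").sameElements(Array[Byte](1, 2, 3)))
verify(fetcher).fetch("block-1")
```

Unlike EasyMock's record/replay model, Mockito stubs and verifies without an explicit replay phase, which keeps test bodies shorter and is one reason consolidating on a single framework simplifies the suite.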