[jira] [Commented] (SPARK-5845) Time to cleanup intermediate shuffle files not included in shuffle write time
[ https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335667#comment-14335667 ] Patrick Wendell commented on SPARK-5845: [~kayousterhout] did you mean the time required to delete spill files used during aggregation? For the shuffle files themselves, they are deleted asynchronously, as [~ilganeli] has mentioned. Time to cleanup intermediate shuffle files not included in shuffle write time - Key: SPARK-5845 URL: https://issues.apache.org/jira/browse/SPARK-5845 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.3.0, 1.2.1 Reporter: Kay Ousterhout Assignee: Ilya Ganelin Priority: Minor When the disk is contended, I've observed cases when it takes as long as 7 seconds to clean up all of the intermediate spill files for a shuffle (when using the sort-based shuffle, but bypassing merging because there are <= 200 shuffle partitions). This happens even when the shuffle data is not huge (152MB written from one of the tasks where I observed this). This is effectively part of the shuffle write time (because it's a necessary side effect of writing data to disk), so it should be added to the shuffle write time to facilitate debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
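The accounting the issue proposes, that cleanup is a necessary side effect of writing and should count toward the same metric, can be sketched outside Spark. A rough Python stand-in; the function and metric names are made up for illustration and are not Spark's instrumentation:

```python
import os
import tempfile
import time

def write_and_cleanup_with_metrics(partitions):
    """Write each partition to a spill file, then delete them all,
    counting the deletion time toward 'shuffle write time' as the
    issue proposes. Purely illustrative, not Spark's code."""
    write_time_ns = 0
    spill_files = []

    for i, data in enumerate(partitions):
        start = time.perf_counter_ns()
        fd, path = tempfile.mkstemp(suffix=f".spill{i}")
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        write_time_ns += time.perf_counter_ns() - start
        spill_files.append(path)

    # Cleanup is a necessary side effect of writing, so time it too.
    start = time.perf_counter_ns()
    for path in spill_files:
        os.remove(path)
    write_time_ns += time.perf_counter_ns() - start
    return write_time_ns
```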
[jira] [Updated] (SPARK-3851) Support for reading parquet files with different but compatible schema
[ https://issues.apache.org/jira/browse/SPARK-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3851: --- Fix Version/s: 1.3.0 Support for reading parquet files with different but compatible schema -- Key: SPARK-3851 URL: https://issues.apache.org/jira/browse/SPARK-3851 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.3.0 Right now it is required that all of the parquet files have the same schema. It would be nice to support some safe subset of cases where the schemas of the files are different. For example: - Adding and removing nullable columns. - Widening types (a column that is of both Int and Long type)
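The safe subset described here can be sketched as a small schema-merge routine: union the field names, treat a column present in only one file as nullable, and widen an Int/Long conflict to Long. A hedged Python sketch, where the type names and the widening table are illustrative, not Spark SQL's actual classes or rules:

```python
# Hypothetical widening rules: an (int, long) conflict resolves to long.
WIDENINGS = {frozenset(["int", "long"]): "long"}

def merge_schemas(a, b):
    """Merge two schemas given as dicts of column name -> type string.
    Columns present in only one schema are kept (treated as nullable)."""
    merged = {}
    for name in list(a) + [n for n in b if n not in a]:
        ta, tb = a.get(name), b.get(name)
        if ta is None or tb is None:
            # Added/removed nullable column: keep whichever side has it.
            merged[name] = ta or tb
        elif ta == tb:
            merged[name] = ta
        else:
            widened = WIDENINGS.get(frozenset([ta, tb]))
            if widened is None:
                raise TypeError(f"incompatible types for {name}: {ta} vs {tb}")
            merged[name] = widened
    return merged
```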
Re: Can you add Big Industries to the Powered by Spark page?
I've added it, thanks! On Fri, Feb 20, 2015 at 12:22 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, Could you please add Big Industries to the Powered by Spark page at https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark ? Company Name: Big Industries URL: http://www.bigindustries.be/ Spark Components: Spark Streaming Use Case: Big Content Platform Summary: The Big Content Platform is a business-to-business content asset management service providing a searchable, aggregated source of live news feeds, public domain media and archives of content. The platform is founded on Apache Hadoop, uses the HDFS filesystem, Apache Spark, Titan Distributed Graph Database, HBase, and Solr. Additionally, the platform leverages public datasets like Freebase, DBpedia, Wiktionary, and Geonames to support semantic text enrichment. Kind regards, Emre Sevinç http://www.bigindustries.be/ - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
[jira] [Resolved] (SPARK-5904) DataFrame methods with varargs do not work in Java
[ https://issues.apache.org/jira/browse/SPARK-5904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5904. Resolution: Fixed Fix Version/s: 1.3.0 I think rxin just forgot to close this. It was merged several days ago. DataFrame methods with varargs do not work in Java -- Key: SPARK-5904 URL: https://issues.apache.org/jira/browse/SPARK-5904 Project: Spark Issue Type: Sub-task Components: Java API, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Reynold Xin Priority: Blocker Labels: DataFrame Fix For: 1.3.0 DataFrame methods with varargs fail when called from Java due to a bug in Scala. This can be produced by, e.g., modifying the end of the example ml.JavaSimpleParamsExample in the master branch: {code} DataFrame results = model2.transform(test); results.printSchema(); // works results.collect(); // works results.filter("label > 0.0").count(); // works for (Row r: results.select("features", "label", "myProbability", "prediction").collect()) { // fails on select System.out.println("(" + r.get(0) + ", " + r.get(1) + ") -> prob=" + r.get(2) + ", prediction=" + r.get(3)); } {code} I have also tried groupBy and found that failed too.
The error looks like this: {code} Exception in thread "main" java.lang.AbstractMethodError: org.apache.spark.sql.DataFrameImpl.groupBy(Ljava/lang/String;[Ljava/lang/String;)Lorg/apache/spark/sql/GroupedData; at org.apache.spark.examples.ml.JavaSimpleParamsExample.main(JavaSimpleParamsExample.java:108) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} The error appears to be from this Scala bug with using varargs in an abstract method: [https://issues.scala-lang.org/browse/SI-9013] My current plan is to move the implementations of the methods with varargs from DataFrameImpl to DataFrame. However, this may cause issues with IncomputableColumn---feedback?? Thanks to [~joshrosen] for figuring the bug and fix out!
[jira] [Commented] (SPARK-5463) Fix Parquet filter push-down
[ https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333608#comment-14333608 ] Patrick Wendell commented on SPARK-5463: Bumping to critical. Per our offline discussion last week we probably won't hold the release for this. Fix Parquet filter push-down Key: SPARK-5463 URL: https://issues.apache.org/jira/browse/SPARK-5463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1, 1.2.2 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical
[jira] [Updated] (SPARK-5463) Fix Parquet filter push-down
[ https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5463: --- Priority: Critical (was: Blocker) Fix Parquet filter push-down Key: SPARK-5463 URL: https://issues.apache.org/jira/browse/SPARK-5463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1, 1.2.2 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical
[jira] [Updated] (SPARK-3650) Triangle Count handles reverse edges incorrectly
[ https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3650: --- Priority: Critical (was: Blocker) Triangle Count handles reverse edges incorrectly Key: SPARK-3650 URL: https://issues.apache.org/jira/browse/SPARK-3650 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0, 1.2.0 Reporter: Joseph E. Gonzalez Priority: Critical The triangle count implementation assumes that edges are aligned in a canonical direction. As stated in the documentation: bq. Note that the input graph should have its edges in canonical direction (i.e. the `sourceId` less than `destId`) However the TriangleCount algorithm does not verify that this condition holds and indeed even the unit tests exploit this functionality: {code:scala} val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ Array(0L -> -1L, -1L -> -2L, -2L -> 0L) val rawEdges = sc.parallelize(triangles, 2) val graph = Graph.fromEdgeTuples(rawEdges, true).cache() val triangleCount = graph.triangleCount() val verts = triangleCount.vertices verts.collect().foreach { case (vid, count) => if (vid == 0) { assert(count === 4) // <-- Should be 2 } else { assert(count === 2) // <-- Should be 1 } } {code}
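The missing canonicalization step is cheap to add before counting. A rough Python sketch of per-vertex triangle counting with explicit canonicalization (this is not the GraphX implementation), using the same test graph as the snippet above:

```python
from collections import defaultdict

def triangle_counts(edges):
    """Count triangles per vertex, first canonicalizing edge direction
    (the check TriangleCount itself skips): undirected, deduplicated,
    self-loops dropped."""
    canon = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    nbrs = defaultdict(set)
    for u, v in canon:
        nbrs[u].add(v)
        nbrs[v].add(u)
    counts = defaultdict(int)
    for u, v in canon:
        # Every common neighbor w of an edge (u, v) closes a triangle.
        for w in nbrs[u] & nbrs[v]:
            counts[u] += 1
            counts[v] += 1
            counts[w] += 1
    # Each triangle is discovered once per edge (3 times), bumping each
    # of its corners 3 times, so divide by 3.
    return {v: c // 3 for v, c in counts.items()}
```

With the two triangles from the unit test (0-1-2 and 0,-1,-2), vertex 0 correctly gets 2 and every other vertex gets 1, even if reverse-direction duplicates are present in the input.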
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
So actually, the list of blockers on JIRA is a bit outdated. These days I won't cut RC1 unless there are no known issues that I'm aware of that would actually block the release (that's what the snapshot ones are for). I'm going to clean those up and push others to do so also. The main issues I'm aware of that came about post RC1 are: 1. Python submission broken on YARN 2. The license issue in MLlib [now fixed]. 3. Varargs broken for Java Dataframes [now fixed] Re: Corey - yeah, as it stands now I try to wait if there are things that look like implicit -1 votes. On Mon, Feb 23, 2015 at 6:13 AM, Corey Nolet cjno...@gmail.com wrote: Thanks Sean. I glossed over the comment about SPARK-5669. On Mon, Feb 23, 2015 at 9:05 AM, Sean Owen so...@cloudera.com wrote: Yes my understanding from Patrick's comment is that this RC will not be released, but, to keep testing. There's an implicit -1 out of the gates there, I believe, and so the vote won't pass, so perhaps that's why there weren't further binding votes. I'm sure that will be formalized shortly. FWIW here are 10 issues still listed as blockers for 1.3.0: SPARK-5910 DataFrame.selectExpr("col as newName") does not work SPARK-5904 SPARK-5166 DataFrame methods with varargs do not work in Java SPARK-5873 Can't see partially analyzed plans SPARK-5546 Improve path to Kafka assembly when trying Kafka Python API SPARK-5517 SPARK-5166 Add input types for Java UDFs SPARK-5463 Fix Parquet filter push-down SPARK-5310 SPARK-5166 Update SQL programming guide for 1.3 SPARK-5183 SPARK-5180 Document data source API SPARK-3650 Triangle Count handles reverse edges incorrectly SPARK-3511 Create a RELEASE-NOTES.txt file in the repo On Mon, Feb 23, 2015 at 1:55 PM, Corey Nolet cjno...@gmail.com wrote: This vote was supposed to close on Saturday but it looks like no PMCs voted (other than the implicit vote from Patrick). Was there a discussion offline to cut an RC2? Was the vote extended?
- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
It's only been reported on this thread by Tom, so far. On Mon, Feb 23, 2015 at 10:29 AM, Marcelo Vanzin van...@cloudera.com wrote: Hey Patrick, Do you have a link to the bug related to Python and Yarn? I looked at the blockers in Jira but couldn't find it. On Mon, Feb 23, 2015 at 10:18 AM, Patrick Wendell pwend...@gmail.com wrote: So actually, the list of blockers on JIRA is a bit outdated. These days I won't cut RC1 unless there are no known issues that I'm aware of that would actually block the release (that's what the snapshot ones are for). I'm going to clean those up and push others to do so also. The main issues I'm aware of that came about post RC1 are: 1. Python submission broken on YARN 2. The license issue in MLlib [now fixed]. 3. Varargs broken for Java Dataframes [now fixed] Re: Corey - yeah, as it stands now I try to wait if there are things that look like implicit -1 votes. -- Marcelo
Re: FW: Submitting jobs to Spark EC2 cluster remotely
What happens if you submit from the master node itself on ec2 (in client mode), does that work? What about in cluster mode? It would be helpful if you could print the full launch command for the executor that is failing. That might show that spark.driver.host is being set strangely. IIRC we print the launch command before starting the executor. Overall the standalone cluster mode is not as well tested across environments with asymmetric connectivity. I didn't actually realize that akka (which the submission uses) can handle this scenario. But it does seem like the job is submitted, it's just not starting correctly. - Patrick On Mon, Feb 23, 2015 at 1:13 AM, Oleg Shirokikh o...@solver.com wrote: Patrick, I haven't changed the configs much. I just executed ec2-script to create 1 master, 2 slaves cluster. Then I try to submit the jobs from remote machine leaving all defaults configured by Spark scripts as default. I've tried to change configs as suggested in other mailing-list and stack overflow threads (such as setting spark.driver.host, etc...), removed (hopefully) all security/firewall restrictions from AWS, etc. but it didn't help. I think that what you are saying is exactly the issue: on my master node UI at the bottom I can see the list of Completed Drivers all with ERROR state... Thanks, Oleg -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Monday, February 23, 2015 12:59 AM To: Oleg Shirokikh Cc: user@spark.apache.org Subject: Re: Submitting jobs to Spark EC2 cluster remotely Can you list other configs that you are setting? It looks like the executor can't communicate back to the driver. I'm actually not sure it's a good idea to set spark.driver.host here, you want to let spark set that automatically. - Patrick On Mon, Feb 23, 2015 at 12:48 AM, Oleg Shirokikh o...@solver.com wrote: Dear Patrick, Thanks a lot for your quick response.
Indeed, following your advice I've uploaded the jar onto S3 and FileNotFoundException is gone now and job is submitted in cluster deploy mode. However, now both (client and cluster) fail with the following errors in executors (they keep exiting/killing executors as I see in UI): 15/02/23 08:42:46 ERROR security.UserGroupInformation: PriviledgedActionException as:oleg cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Full log is: 15/02/23 01:59:11 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 15/02/23 01:59:12 INFO spark.SecurityManager: Changing view acls to: root,oleg 15/02/23 01:59:12 INFO spark.SecurityManager: Changing modify acls to: root,oleg 15/02/23 01:59:12 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, oleg); users with modify permissions: Set(root, oleg) 15/02/23 01:59:12 INFO slf4j.Slf4jLogger: Slf4jLogger started 15/02/23 01:59:12 INFO Remoting: Starting remoting 15/02/23 01:59:13 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@ip-172-31-33-194.us-west-2.compute.internal:39379] 15/02/23 01:59:13 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 39379.
15/02/23 01:59:43 ERROR security.UserGroupInformation: PriviledgedActionException as:oleg cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) ... 4 more Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107
[jira] [Updated] (SPARK-5920) Use a BufferedInputStream to read local shuffle data
[ https://issues.apache.org/jira/browse/SPARK-5920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5920: --- Priority: Critical (was: Major) Use a BufferedInputStream to read local shuffle data Key: SPARK-5920 URL: https://issues.apache.org/jira/browse/SPARK-5920 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.3.0, 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Critical When reading local shuffle data, Spark doesn't currently buffer the local reads into larger chunks, which can lead to terrible disk performance if many tasks are concurrently reading local data from the same disk. We should use a BufferedInputStream to mitigate this problem; we can lazily create the input stream to avoid allocating a bunch of in-memory buffers at the same time for tasks that read shuffle data from a large number of local blocks.
[jira] [Updated] (SPARK-5920) Use a BufferedInputStream to read local shuffle data
[ https://issues.apache.org/jira/browse/SPARK-5920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5920: --- Priority: Blocker (was: Critical) Use a BufferedInputStream to read local shuffle data Key: SPARK-5920 URL: https://issues.apache.org/jira/browse/SPARK-5920 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.3.0, 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Blocker When reading local shuffle data, Spark doesn't currently buffer the local reads into larger chunks, which can lead to terrible disk performance if many tasks are concurrently reading local data from the same disk. We should use a BufferedInputStream to mitigate this problem; we can lazily create the input stream to avoid allocating a bunch of in-memory buffers at the same time for tasks that read shuffle data from a large number of local blocks.
[jira] [Commented] (SPARK-5920) Use a BufferedInputStream to read local shuffle data
[ https://issues.apache.org/jira/browse/SPARK-5920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14331986#comment-14331986 ] Patrick Wendell commented on SPARK-5920: We should definitely do this. Use a BufferedInputStream to read local shuffle data Key: SPARK-5920 URL: https://issues.apache.org/jira/browse/SPARK-5920 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.3.0, 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Blocker When reading local shuffle data, Spark doesn't currently buffer the local reads into larger chunks, which can lead to terrible disk performance if many tasks are concurrently reading local data from the same disk. We should use a BufferedInputStream to mitigate this problem; we can lazily create the input stream to avoid allocating a bunch of in-memory buffers at the same time for tasks that read shuffle data from a large number of local blocks.
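The lazy-creation idea in the issue description can be sketched as a thin wrapper that defers opening the buffered stream until the first read, so a task holding many block readers does not allocate all of their buffers up front. A Python analogue of the Java BufferedInputStream approach (the class name and buffer size are illustrative, not Spark's code):

```python
import io

class LazyBufferedBlockReader:
    """Wrap a local block file in a buffered stream, created lazily on
    first read. Illustrative sketch, not Spark's implementation."""

    def __init__(self, path, buffer_size=64 * 1024):
        self._path = path
        self._buffer_size = buffer_size
        self._stream = None  # no buffer allocated until first read

    def _ensure_open(self):
        if self._stream is None:
            raw = open(self._path, "rb", buffering=0)  # unbuffered file
            self._stream = io.BufferedReader(raw, self._buffer_size)
        return self._stream

    def read(self, n=-1):
        return self._ensure_open().read(n)

    def close(self):
        if self._stream is not None:
            self._stream.close()
```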
[jira] [Commented] (SPARK-5916) $SPARK_HOME/bin/beeline conflicts with $HIVE_HOME/bin/beeline
[ https://issues.apache.org/jira/browse/SPARK-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332001#comment-14332001 ] Patrick Wendell commented on SPARK-5916: The naming conflict is unfortunate. However, scripts like this are a public API in Spark, so I don't think we can remove this API randomly in a minor release given our versioning policies. For instance, people write scripts that invoke beeline to submit queries, etc, and it would break those. One thing we could potentially do is rename this to spark-beeline (or even to something else) and update references accordingly. And then leave ./bin/beeline as a backwards-compatible alias. And if individual distributors want to order their path so that the beeline hits the Hive version first, they can do that. However, that approach is also confusing because we'd have beeline and spark-beeline both there. Ping also [~marmbrus] and [~matei] regarding how we approach changes to scripts like this. In the past I don't think we've ever changed one of these in a minor release. $SPARK_HOME/bin/beeline conflicts with $HIVE_HOME/bin/beeline - Key: SPARK-5916 URL: https://issues.apache.org/jira/browse/SPARK-5916 Project: Spark Issue Type: Improvement Components: SQL Reporter: Carl Steinbach Priority: Minor Hive provides a JDBC CLI named beeline. Spark currently depends on beeline, but provides its own beeline wrapper script for launching it. This results in a conflict when both $HIVE_HOME/bin and $SPARK_HOME/bin appear on a user's PATH. In order to eliminate the potential for conflict I propose changing the name of Spark's beeline wrapper script to sparkline.
[jira] [Resolved] (SPARK-5887) Class not found exception com.datastax.spark.connector.rdd.partitioner.CassandraPartition
[ https://issues.apache.org/jira/browse/SPARK-5887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5887. Resolution: Invalid The Datastax connector is not part of the Apache Spark distribution, it's maintained by Datastax directly. So please reach out to them for support. Thanks! Class not found exception com.datastax.spark.connector.rdd.partitioner.CassandraPartition -- Key: SPARK-5887 URL: https://issues.apache.org/jira/browse/SPARK-5887 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Spark 1.2.1 Spark Cassandra Connector 1.2.0 Alpha2 Reporter: Vijay Pawnarkar I am getting the following class not found exception when using Spark 1.2.1 with spark-cassandra-connector_2.10-1.2.0-alpha2. When the job is submitted to Spark, it successfully adds the required connector JAR file to the Worker's classpath. Corresponding log entries are included in the following description. From the log statements and looking at the Spark 1.2.1 codebase, it looks like the JAR gets added to the urlClassLoader via Executor.scala's updateDependencies method. However, when it is time to execute the task, it's not able to resolve the class name.
[task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, 127.0.0.1): java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- LOG indicating JAR files were added to worker classpath. 
15/02/17 16:56:48 INFO Executor: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar with timestamp 1424210185005 15/02/17 16:56:48 INFO Utils: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar to C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c-90677d12ff99\spark-35f5ed4b-041d-40d8-8854-b243787de188\fetchFileTemp4665176275367448514.tmp 15/02/17 16:56:48 DEBUG Utils: fetchFile not using security 15/02/17 16:56:48 INFO Utils: Copying C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c-90677d12ff99\spark-35f5ed4b-041d-40d8-8854-b243787de188\16215993091424210185005_cache to C:\localapps\spark-1.2.1-bin-hadoop2.4\work\app-20150217165625-0006\0\.\spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar 15/02/17 16:56:48 INFO Executor: Adding file:/C:/localapps/spark-1.2.1-bin-hadoop2.4/work/app-20150217165625-0006/0/./spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar to class loader 15/02/17 16:56:50 INFO Executor: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector_2.10-1.2.0-alpha2.jar with timestamp 1424210185012 15/02/17 16:56:50 INFO Utils: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector_2.10-1.2.0-alpha2.jar to C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c
[jira] [Updated] (SPARK-5863) Performance regression in Spark SQL/Parquet due to ScalaReflection.convertRowToScala
[ https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5863: --- Priority: Critical (was: Major) Performance regression in Spark SQL/Parquet due to ScalaReflection.convertRowToScala Key: SPARK-5863 URL: https://issues.apache.org/jira/browse/SPARK-5863 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1 Reporter: Cristian Priority: Critical Was doing some perf testing on reading parquet files and noticed that moving from Spark 1.1 to 1.2 the performance is 3x worse. In the profiler the culprit showed up as being in ScalaReflection.convertRowToScala. Particularly this zip is the issue: {code} r.toSeq.zip(schema.fields.map(_.dataType)) {code} I see there's a comment on that currently that this is slow but it wasn't fixed. This actually produces a 3x degradation in parquet read performance, at least in my test case. Edit: the map is part of the issue as well. This whole code block is in a tight loop and allocates a new ListBuffer that needs to grow for each transformation. A possible solution is to change to using seq.view which would allocate iterators instead.
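The fix the reporter suggests amounts to hoisting the per-column work out of the row loop, so the zip/map allocations happen once per schema rather than once per row. A Python stand-in for the Scala code (the converter table and function names are hypothetical, purely to illustrate the hoisting):

```python
def make_row_converter(column_types):
    """Build the per-column converters once per schema, instead of
    re-deriving them (via zip/map) inside the per-row loop."""
    # Hypothetical type -> converter mapping for illustration.
    converters = [int if t == "int" else str for t in column_types]

    def convert(row):
        return tuple(conv(v) for conv, v in zip(converters, row))

    return convert

def convert_partition(rows, column_types):
    convert = make_row_converter(column_types)  # built once, not per row
    return [convert(r) for r in rows]
```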
Merging code into branch 1.3
Hey Committers, Now that Spark 1.3 rc1 is cut, please restrict branch-1.3 merges to the following: 1. Fixes for issues blocking the 1.3 release (i.e. 1.2.X regressions) 2. Documentation and tests. 3. Fixes for non-blocker issues that are surgical, low-risk, and/or outside of the core. If there is a lower priority bug fix (a non-blocker) that requires nontrivial code changes, do not merge it into 1.3. If something seems borderline, feel free to reach out to me and we can work through it together. This is what we've done for the last few releases to make sure rc's become progressively more stable, and it is important towards helping us cut timely releases. Thanks! - Patrick
[VOTE] Release Apache Spark 1.3.0 (RC1)
Please vote on releasing the following candidate as Apache Spark version 1.3.0! The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.3.0-rc1/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1069/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/ Please vote on releasing this package as Apache Spark 1.3.0! The vote is open until Saturday, February 21, at 08:03 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.3.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == How can I help test this release? == If you are a Spark user, you can help us test this release by taking a Spark 1.2 workload and running on this release candidate, then reporting any regressions. == What justifies a -1 vote for this release? == This vote is happening towards the end of the 1.3 QA period, so -1 votes should only occur for significant regressions from 1.2.1. Bugs already present in 1.2.X, minor regressions, or bugs related to new features will not block this release. - Patrick
[jira] [Resolved] (SPARK-5864) support .jar as python package
[ https://issues.apache.org/jira/browse/SPARK-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5864. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Davies Liu support .jar as python package -- Key: SPARK-5864 URL: https://issues.apache.org/jira/browse/SPARK-5864 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.3.0 Support .jar file as python package (same as .zip or .egg)
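The reason a .jar can work like a .zip or .egg as a Python package is that a jar is an ordinary zip archive, and CPython's zipimport handles zip archives on sys.path regardless of file extension. A small self-contained sketch (the archive name and module contents are made up):

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny "jar" containing one Python module.
jar_path = os.path.join(tempfile.mkdtemp(), "mylib.jar")
with zipfile.ZipFile(jar_path, "w") as jar:
    jar.writestr("mylib.py", "ANSWER = 42\n")

# A jar is just a zip archive, so zipimport can import from it
# once it is on sys.path.
sys.path.insert(0, jar_path)
import mylib

print(mylib.ANSWER)  # -> 42
```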
[jira] [Resolved] (SPARK-5850) Remove experimental label for Scala 2.11 and FlumePollingStream
[ https://issues.apache.org/jira/browse/SPARK-5850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5850. Resolution: Fixed Fix Version/s: 1.3.0 Remove experimental label for Scala 2.11 and FlumePollingStream --- Key: SPARK-5850 URL: https://issues.apache.org/jira/browse/SPARK-5850 Project: Spark Issue Type: Bug Components: Spark Core, Streaming Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.3.0 These things have been out for at least one release and can be promoted.
[jira] [Resolved] (SPARK-5856) In Maven build script, launch Zinc with more memory
[ https://issues.apache.org/jira/browse/SPARK-5856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5856. Resolution: Fixed Fix Version/s: 1.3.0 In Maven build script, launch Zinc with more memory --- Key: SPARK-5856 URL: https://issues.apache.org/jira/browse/SPARK-5856 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.3.0 I've seen out of memory exceptions when trying to run many parallel builds against the same Zinc server during packaging. We should use the same increased memory settings we use for Maven itself.
[jira] [Updated] (SPARK-4579) Scheduling Delay appears negative
[ https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4579: --- Assignee: Andrew Or Scheduling Delay appears negative - Key: SPARK-4579 URL: https://issues.apache.org/jira/browse/SPARK-4579 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Arun Ahuja Assignee: Andrew Or Priority: Critical !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png!
[jira] [Commented] (SPARK-4579) Scheduling Delay appears negative
[ https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325622#comment-14325622 ] Patrick Wendell commented on SPARK-4579: [~andrewor14] Can you take a look at this one? Scheduling Delay appears negative - Key: SPARK-4579 URL: https://issues.apache.org/jira/browse/SPARK-4579 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Arun Ahuja Assignee: Andrew Or Priority: Critical !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png!
[jira] [Updated] (SPARK-4579) Scheduling Delay appears negative
[ https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4579: --- Labels: (was: starter) Scheduling Delay appears negative - Key: SPARK-4579 URL: https://issues.apache.org/jira/browse/SPARK-4579 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Arun Ahuja Priority: Critical !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png!
[jira] [Updated] (SPARK-4579) Scheduling Delay appears negative
[ https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4579: --- Priority: Critical (was: Minor) Scheduling Delay appears negative - Key: SPARK-4579 URL: https://issues.apache.org/jira/browse/SPARK-4579 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Arun Ahuja Priority: Critical Labels: starter !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png!
[jira] [Updated] (SPARK-4579) Scheduling Delay appears negative
[ https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4579: --- Labels: starter (was: ) Scheduling Delay appears negative - Key: SPARK-4579 URL: https://issues.apache.org/jira/browse/SPARK-4579 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Arun Ahuja Priority: Critical Labels: starter !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png!
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
UISeleniumSuite: *** RUN ABORTED *** java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal ... This is a newer test suite. There is something flaky about it and we should definitely fix it, but IMO it's not a blocker. Patrick, this link gives a 404: https://people.apache.org/keys/committer/pwendell.asc Works for me. Maybe it's some ephemeral issue? Finally, I realized I failed to get the fix for https://issues.apache.org/jira/browse/SPARK-5669 correct, and that has to be correct for the release. I'll patch that up straight away, sorry. I believe the result of the intended fix is still as I described in SPARK-5669, so there is no bad news there. A local test seems to confirm it and I'm waiting on Jenkins. If it's all good I'll merge that fix. So that much will need a new release; I apologize. Thanks for finding this. I'm going to leave this open for continued testing...
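For context on the failure above: a `NoClassDefFoundError` for `org/w3c/dom/ElementTraversal` in a Selenium/HtmlUnit-based suite typically means the `xml-apis` jar, which provides that interface, is missing from the test classpath. A hedged sketch of the kind of Maven dependency addition that resolves it (the version shown is illustrative, not taken from Spark's build):

```xml
<!-- Illustrative only: supplies org.w3c.dom.ElementTraversal for HtmlUnit tests -->
<dependency>
  <groupId>xml-apis</groupId>
  <artifactId>xml-apis</artifactId>
  <version>1.4.01</version>
  <scope>test</scope>
</dependency>
```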
Re: [Performance] Possible regression in rdd.take()?
I believe the heuristic governing the way that take() decides to fetch partitions changed between these versions. It could be that in certain cases the new heuristic is worse, but it might be good to just look at the source code and see, for your number of elements taken and number of partitions, if there was any effective change in how aggressively Spark fetched partitions. This was quite a while ago, but I think the change was made because in many cases the newer code works more efficiently. - Patrick On Wed, Feb 18, 2015 at 4:47 PM, Matt Cheah mch...@palantir.com wrote: Hi everyone, Between Spark 1.0.2 and Spark 1.1.1, I have noticed that rdd.take() consistently has a slower execution time on the later release. I was wondering if anyone else has had similar observations. I have two setups where this reproduces. The first is a local test. I launched a Spark cluster with 4 worker JVMs on my Mac, and launched a Spark-Shell. I retrieved the text file and immediately called rdd.take(N) on it, where N varied. The RDD is a plaintext CSV, 4GB in size, split over 8 files, which ends up having 128 partitions, and a total of 8000 rows. The numbers I discovered between Spark 1.0.2 and Spark 1.1.1 are, with all numbers being in seconds:
1 items
Spark 1.0.2: 0.069281, 0.012261, 0.011083
Spark 1.1.1: 0.11577, 0.097636, 0.11321
4 items
Spark 1.0.2: 0.023751, 0.069365, 0.023603
Spark 1.1.1: 0.224287, 0.229651, 0.158431
10 items
Spark 1.0.2: 0.047019, 0.049056, 0.042568
Spark 1.1.1: 0.353277, 0.288965, 0.281751
40 items
Spark 1.0.2: 0.216048, 0.198049, 0.796037
Spark 1.1.1: 1.865622, 2.224424, 2.037672
This small test suite indicates a consistently reproducible performance regression. I also notice this on a larger scale test. The cluster used is on EC2: ec2 instance type: m2.4xlarge 10 slaves, 1 master ephemeral storage 70 cores, 50 GB/box In this case, I have a 100GB dataset split into 78 files totaling 350 million items, and I take the first 50,000 items from the RDD. 
In this case, I have tested this on different formats of the raw data. With plaintext files:
Spark 1.0.2: 0.422s, 0.363s, 0.382s
Spark 1.1.1: 4.54s, 1.28s, 1.221s, 1.13s
With snappy-compressed Avro files:
Spark 1.0.2: 0.73s, 0.395s, 0.426s
Spark 1.1.1: 4.618s, 1.81s, 1.158s, 1.333s
Again demonstrating a reproducible performance regression. I was wondering if anyone else observed this regression, and if so, if anyone would have any idea what could possibly have caused it between Spark 1.0.2 and Spark 1.1.1? Thanks, -Matt Cheah
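The scale-up behavior Patrick describes can be sketched in a few lines. This is an illustrative simulation of a take(n)-style heuristic (scan one partition first, then grow the number of partitions scanned by a factor each round), not Spark's actual implementation; the names and the default factor of 4 are assumptions for the sake of the example:

```scala
object TakeHeuristicSketch {
  // Simulates a take(n)-style scan over partitions of known sizes.
  // Returns (itemsTaken, partitionsScanned).
  def simulateTake(partitionSizes: Seq[Int], n: Int, scaleUpFactor: Int = 4): (Int, Int) = {
    var taken = 0
    var scanned = 0
    var numToScan = 1 // start conservatively with a single partition
    while (taken < n && scanned < partitionSizes.length) {
      val batch = partitionSizes.slice(scanned, scanned + numToScan)
      taken += math.min(batch.sum, n - taken) // never take more than requested
      scanned += batch.length
      numToScan *= scaleUpFactor // fetch more partitions in the next round
    }
    (taken, scanned)
  }

  def main(args: Array[String]): Unit = {
    // 128 partitions of ~62 rows each, roughly the 8000-row local test above
    val parts = Seq.fill(128)(62)
    println(simulateTake(parts, 10))   // a tiny take scans only the first partition
    println(simulateTake(parts, 1000)) // a larger take needs several scale-up rounds
  }
}
```

The point of the sketch is that small changes to the starting batch size or scale-up factor change how many partitions (and hence how much I/O) a small take(N) touches, which is the kind of shift that could explain the timings reported above.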
[jira] [Resolved] (SPARK-5811) Documentation for --packages and --repositories on Spark Shell
[ https://issues.apache.org/jira/browse/SPARK-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5811. Resolution: Fixed Assignee: Burak Yavuz Documentation for --packages and --repositories on Spark Shell -- Key: SPARK-5811 URL: https://issues.apache.org/jira/browse/SPARK-5811 Project: Spark Issue Type: Documentation Components: Deploy, Spark Shell Affects Versions: 1.3.0 Reporter: Burak Yavuz Assignee: Burak Yavuz Priority: Critical Fix For: 1.3.0 Documentation for the new support for dependency management with Maven coordinates via --packages and --repositories
[jira] [Updated] (SPARK-4454) Race condition in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4454: --- Labels: backport-needed (was: ) Race condition in DAGScheduler -- Key: SPARK-4454 URL: https://issues.apache.org/jira/browse/SPARK-4454 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.1.0 Reporter: Rafal Kwasny Assignee: Josh Rosen Priority: Critical Labels: backport-needed Fix For: 1.3.0 It seems to be a race condition in DAGScheduler that manifests on jobs with high concurrency: {noformat} Exception in thread main java.util.NoSuchElementException: key not found: 35 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275) at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:937
[jira] [Reopened] (SPARK-4454) Race condition in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-4454: Actually, re-opening this since we need to back port it. Race condition in DAGScheduler -- Key: SPARK-4454 URL: https://issues.apache.org/jira/browse/SPARK-4454 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.1.0 Reporter: Rafal Kwasny Assignee: Josh Rosen Priority: Critical Fix For: 1.3.0 It seems to be a race condition in DAGScheduler that manifests on jobs with high concurrency: {noformat} Exception in thread main java.util.NoSuchElementException: key not found: 35 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275) at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:937) at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs
[jira] [Updated] (SPARK-4454) Race condition in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4454: --- Target Version/s: 1.3.0, 1.2.2 (was: 1.3.0) Race condition in DAGScheduler -- Key: SPARK-4454 URL: https://issues.apache.org/jira/browse/SPARK-4454 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.1.0 Reporter: Rafal Kwasny Assignee: Josh Rosen Priority: Critical Labels: backport-needed Fix For: 1.3.0 It seems to be a race condition in DAGScheduler that manifests on jobs with high concurrency: {noformat} Exception in thread main java.util.NoSuchElementException: key not found: 35 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275) at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:937
Re: Replacing Jetty with TomCat
Hey Niranda, It seems to me like a lot of effort to support multiple libraries inside of Spark like this, so I'm not sure that's a great solution. If you are building an application that embeds Spark, is it not possible for you to continue to use Jetty for Spark's internal servers and use Tomcat for your own servers? I would guess that many complex applications end up embedding multiple server libraries in various places (Spark itself has different transport mechanisms, etc.) - Patrick On Tue, Feb 17, 2015 at 7:14 PM, Niranda Perera niranda.per...@gmail.com wrote: Hi Sean, The main issue we have is running two web servers in a single product; we think it would not be an elegant solution. Could you please point me to the main areas where the Jetty server is tightly coupled, or extension points where I could plug in Tomcat instead of Jetty? If successful I could contribute it to the Spark project. :-) cheers On Mon, Feb 16, 2015 at 4:51 PM, Sean Owen so...@cloudera.com wrote: There's no particular reason you have to remove the embedded Jetty server, right? It doesn't prevent you from using it inside another app that happens to run in Tomcat. You won't be able to switch it out without rewriting a fair bit of code, no, but you don't need to. On Mon, Feb 16, 2015 at 5:08 AM, Niranda Perera niranda.per...@gmail.com wrote: Hi, We are thinking of integrating the Spark server inside a product. Our current product uses Tomcat as its webserver. Is it possible to switch the Jetty webserver in Spark to Tomcat off-the-shelf? Cheers -- Niranda -- Niranda
[jira] [Resolved] (SPARK-4454) Race condition in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4454. Resolution: Fixed Fix Version/s: 1.3.0 We can't be 100% sure this is fixed because it was not a reproducible issue. However, Josh has committed a patch that I think should make it hard to have race conditions around the cache location data structure. Race condition in DAGScheduler -- Key: SPARK-4454 URL: https://issues.apache.org/jira/browse/SPARK-4454 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.1.0 Reporter: Rafal Kwasny Assignee: Josh Rosen Priority: Critical Fix For: 1.3.0 It seems to be a race condition in DAGScheduler that manifests on jobs with high concurrency: {noformat} Exception in thread main java.util.NoSuchElementException: key not found: 35 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304
[jira] [Commented] (SPARK-4454) Race condition in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324747#comment-14324747 ] Patrick Wendell commented on SPARK-4454: [~srowen] yeah I meant the particular PR was bad, not that the issue does not exist. Race condition in DAGScheduler -- Key: SPARK-4454 URL: https://issues.apache.org/jira/browse/SPARK-4454 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.1.0 Reporter: Rafal Kwasny Assignee: Josh Rosen Priority: Minor There seems to be a race condition in DAGScheduler that manifests on jobs with high concurrency: {noformat}
Exception in thread "main" java.util.NoSuchElementException: key not found: 35
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
 at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
 at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275)
 at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:937
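The trace bottoms out in mutable `HashMap.apply` inside `DAGScheduler.getCacheLocs`, i.e. a lookup of a key that a concurrent job submission has not inserted yet. A minimal Java sketch of that failure mode and the obvious guard follows; the class and method names are hypothetical stand-ins, not Spark's actual Scala code:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Toy model of the race: getCacheLocs reads a shared map that another thread
// may be populating at the same time.
class CacheLocs {
    private final Map<Integer, List<String>> cacheLocs = new HashMap<>();

    // Unsafe analog of Scala's HashMap.apply(): assumes the key is present,
    // and throws exactly the exception seen in the report when it is not.
    List<String> getUnsafe(int rddId) {
        List<String> locs = cacheLocs.get(rddId);
        if (locs == null) {
            throw new NoSuchElementException("key not found: " + rddId);
        }
        return locs;
    }

    // Safer variant: one lock covers both the lookup and the initialization,
    // so a concurrent first-time lookup can never observe a missing key.
    List<String> getSafe(int rddId) {
        synchronized (cacheLocs) {
            return cacheLocs.computeIfAbsent(rddId, id -> Collections.emptyList());
        }
    }
}
```

The unsafe variant reproduces the `NoSuchElementException` from the report; the safe variant initializes missing entries under a single lock.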
[jira] [Updated] (SPARK-4454) Race condition in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4454: --- Priority: Critical (was: Minor) Race condition in DAGScheduler -- Key: SPARK-4454 URL: https://issues.apache.org/jira/browse/SPARK-4454 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.1.0 Reporter: Rafal Kwasny Assignee: Josh Rosen Priority: Critical
[jira] [Updated] (SPARK-4454) Race condition in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4454: --- Target Version/s: 1.3.0 Race condition in DAGScheduler -- Key: SPARK-4454 URL: https://issues.apache.org/jira/browse/SPARK-4454 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.1.0 Reporter: Rafal Kwasny Assignee: Josh Rosen Priority: Critical
[jira] [Commented] (SPARK-5864) support .jar as python package
[ https://issues.apache.org/jira/browse/SPARK-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324787#comment-14324787 ] Patrick Wendell commented on SPARK-5864: I merged Davies' PR, but per Burak's comment I think there is additional work needed. support .jar as python package -- Key: SPARK-5864 URL: https://issues.apache.org/jira/browse/SPARK-5864 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Support .jar files as python packages (same as .zip or .egg) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5778) Throw if nonexistent spark.metrics.conf file is provided
[ https://issues.apache.org/jira/browse/SPARK-5778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5778. Resolution: Fixed Fix Version/s: 1.3.0 Throw if nonexistent spark.metrics.conf file is provided -- Key: SPARK-5778 URL: https://issues.apache.org/jira/browse/SPARK-5778 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Ryan Williams Assignee: Ryan Williams Priority: Minor Fix For: 1.3.0 Spark looks for a {{MetricsSystem}} configuration file when the {{spark.metrics.conf}} parameter is set, [defaulting to the path {{metrics.properties}} when it's not set|https://github.com/apache/spark/blob/466b1f671b21f575d28f9c103f51765790914fe3/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L52-L55]. In the event of a failure to find or parse the file, [the exception is caught and an error is logged|https://github.com/apache/spark/blob/466b1f671b21f575d28f9c103f51765790914fe3/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L61]. This seems like reasonable behavior in the general case where the user has not specified a {{spark.metrics.conf}} file. However, I've been bitten several times by having specified a file that all or some executors did not have present (I typo'd the path, or forgot to add an additional {{--files}} flag to make my local metrics config file get shipped to all executors), the error was swallowed and I was confused about why I'd captured no metrics from a job that appeared to have run successfully. I'd like to change the behavior to actually throw if the user has specified a configuration file that doesn't exist. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
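The proposed fail-fast behavior can be sketched in a few lines. The loader below is a hypothetical illustration, not Spark's actual `MetricsConfig`: only an explicitly configured path that is missing should throw, while the implicit default path may be absent silently:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Optional;
import java.util.Properties;

// Sketch of the requested behavior: throw when a user-specified metrics file
// does not exist, instead of logging the error and continuing without metrics.
class MetricsConfigLoader {
    static final String DEFAULT_PATH = "metrics.properties";

    static Properties load(Optional<String> configuredPath) throws IOException {
        Properties props = new Properties();
        String path = configuredPath.orElse(DEFAULT_PATH);
        File file = new File(path);
        if (file.isFile()) {
            try (FileInputStream in = new FileInputStream(file)) {
                props.load(in);
            }
        } else if (configuredPath.isPresent()) {
            // Explicitly configured but absent: fail loudly on every executor.
            throw new FileNotFoundException("spark.metrics.conf file not found: " + path);
        }
        // Implicit default path missing: silently return empty config.
        return props;
    }
}
```

This keeps the lenient default-path behavior while surfacing typo'd or un-shipped config files immediately.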
[jira] [Updated] (SPARK-5778) Throw if nonexistent spark.metrics.conf file is provided
[ https://issues.apache.org/jira/browse/SPARK-5778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5778: --- Assignee: Ryan Williams Throw if nonexistent spark.metrics.conf file is provided -- Key: SPARK-5778 URL: https://issues.apache.org/jira/browse/SPARK-5778 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Ryan Williams Assignee: Ryan Williams Priority: Minor Spark looks for a {{MetricsSystem}} configuration file when the {{spark.metrics.conf}} parameter is set, [defaulting to the path {{metrics.properties}} when it's not set|https://github.com/apache/spark/blob/466b1f671b21f575d28f9c103f51765790914fe3/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L52-L55]. In the event of a failure to find or parse the file, [the exception is caught and an error is logged|https://github.com/apache/spark/blob/466b1f671b21f575d28f9c103f51765790914fe3/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L61]. This seems like reasonable behavior in the general case where the user has not specified a {{spark.metrics.conf}} file. However, I've been bitten several times by having specified a file that all or some executors did not have present (I typo'd the path, or forgot to add an additional {{--files}} flag to make my local metrics config file get shipped to all executors), the error was swallowed and I was confused about why I'd captured no metrics from a job that appeared to have run successfully. I'd like to change the behavior to actually throw if the user has specified a configuration file that doesn't exist. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5848) ConsoleProgressBar timer thread leaks SparkContext
[ https://issues.apache.org/jira/browse/SPARK-5848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5848: --- Component/s: (was: Web UI) Spark Shell ConsoleProgressBar timer thread leaks SparkContext -- Key: SPARK-5848 URL: https://issues.apache.org/jira/browse/SPARK-5848 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.1 Reporter: Matt Whelan ConsoleProgressBar starts a timer. This creates a thread (which is a garbage collection root) with a reference to the ConsoleProgressBar instance, which holds a reference to the SparkContext. That timer is never canceled, so SparkContexts are leaked. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
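The leak and its fix can be sketched in a few lines. The class below is a hypothetical stand-in, not the actual ConsoleProgressBar: the Timer's thread is a garbage-collection root, and its task holds a reference to the enclosing object, so everything that object references (such as a SparkContext) stays reachable until the timer is cancelled:

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal model of the leak: the TimerTask captures `this`, so the Timer
// thread keeps this object (and anything it references) alive.
class ProgressBar {
    final AtomicInteger ticks = new AtomicInteger();
    private final Timer timer = new Timer("refresh-progress", true);

    ProgressBar() {
        timer.schedule(new TimerTask() {
            @Override public void run() { ticks.incrementAndGet(); }
        }, 0, 50);
    }

    // The fix: expose a stop() hook that cancels the timer, and call it when
    // the owning context shuts down (e.g. from SparkContext.stop()).
    void stop() { timer.cancel(); }
}
```

After `stop()`, no further task executions are scheduled and the task (with its back-reference) becomes collectable.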
[jira] [Updated] (SPARK-5846) Spark SQL does not correctly set job description and scheduler pool
[ https://issues.apache.org/jira/browse/SPARK-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5846: --- Priority: Critical (was: Major) Spark SQL does not correctly set job description and scheduler pool --- Key: SPARK-5846 URL: https://issues.apache.org/jira/browse/SPARK-5846 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Critical Spark SQL currently sets the scheduler pool and job description AFTER jobs run (see https://github.com/apache/spark/blob/master/sql/hive-thriftserver/v0.13.1/src/main/scala/org/apache/spark/sql/hive/thriftserver/Shim13.scala#L168 -- which happens after calling hiveContext.sql). As a result, the description for a SQL job ends up being the SQL query corresponding to the previous job. The description should be set before the job is run so that it is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
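The off-by-one labeling can be shown with a toy model (hypothetical names, not Spark's actual API): the scheduler captures the job description when the job starts, so setting it after the query runs labels each job with the previous query's text:

```java
// Toy scheduler: the description in effect when runJob() is called is the one
// attached to that job, mirroring how Spark reads local properties at job start.
class Scheduler {
    private static String description = "";
    static void setJobDescription(String d) { description = d; }
    static String runJob() { return description; } // captured at job start
}

class SqlRunner {
    // Current (buggy) order: run the query, then set the description.
    static String buggySql(String query) {
        String label = Scheduler.runJob();     // job already labeled...
        Scheduler.setJobDescription(query);    // ...so this arrives too late
        return label;
    }

    // Proposed order: set the description, then run the query.
    static String fixedSql(String query) {
        Scheduler.setJobDescription(query);
        return Scheduler.runJob();
    }
}
```

With the buggy order, the second query's job carries the first query's description, which is exactly the symptom described above.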
[jira] [Updated] (SPARK-5850) Remove experimental label for Scala 2.11 and FlumePollingStream
[ https://issues.apache.org/jira/browse/SPARK-5850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5850: --- Summary: Remove experimental label for Scala 2.11 and FlumePollingStream (was: Clean up experimental label for Scala 2.11 and FlumePollingStream) Remove experimental label for Scala 2.11 and FlumePollingStream --- Key: SPARK-5850 URL: https://issues.apache.org/jira/browse/SPARK-5850 Project: Spark Issue Type: Bug Components: Spark Core, Streaming Reporter: Patrick Wendell Assignee: Patrick Wendell These things have been out for at least one release and can be promoted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5850) Clean up experimental label for Scala 2.11 and FlumePollingStream
Patrick Wendell created SPARK-5850: -- Summary: Clean up experimental label for Scala 2.11 and FlumePollingStream Key: SPARK-5850 URL: https://issues.apache.org/jira/browse/SPARK-5850 Project: Spark Issue Type: Bug Components: Spark Core, Streaming Reporter: Patrick Wendell Assignee: Patrick Wendell These things have been out for at least one release and can be promoted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5856) In Maven build script, launch Zinc with more memory
Patrick Wendell created SPARK-5856: -- Summary: In Maven build script, launch Zinc with more memory Key: SPARK-5856 URL: https://issues.apache.org/jira/browse/SPARK-5856 Project: Spark Issue Type: Improvement Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker I've seen out of memory exceptions when trying to run many parallel builds against the same Zinc server during packaging. We should use the same increased memory settings we use for Maven itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5081: --- Priority: Critical (was: Major) Shuffle write increases --- Key: SPARK-5081 URL: https://issues.apache.org/jira/browse/SPARK-5081 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Kevin Jung Priority: Critical The size of the shuffle write shown in the Spark web UI is much larger when I execute the same Spark job with the same input data on Spark 1.2 than on Spark 1.1. At the sortBy stage, the shuffle write is 98.1MB in Spark 1.1 but 146.9MB in Spark 1.2. I set the spark.shuffle.manager option to hash because its default value changed, but Spark 1.2 still writes more shuffle output than Spark 1.1. This increases disk I/O overhead significantly as the input file gets bigger and causes jobs to take more time to complete. For example, with about 100GB of input, the shuffle write is 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
spark 1.1
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|9|saveAsTextFile| |1169.4KB| |
|12|combineByKey| |1265.4KB|1275.0KB|
|6|sortByKey| |1276.5KB| |
|8|mapPartitions| |91.0MB|1383.1KB|
|4|apply| |89.4MB| |
|5|sortBy|155.6MB| |98.1MB|
|3|sortBy|155.6MB| | |
|1|collect| |2.1MB| |
|2|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
spark 1.2
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|12|saveAsTextFile| |1170.2KB| |
|11|combineByKey| |1264.5KB|1275.0KB|
|8|sortByKey| |1273.6KB| |
|7|mapPartitions| |134.5MB|1383.1KB|
|5|zipWithIndex| |132.5MB| |
|4|sortBy|155.6MB| |146.9MB|
|3|sortBy|155.6MB| | |
|2|collect| |2.0MB| |
|1|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5850) Remove experimental label for Scala 2.11 and FlumePollingStream
[ https://issues.apache.org/jira/browse/SPARK-5850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5850: --- Priority: Blocker (was: Major) Remove experimental label for Scala 2.11 and FlumePollingStream --- Key: SPARK-5850 URL: https://issues.apache.org/jira/browse/SPARK-5850 Project: Spark Issue Type: Bug Components: Spark Core, Streaming Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker These things have been out for at least one release and can be promoted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323629#comment-14323629 ] Patrick Wendell commented on SPARK-5081: Hey [~cb_betz], can you verify a few things? It would be good to make sure you revert all configuration changes from 1.2.0. Specifically, set spark.shuffle.blockTransferService to nio and set spark.shuffle.manager to hash. Also, verify in the UI that they are set correctly. If all this is set, can you give us the change in the size of the aggregate shuffle output between the two releases? Shuffle write increases --- Key: SPARK-5081 URL: https://issues.apache.org/jira/browse/SPARK-5081 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Kevin Jung The size of the shuffle write shown in the Spark web UI is much larger when I execute the same Spark job with the same input data on Spark 1.2 than on Spark 1.1. At the sortBy stage, the shuffle write is 98.1MB in Spark 1.1 but 146.9MB in Spark 1.2. I set the spark.shuffle.manager option to hash because its default value changed, but Spark 1.2 still writes more shuffle output than Spark 1.1. This increases disk I/O overhead significantly as the input file gets bigger and causes jobs to take more time to complete. For example, with about 100GB of input, the shuffle write is 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
spark 1.1
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|9|saveAsTextFile| |1169.4KB| |
|12|combineByKey| |1265.4KB|1275.0KB|
|6|sortByKey| |1276.5KB| |
|8|mapPartitions| |91.0MB|1383.1KB|
|4|apply| |89.4MB| |
|5|sortBy|155.6MB| |98.1MB|
|3|sortBy|155.6MB| | |
|1|collect| |2.1MB| |
|2|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
spark 1.2
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|12|saveAsTextFile| |1170.2KB| |
|11|combineByKey| |1264.5KB|1275.0KB|
|8|sortByKey| |1273.6KB| |
|7|mapPartitions| |134.5MB|1383.1KB|
|5|zipWithIndex| |132.5MB| |
|4|sortBy|155.6MB| |146.9MB|
|3|sortBy|155.6MB| | |
|2|collect| |2.0MB| |
|1|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
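For reference, the two settings Patrick asks to verify, collected in one place. This is a sketch only: a plain Map stands in for SparkConf so it runs without Spark on the classpath; in practice the values go into spark-defaults.conf, `--conf` flags, or `SparkConf.set`:

```java
import java.util.Map;

// The configuration that restores Spark 1.1 shuffle behavior under Spark 1.2,
// for an apples-to-apples comparison of shuffle write sizes.
class Spark11Defaults {
    static final Map<String, String> CONF = Map.of(
        "spark.shuffle.blockTransferService", "nio",  // 1.2 default is netty
        "spark.shuffle.manager", "hash"               // 1.2 default is sort
    );
}
```

Both values should then be visible on the Environment tab of the web UI, which is the verification step requested in the comment.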
[jira] [Commented] (SPARK-5745) Allow to use custom TaskMetrics implementation
[ https://issues.apache.org/jira/browse/SPARK-5745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14322091#comment-14322091 ] Patrick Wendell commented on SPARK-5745: Hey [~jlewandowski] - TaskMetrics are a mostly internal concept. In fact, there isn't really any nice framework for aggregation internally. We instead have a bunch of manual aggregation in various places. The primary user-facing API we have for aggregated counters is accumulators. Are there features lacking from accumulators that make it difficult for you to use them for your use case? Allow to use custom TaskMetrics implementation -- Key: SPARK-5745 URL: https://issues.apache.org/jira/browse/SPARK-5745 Project: Spark Issue Type: Wish Components: Spark Core Reporter: Jacek Lewandowski There can be various RDDs implemented, and {{TaskMetrics}} provides a great API for collecting metrics and aggregating them. However, some RDDs may want to register custom metrics (for example, the number of rows read), and the current implementation doesn't allow for this. I suppose this can be changed without modifying the whole interface - a factory could be used to create the initial {{TaskMetrics}} object, and the default factory could be overridden by the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
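The accumulator pattern Patrick points to, illustrated generically. This is plain Java, not Spark's Accumulator API: each task adds to a local counter, the driver merges the results, which covers custom per-task counters such as "rows read" without extending TaskMetrics:

```java
// Generic add-locally, merge-at-driver counter: the essence of what Spark's
// accumulators provide for user-defined per-task metrics.
class LongAccumulator {
    private long total;
    void add(long v) { total += v; }                       // called in a task
    void merge(LongAccumulator other) { total += other.value(); } // at driver
    long value() { return total; }
}
```

An RDD that wants to report rows read would call `add` per record and let the driver read the merged `value` after the job, rather than installing a custom TaskMetrics implementation.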
[jira] [Updated] (SPARK-5826) JavaStreamingContext.fileStream cause Configuration NotSerializableException
[ https://issues.apache.org/jira/browse/SPARK-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5826: --- Priority: Critical (was: Minor) JavaStreamingContext.fileStream cause Configuration NotSerializableException Key: SPARK-5826 URL: https://issues.apache.org/jira/browse/SPARK-5826 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Littlestar Priority: Critical Attachments: TestStream.java org.apache.spark.streaming.api.java.JavaStreamingContext.fileStream(String directory, Class<LongWritable> kClass, Class<Text> vClass, Class<TextInputFormat> fClass, Function<Path, Boolean> filter, boolean newFilesOnly, Configuration conf) I use JavaStreamingContext.fileStream on 1.3.0/master with a Configuration, but it throws a strange exception.
java.io.NotSerializableException: org.apache.hadoop.conf.Configuration
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440)
 at org.apache.spark.streaming.DStreamGraph$$anonfun$writeObject$1.apply$mcV$sp(DStreamGraph.scala:177)
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1075)
 at org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:172)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:184)
 at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:278)
 at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:169)
 at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:78)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1.aroundReceive(JobGenerator.scala:76)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
 at akka.dispatch.Mailbox.run(Mailbox.scala:220)
 at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec
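The usual workaround pattern for a non-serializable member such as Hadoop's Configuration is to keep it in a transient field and rebuild it from serializable state during deserialization (Spark's own SerializableWritable takes a similar approach for Hadoop types). The sketch below uses a stand-in `Settings` class, not Hadoop's Configuration, so it runs without Hadoop on the classpath:

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.hadoop.conf.Configuration: deliberately NOT Serializable.
class Settings {
    final Map<String, String> entries;
    Settings(Map<String, String> entries) { this.entries = entries; }
}

// Wrapper: the non-serializable object lives in a transient field; only its
// serializable state crosses the wire, and the object is rebuilt on read.
class SerializableSettings implements Serializable {
    private transient Settings settings;

    SerializableSettings(Settings settings) { this.settings = settings; }
    Settings value() { return settings; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeObject(new HashMap<>(settings.entries)); // serializable state only
    }

    @SuppressWarnings("unchecked")
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        settings = new Settings((Map<String, String>) in.readObject()); // rebuild
    }
}
```

Serializing the wrapper succeeds where serializing the raw object throws `NotSerializableException`, which mirrors the checkpoint failure in the trace.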
[jira] [Commented] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader
[ https://issues.apache.org/jira/browse/SPARK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321269#comment-14321269 ] Patrick Wendell commented on SPARK-5770: If the request is to support hot reloading of classes in a live JVM (I can't tell if it is), it seems fairly dangerous to do that. I'm not sure how that works if you have live objects of a specific class and then the class definition changes under them. Anyway, as Sean and Marcelo pointed out, it would be helpful to hear what the intended use case is! Use addJar() to upload a new jar file to executor, it can't be added to classloader --- Key: SPARK-5770 URL: https://issues.apache.org/jira/browse/SPARK-5770 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula First use addJar() to upload a jar to the executor, then change the jar content and upload it again. We can see that the jar file on the local disk has been updated, but the classloader still loads the old one. The executor log has no error or exception to point this out. I use spark-shell to test it, with spark.files.overwrite set to true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5801) Shuffle creates too many nested directories
[ https://issues.apache.org/jira/browse/SPARK-5801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5801: --- Priority: Critical (was: Major) Shuffle creates too many nested directories --- Key: SPARK-5801 URL: https://issues.apache.org/jira/browse/SPARK-5801 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.1 Reporter: Kay Ousterhout Priority: Critical When running Spark on EC2, there are 4 nested shuffle directories before the hashed directory names, for example: /mnt/spark/spark-5824d912-25af-4187-bc6a-29ae42cd78e5/spark-675133f0-b2c8-44a1-8775-5e394674609b/spark-69c1ea15-4e7f-454a-9f57-19763c7bdd17/spark-b036335c-60fa-48ab-a346-f1b420af2027/0c My understanding is that this should look like: /mnt/spark/spark-5824d912-25af-4187-bc6a-29ae42cd78e5/0c This happened when I was using the sort-based shuffle (all default configurations for Spark on EC2). This is not a correctness problem (the shuffle still works fine). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
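One plausible way to end up with the 4-deep nesting reported above (an illustrative guess, not the confirmed root cause) is that each layer creates its per-application directory under the previous layer's directory instead of under the shared root. A toy sketch, where `local_spark_dir` is a hypothetical helper:

```python
import os
import tempfile
import uuid


def local_spark_dir(root):
    """Hypothetical helper: create a per-application spark-<uuid>
    directory under whatever root the caller passes in."""
    d = os.path.join(root, "spark-" + str(uuid.uuid4()))
    os.makedirs(d)
    return d


# Expected behavior: every layer calls local_spark_dir(root), sharing one
# root, yielding /mnt/spark/spark-<uuid>/0c. The buggy pattern sketched
# here instead treats each layer's output as the next layer's root, so
# four layers produce the 4-deep nesting from the report.
root = tempfile.mkdtemp()
d = root
for _ in range(4):
    d = local_spark_dir(d)           # layer N nests inside layer N-1

hashed = os.path.join(d, "0c")       # hashed subdirectory, as in "/0c"
```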
[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK
[ https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321267#comment-14321267 ] Patrick Wendell commented on SPARK-5813: Hey [~florianverhein]. Just wondering, are there specific features of Oracle's JRE you are interested in? These days, Oracle's JRE and OpenJDK are basically equivalent. In the history of Spark, I don't think I've ever seen us have a bug that was specific to OpenJDK and not also present in the Oracle JDK. Given how much easier it is to install OpenJDK, I'm not sure it's worth the extra packaging annoyance to add Oracle Java. Just curious if you have a specific reason to want Oracle. Spark-ec2: Switch to OracleJDK -- Key: SPARK-5813 URL: https://issues.apache.org/jira/browse/SPARK-5813 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor Currently using OpenJDK; however, it is generally recommended to use the Oracle JDK, esp. for Hadoop deployments, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5801) Shuffle creates too many nested directories
[ https://issues.apache.org/jira/browse/SPARK-5801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5801: --- Component/s: Spark Core Shuffle creates too many nested directories --- Key: SPARK-5801 URL: https://issues.apache.org/jira/browse/SPARK-5801 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.1 Reporter: Kay Ousterhout Priority: Critical When running Spark on EC2, there are 4 nested shuffle directories before the hashed directory names, for example: /mnt/spark/spark-5824d912-25af-4187-bc6a-29ae42cd78e5/spark-675133f0-b2c8-44a1-8775-5e394674609b/spark-69c1ea15-4e7f-454a-9f57-19763c7bdd17/spark-b036335c-60fa-48ab-a346-f1b420af2027/0c My understanding is that this should look like: /mnt/spark/spark-5824d912-25af-4187-bc6a-29ae42cd78e5/0c This happened when I was using the sort-based shuffle (all default configurations for Spark on EC2). This is not a correctness problem (the shuffle still works fine). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
[ https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5731: --- Priority: Blocker (was: Major) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Blocker Labels: flaky-test {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds. Last failure message: 300 did not equal 48 didn't get all messages. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at 
org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38) at 
org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfterAll$$super$run(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at org.scalatest.BeforeAndAfterAll$class.run
[jira] [Commented] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
[ https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320739#comment-14320739 ] Patrick Wendell commented on SPARK-5731: [~c...@koeninger.org] [~tdas] FYI we've disabled this test because it's caused a huge productivity loss to ongoing development with frequent failures. Please try to get this test into good shape ASAP - otherwise this code will be untested. Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Blocker Labels: flaky-test {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds. Last failure message: 300 did not equal 48 didn't get all messages. 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38) at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241
[jira] [Updated] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
[ https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5731: --- Labels: flaky-test (was: ) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Blocker Labels: flaky-test {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds. Last failure message: 300 did not equal 48 didn't get all messages. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at 
org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38) at 
org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfterAll$$super$run(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at org.scalatest.BeforeAndAfterAll$class.run
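The failure message above ("The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds") comes from ScalaTest's `Eventually.eventually`, which retries an assertion block until it passes or a patience timeout expires. A minimal Python sketch of that retry pattern; parameter names and defaults are illustrative, not ScalaTest's:

```python
import time


def eventually(assertion, timeout=2.0, interval=0.05):
    """Minimal sketch of the ScalaTest Eventually pattern: keep calling
    `assertion` until it stops raising AssertionError or `timeout`
    elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return assertion()
        except AssertionError:
            if time.monotonic() >= deadline:
                # ScalaTest reports the attempt count and the last
                # failure message; here we simply re-raise it.
                raise
            time.sleep(interval)
```

A test built this way is only as reliable as its timeout: if the system under test can legitimately take longer than the patience window (as this Kafka suite apparently can on loaded Jenkins workers), the test turns flaky.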
[jira] [Resolved] (SPARK-5679) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method
[ https://issues.apache.org/jira/browse/SPARK-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5679. Resolution: Fixed Fix Version/s: 1.2.2 1.3.0 Assignee: Josh Rosen (was: Kostas Sakellis) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method -- Key: SPARK-5679 URL: https://issues.apache.org/jira/browse/SPARK-5679 Project: Spark Issue Type: Bug Components: Spark Core, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Josh Rosen Labels: flaky-test Fix For: 1.3.0, 1.2.2 Please audit these and see if there are any assumptions with respect to File IO that might not hold in all cases. I'm happy to help if you can't find anything. These both failed in the same run: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-SBT/38/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink {code} org.apache.spark.metrics.InputOutputMetricsSuite.input metrics with mixed read method Failing for the past 13 builds (Since Failed#26 ) Took 48 sec. 
Error Message 2030 did not equal 6496 Stacktrace sbt.ForkMain$ForkError: 2030 did not equal 6496 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply$mcV$sp(InputOutputMetricsSuite.scala:135) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfter$$super$runTest(InputOutputMetricsSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.metrics.InputOutputMetricsSuite.runTest(InputOutputMetricsSuite.scala:46) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfterAll$$super$run(InputOutputMetricsSuite.scala:46) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at org.scalatest.BeforeAndAfterAll
Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets
The map will start with a capacity of 64, but will grow to accommodate new data. Are you using the groupBy operator in Spark or are you using Spark SQL's group by? This usually happens if you are grouping or aggregating in a way that doesn't sufficiently condense the data created from each input partition. - Patrick On Wed, Feb 11, 2015 at 9:37 PM, fightf...@163.com fightf...@163.com wrote: Hi, We still have no adequate solution for this issue. Any available analytical rules or hints would be appreciated. Thanks, Sun. fightf...@163.com From: fightf...@163.com Date: 2015-02-09 11:56 To: user; dev Subject: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets Hi, The problem still exists. Would any experts take a look at this? Thanks, Sun. fightf...@163.com From: fightf...@163.com Date: 2015-02-06 17:54 To: user; dev Subject: Sort Shuffle performance issues about using AppendOnlyMap for large data sets Hi all, Recently we hit performance issues when using Spark 1.2.0 to read data from HBase and do some summary work. Our scenario is: read large data sets from HBase (maybe 100G+ files), form an hbaseRDD, transform it to a SchemaRDD, then group by and aggregate the data into much smaller summary data sets, and load the results into HBase (Phoenix). Our main issue: aggregating the large data sets into summary data sets takes too long (1 hour+), which is far worse performance than expected. We got the dump file attached and a stacktrace from jstack like the following: From the stacktrace and dump file we can see that processing large data sets causes frequent AppendOnlyMap growth, leading to a huge map entry size. We checked the source code of org.apache.spark.util.collection.AppendOnlyMap and found that the map is initialized with a capacity of 64. That seems too small for our use case. So the question is: has anyone encountered such issues before? How were they resolved?
I cannot find any JIRA issues for such problems; if someone has seen one, please kindly let us know. More specifically: is there any way for the user to configure the map capacity in Spark? If so, please tell us how to achieve that. Best Thanks, Sun. Thread 22432: (state = IN_JAVA) - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, line=224 (Compiled frame; information may be imprecise) - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() @bci=1, line=38 (Interpreted frame) - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, line=198 (Compiled frame) - org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=201, line=145 (Compiled frame) - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=3, line=32 (Compiled frame) - org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator) @bci=141, line=205 (Compiled frame) - org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator) @bci=74, line=58 (Interpreted frame) - org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=169, line=68 (Interpreted frame) - org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=2, line=41 (Interpreted frame) - org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted frame) - org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame) - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame) Thread 22431: (state = IN_JAVA) - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, line=224 (Compiled frame; 
information may be imprecise) - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() @bci=1, line=38 (Interpreted frame) - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, line=198 (Compiled frame) - org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=201, line=145 (Compiled frame) - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=3, line=32 (Compiled frame) - org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator) @bci=141, line=205 (Compiled frame) - org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator) @bci=74, line=58 (Interpreted frame) - org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
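Sun's question about the capacity-64 initialization can be made concrete with a toy sketch of the growth behavior (Python standing in for the Scala class; the starting capacity matches the description above, the load factor and everything else are illustrative). Starting small costs little for small aggregations, but a partition that produces many distinct keys pays for repeated doublings, each of which rehashes every live entry, which is exactly the growTable() work dominating the thread dumps above:

```python
class AppendOnlyMap:
    """Toy sketch of org.apache.spark.util.collection.AppendOnlyMap's
    growth behavior. Keys and values live in one flat array
    [k0, v0, k1, v1, ...]; the table starts at capacity 64 and doubles
    whenever the load factor is exceeded."""

    LOAD_FACTOR = 0.7  # illustrative threshold, not Spark's exact value

    def __init__(self, initial_capacity=64):
        self.capacity = initial_capacity
        self.data = [None] * (2 * self.capacity)
        self.size = 0
        self.grow_count = 0   # number of rehashes, for illustration

    def change_value(self, key, update):
        """Insert or update: new value = update(had_value, old_value),
        mirroring the changeValue(key, Function2) API in the dump."""
        pos = hash(key) % self.capacity
        delta = 1
        while True:
            cur = self.data[2 * pos]
            if cur is None:
                self.data[2 * pos] = key
                self.data[2 * pos + 1] = update(False, None)
                self.size += 1
                if self.size > self.LOAD_FACTOR * self.capacity:
                    self._grow_table()
                return
            if cur == key:
                self.data[2 * pos + 1] = update(True, self.data[2 * pos + 1])
                return
            pos = (pos + delta) % self.capacity   # triangular probing
            delta += 1

    def _grow_table(self):
        # The cost the jstack output is showing: every growth allocates
        # a table twice as large and rehashes all live entries.
        old, old_cap = self.data, self.capacity
        self.capacity *= 2
        self.data = [None] * (2 * self.capacity)
        self.size = 0
        self.grow_count += 1
        for i in range(old_cap):
            k = old[2 * i]
            if k is not None:
                v = old[2 * i + 1]
                self.change_value(k, lambda had, _old, v=v: v)
```

This also illustrates Patrick's point: an aggregation with a map-side combine (reduceByKey-style) keeps one entry per distinct key, so the map stays small, while a groupBy that does not condense the data forces the map through many doublings.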
Re: driver fail-over in Spark streaming 1.2.0
It will create and connect to new executors. The executors are mostly stateless, so the program can resume with new executors. On Wed, Feb 11, 2015 at 11:24 PM, lin kurtt@gmail.com wrote: Hi all, In Spark Streaming 1.2.0, when the driver fails and a new driver starts with the most recently checkpointed data, will the former executors connect to the new driver, or will the new driver start its own set of new executors? In which piece of code is that done? Any reply will be appreciated :) regards, lin - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: How to track issues that must wait for Spark 2.x in JIRA?
Yeah, my preference is also a more open-ended 2+ version for issues that are clearly desirable but blocked by compatibility concerns. What I would really want to avoid is major feature proposals sitting around in our JIRA tagged under some 2.X version. IMO JIRA isn't the place for thoughts about very-long-term things. When we get these, I'd be inclined to close them as Won't Fix or Later. On Thu, Feb 12, 2015 at 12:47 AM, Reynold Xin r...@databricks.com wrote: It seems to me having a version that is 2+ is good for that? Once we move to 2.0, we can retag those that are not going to be fixed in 2.0 as 2.0.1 or 2.1.0. On Thu, Feb 12, 2015 at 12:42 AM, Sean Owen so...@cloudera.com wrote: Patrick and I were chatting about how to handle several issues which clearly need a fix, and are easy, but can't be implemented until the next major release like Spark 2.x, since they would change APIs. Examples: https://issues.apache.org/jira/browse/SPARK-3266 https://issues.apache.org/jira/browse/SPARK-3369 https://issues.apache.org/jira/browse/SPARK-4819 We could simply make a version 2.0.0 in JIRA. Although straightforward, it might imply that release planning has begun for 2.0.0. The version could be called 2+ for now to better indicate its status. There is also a Later JIRA resolution. Although resolving the above seems a little wrong, it might be reasonable if we're sure to revisit Later, well, at some well-defined later. The three issues above risk getting lost in the shuffle. We also wondered whether using Later is good or bad. It takes items off the radar that aren't going to be acted on anytime soon -- and there are lots of those right now. But it might send a message that these will be revisited, when they are even less likely to be if resolved. Any opinions? 
- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[ANNOUNCE] Spark 1.3.0 Snapshot 1
Hey All, I've posted Spark 1.3.0 snapshot 1. At this point the 1.3 branch is ready for community testing and we are strictly merging fixes and documentation across all components. The release files, including signatures, digests, etc., can be found at: http://people.apache.org/~pwendell/spark-1.3.0-snapshot1/ The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1068/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.3.0-snapshot1-docs/ Please report any issues with the release to this thread and/or to our project JIRA. Thanks! - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[jira] [Updated] (SPARK-5606) Support plus sign in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5606: --- Assignee: Yadong Qi Support plus sign in HiveContext Key: SPARK-5606 URL: https://issues.apache.org/jira/browse/SPARK-5606 Project: Spark Issue Type: Bug Components: SQL Reporter: Yadong Qi Assignee: Yadong Qi Fix For: 1.3.0 Currently Spark only supports ```SELECT -key FROM DECIMAL_UDF;``` in HiveContext. This patch adds support for ```SELECT +key FROM DECIMAL_UDF;``` as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: feeding DataFrames into predictive algorithms
I think there is a minor error here in that the first example needs a tail after the seq: df.map { row => (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double])) }.toDataFrame("label", "features") On Wed, Feb 11, 2015 at 7:46 PM, Michael Armbrust mich...@databricks.com wrote: It sounds like you probably want to do a standard Spark map that results in a tuple with the structure you are looking for. You can then just assign names to turn it back into a DataFrame. Assuming the first column is your label and the rest are features, you can do something like this: val df = sc.parallelize( (1.0, 2.3, 2.4) :: (1.2, 3.4, 1.2) :: (1.2, 2.3, 1.2) :: Nil).toDataFrame("a", "b", "c") df.map { row => (row.getDouble(0), row.toSeq.map(_.asInstanceOf[Double])) }.toDataFrame("label", "features") df: org.apache.spark.sql.DataFrame = [label: double, features: array<double>] If you'd prefer to stick closer to SQL you can define a UDF: val createArray = udf((a: Double, b: Double) => Seq(a, b)) df.select('a as 'label, createArray('b, 'c) as 'features) df: org.apache.spark.sql.DataFrame = [label: double, features: array<double>] We'll add createArray as a first-class member of the DSL. Michael On Wed, Feb 11, 2015 at 6:37 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hey All, I've been playing around with the new DataFrame and ML pipelines APIs and am having trouble accomplishing what seems like it should be a fairly basic task. I have a DataFrame where each column is a Double. I'd like to turn this into a DataFrame with a features column and a label column that I can feed into a regression. So far all the paths I've gone down have led me to internal APIs or convoluted casting in and out of RDD[Row] and DataFrame. Is there a simple way of accomplishing this? any assistance (lookin' at you Xiangrui) much appreciated, Sandy - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
[jira] [Updated] (SPARK-5656) NegativeArraySizeException in EigenValueDecomposition.symmetricEigs for large n and/or large k
[ https://issues.apache.org/jira/browse/SPARK-5656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5656: --- Assignee: Mark Bittmann NegativeArraySizeException in EigenValueDecomposition.symmetricEigs for large n and/or large k -- Key: SPARK-5656 URL: https://issues.apache.org/jira/browse/SPARK-5656 Project: Spark Issue Type: Bug Components: MLlib Reporter: Mark Bittmann Assignee: Mark Bittmann Priority: Minor Fix For: 1.4.0 Large values of n or k in EigenValueDecomposition.symmetricEigs will fail with a NegativeArraySizeException. Specifically, this occurs when 2*n*k > Integer.MAX_VALUE. These values are currently unchecked and allow for the array to be initialized to a value greater than Integer.MAX_VALUE. I have written the below 'require' to fail this condition gracefully. I will submit a pull request. require(ncv * n.toLong < Integer.MAX_VALUE, "Product of 2*k*n must be smaller than " + s"Integer.MAX_VALUE. Found required eigenvalues k = $k and matrix dimension n = $n") Here is the exception that occurs from computeSVD with large k and/or n: Exception in thread "main" java.lang.NegativeArraySizeException at org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:85) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:258) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:190) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
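The overflow that the proposed require() guards against can be reproduced with plain JVM integer arithmetic; a self-contained sketch with hypothetical values of n and k (any pair with 2*k*n greater than Integer.MAX_VALUE behaves the same):

```scala
// Sketch of the Int overflow behind the NegativeArraySizeException.
// n and k are hypothetical values, chosen so that 2*k*n exceeds Integer.MAX_VALUE.
val n = 5000000                 // matrix dimension
val k = 300                     // requested eigenvalues
val ncv = math.min(2 * k, n)    // working-subspace size, following the 2*k*n wording above
val intProduct = ncv * n        // Int multiplication wraps around silently
assert(intProduct < 0)          // new Array[Double](intProduct) throws NegativeArraySizeException
val longProduct = ncv * n.toLong
assert(longProduct > Integer.MAX_VALUE) // promoting to Long first lets require() reject it gracefully
```

The fix is the standard one for JVM size checks: widen one operand to Long before multiplying, then compare against Integer.MAX_VALUE.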
[jira] [Updated] (SPARK-5366) check for mode of private key file
[ https://issues.apache.org/jira/browse/SPARK-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5366: --- Assignee: liu chang check for mode of private key file -- Key: SPARK-5366 URL: https://issues.apache.org/jira/browse/SPARK-5366 Project: Spark Issue Type: Improvement Components: EC2 Reporter: liu chang Assignee: liu chang Priority: Minor Fix For: 1.4.0 check the mode for the private key. User should set it to 600. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5611) Allow spark-ec2 repo to be specified in CLI of spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5611: --- Assignee: Florian Verhein Allow spark-ec2 repo to be specified in CLI of spark_ec2.py --- Key: SPARK-5611 URL: https://issues.apache.org/jira/browse/SPARK-5611 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Assignee: Florian Verhein Priority: Minor Fix For: 1.4.0 Allows repo and branch of the desired spark-ec2 (and by extension the ami-list) to be specified on the command line. Helps when trying out different branches or forks of spark-ec2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5648) support alter ... unset tblproperties(key)
[ https://issues.apache.org/jira/browse/SPARK-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5648: --- Assignee: DoingDone9 support alter ... unset tblproperties(key) --- Key: SPARK-5648 URL: https://issues.apache.org/jira/browse/SPARK-5648 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: DoingDone9 Assignee: DoingDone9 Fix For: 1.3.0 Make HiveContext support unset tblproperties(key), e.g.: alter view viewName unset tblproperties(k) alter table tableName unset tblproperties(k) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5568) Python API for the write support of the data source API
[ https://issues.apache.org/jira/browse/SPARK-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5568: --- Assignee: Yin Huai Python API for the write support of the data source API --- Key: SPARK-5568 URL: https://issues.apache.org/jira/browse/SPARK-5568 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5733) Error Link in Pagination of HistroyPage when showing Incomplete Applications
[ https://issues.apache.org/jira/browse/SPARK-5733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5733: --- Assignee: Liangliang Gu Error Link in Pagination of HistroyPage when showing Incomplete Applications - Key: SPARK-5733 URL: https://issues.apache.org/jira/browse/SPARK-5733 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.1 Reporter: Liangliang Gu Assignee: Liangliang Gu Priority: Minor Fix For: 1.3.0 The links in the pagination of HistoryPage are wrong when showing Incomplete Applications. If "2" is clicked on the page "http://history-server:18080/?page=1&showIncomplete=true", it will go to "http://history-server:18080/?page=2" instead of "http://history-server:18080/?page=2&showIncomplete=true". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5658) Finalize DDL and write support APIs
[ https://issues.apache.org/jira/browse/SPARK-5658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5658: --- Assignee: Yin Huai Finalize DDL and write support APIs --- Key: SPARK-5658 URL: https://issues.apache.org/jira/browse/SPARK-5658 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5704) createDataFrame replace applySchema/inferSchema
[ https://issues.apache.org/jira/browse/SPARK-5704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5704: --- Assignee: Davies Liu createDataFrame replace applySchema/inferSchema --- Key: SPARK-5704 URL: https://issues.apache.org/jira/browse/SPARK-5704 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5709) Add EXPLAIN support for DataFrame API for debugging purpose
[ https://issues.apache.org/jira/browse/SPARK-5709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5709: --- Assignee: Cheng Hao Add EXPLAIN support for DataFrame API for debugging purpose - Key: SPARK-5709 URL: https://issues.apache.org/jira/browse/SPARK-5709 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5683) Improve the json serialization for DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-5683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5683: --- Assignee: Cheng Hao Improve the json serialization for DataFrame API Key: SPARK-5683 URL: https://issues.apache.org/jira/browse/SPARK-5683 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5509) EqualTo operator doesn't handle binary type properly
[ https://issues.apache.org/jira/browse/SPARK-5509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5509: --- Assignee: Cheng Lian EqualTo operator doesn't handle binary type properly Key: SPARK-5509 URL: https://issues.apache.org/jira/browse/SPARK-5509 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.3.0, 1.2.1 Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.3.0 Binary type is mapped to {{Array\[Byte\]}}, which can't be compared with {{==}} directly. However, {{EqualTo.eval()}} uses plain {{==}} to compare values. Run the following {{spark-shell}} snippet with Spark 1.2.0 to reproduce this issue: {code}
import org.apache.spark.sql.SQLContext
import sc._
val sqlContext = new SQLContext(sc)
import sqlContext._
case class KV(key: Int, value: Array[Byte])
def toBinary(s: String): Array[Byte] = s.toString.getBytes("UTF-8")
registerFunction("toBinary", toBinary _)
parallelize(1 to 1024).map(i => KV(i, toBinary(i.toString))).registerTempTable("bin")
// OK
sql("select * from bin where value < toBinary('100')").collect()
// Oops, returns nothing
sql("select * from bin where value = toBinary('100')").collect()
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
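The root cause is ordinary JVM semantics rather than anything SQL-specific: `==` on `Array[Byte]` compares object identity, not contents. A standalone sketch:

```scala
// Two byte arrays with identical contents are still distinct objects on the JVM,
// so identity-style equality (what plain == gives for arrays) fails.
val a = "100".getBytes("UTF-8")
val b = "100".getBytes("UTF-8")
assert(a != b)                         // == / != on arrays is reference equality
assert(a sameElements b)               // element-wise comparison succeeds
assert(java.util.Arrays.equals(a, b))  // equivalent Java utility method
```

A fix along the lines the issue implies would special-case binary values to an element-wise comparison such as `java.util.Arrays.equals` instead of plain `==`.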
[jira] [Updated] (SPARK-5135) Add support for describe [extended] table to DDL in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5135: --- Assignee: Li Sheng Add support for describe [extended] table to DDL in SQLContext -- Key: SPARK-5135 URL: https://issues.apache.org/jira/browse/SPARK-5135 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.3.0 Reporter: Li Sheng Assignee: Li Sheng Priority: Minor Fix For: 1.3.0 Original Estimate: 72h Remaining Estimate: 72h Support the Describe Table command: describe [extended] tableName. This also supports external datasource tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5380) There will be an ArrayIndexOutOfBoundsException if the format of the source file is wrong
[ https://issues.apache.org/jira/browse/SPARK-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5380: --- Assignee: Leo_lh There will be an ArrayIndexOutOfBoundsException if the format of the source file is wrong - Key: SPARK-5380 URL: https://issues.apache.org/jira/browse/SPARK-5380 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Reporter: Leo_lh Assignee: Leo_lh Priority: Minor Fix For: 1.3.0, 1.4.0 When I build a graph with a file format error, there will be an ArrayIndexOutOfBoundsException -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5528) Support schema merging while reading Parquet files
[ https://issues.apache.org/jira/browse/SPARK-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5528: --- Assignee: Cheng Lian Support schema merging while reading Parquet files -- Key: SPARK-5528 URL: https://issues.apache.org/jira/browse/SPARK-5528 Project: Spark Issue Type: Improvement Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.3.0 Spark 1.2.0 and prior versions only read the Parquet schema from {{_metadata}} or a random Parquet part-file, and assume all part-files share exactly the same schema. In practice, it's common that users append new columns to an existing Parquet schema. Parquet has native schema merging support for such scenarios. Spark SQL should also support this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5640) org.apache.spark.sql.catalyst.ScalaReflection is not thread safe
[ https://issues.apache.org/jira/browse/SPARK-5640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5640: --- Assignee: Tobias Schlatter org.apache.spark.sql.catalyst.ScalaReflection is not thread safe Key: SPARK-5640 URL: https://issues.apache.org/jira/browse/SPARK-5640 Project: Spark Issue Type: Bug Reporter: Tobias Schlatter Assignee: Tobias Schlatter Fix For: 1.3.0 ScalaReflection uses the Scala reflection API but does not synchronize (for example in the {{schemaFor}} method). This leads to concurrency bugs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5619) Support 'show roles' in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-5619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5619: --- Assignee: Yadong Qi Support 'show roles' in HiveContext --- Key: SPARK-5619 URL: https://issues.apache.org/jira/browse/SPARK-5619 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yadong Qi Assignee: Yadong Qi Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5686) Support `show current roles`
[ https://issues.apache.org/jira/browse/SPARK-5686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5686: --- Assignee: Li Sheng Support `show current roles` Key: SPARK-5686 URL: https://issues.apache.org/jira/browse/SPARK-5686 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Li Sheng Assignee: Li Sheng Priority: Minor Fix For: 1.3.0 Original Estimate: 3h Remaining Estimate: 3h show current roles -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5668) spark_ec2.py region parameter could be either mandatory or its value displayed
[ https://issues.apache.org/jira/browse/SPARK-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5668: --- Assignee: Miguel Peralvo spark_ec2.py region parameter could be either mandatory or its value displayed -- Key: SPARK-5668 URL: https://issues.apache.org/jira/browse/SPARK-5668 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Miguel Peralvo Assignee: Miguel Peralvo Priority: Minor Labels: starter Fix For: 1.4.0 If the region parameter is not specified when invoking spark-ec2 (spark-ec2.py behind the scenes), it defaults to us-east-1. When the cluster doesn't belong to that region, after showing the "Searching for existing cluster Spark..." message, it causes an "ERROR: Could not find any existing cluster" exception because it doesn't find your cluster in the default region. As it doesn't tell you anything about the region, it can be a small headache for new users. In http://stackoverflow.com/questions/21171576/why-does-spark-ec2-fail-with-error-could-not-find-any-existing-cluster, Dmitriy Selivanov explains it. I propose that: 1. Either we make the search message a little more informative, with something like "Searching for existing cluster Spark in region " + opts.region. 2. Or we remove us-east-1 as the default and make the --region parameter mandatory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5716) Support TOK_CHARSETLITERAL in HiveQl
[ https://issues.apache.org/jira/browse/SPARK-5716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5716: --- Assignee: Adrian Wang Support TOK_CHARSETLITERAL in HiveQl Key: SPARK-5716 URL: https://issues.apache.org/jira/browse/SPARK-5716 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Fix For: 1.3.0 where value = _UTF8 0x12345678 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5667) Remove version from spark-ec2 example.
[ https://issues.apache.org/jira/browse/SPARK-5667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5667: --- Assignee: Miguel Peralvo Remove version from spark-ec2 example. -- Key: SPARK-5667 URL: https://issues.apache.org/jira/browse/SPARK-5667 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.2.0, 1.3.0, 1.2.1, 1.2.2 Reporter: Miguel Peralvo Assignee: Miguel Peralvo Priority: Trivial Labels: documentation Fix For: 1.3.0 Remove version from spark-ec2 example for spark-ec2/Launch Cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5595) In memory data cache should be invalidated after insert into/overwrite
[ https://issues.apache.org/jira/browse/SPARK-5595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5595: --- Assignee: Yin Huai In memory data cache should be invalidated after insert into/overwrite -- Key: SPARK-5595 URL: https://issues.apache.org/jira/browse/SPARK-5595 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5278) check ambiguous reference to fields in Spark SQL is incompleted
[ https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5278: --- Assignee: Wenchen Fan check ambiguous reference to fields in Spark SQL is incompleted --- Key: SPARK-5278 URL: https://issues.apache.org/jira/browse/SPARK-5278 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan Fix For: 1.3.0 In HiveContext, for a JSON string like {code}{"a": {"b": 1, "B": 2}}{code} the SQL `SELECT a.b from t` will report an error for an ambiguous reference to fields. But for a JSON string like {code}{"a": [{"b": 1, "B": 2}]}{code} the SQL `SELECT a[0].b from t` will pass and pick the first `b`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5324) Results of describe can't be queried
[ https://issues.apache.org/jira/browse/SPARK-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5324: --- Assignee: Li Sheng Results of describe can't be queried Key: SPARK-5324 URL: https://issues.apache.org/jira/browse/SPARK-5324 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Michael Armbrust Assignee: Li Sheng Fix For: 1.3.0 {code} sql(DESCRIBE TABLE test).registerTempTable(describeTest) sql(SELECT * FROM describeTest).collect() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5603) Preinsert casting and renaming rule is needed in the Analyzer
[ https://issues.apache.org/jira/browse/SPARK-5603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5603: --- Assignee: Yin Huai Preinsert casting and renaming rule is needed in the Analyzer - Key: SPARK-5603 URL: https://issues.apache.org/jira/browse/SPARK-5603 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.0 For an INSERT INTO/OVERWRITE statement, we should add necessary Cast and Alias to the output of the query. {code} CREATE TEMPORARY TABLE jsonTable (a int, b string) USING org.apache.spark.sql.json.DefaultSource OPTIONS ( path '...' ) INSERT OVERWRITE TABLE jsonTable SELECT a * 2, a * 4 FROM table {code} For a*2, we should create an Alias, so the InsertableRelation can know it is the column a. For a*4, it is actually the column b in jsonTable. We should first cast it to StringType and add an Alias b to it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5650) Optional 'FROM' clause in HiveQl
[ https://issues.apache.org/jira/browse/SPARK-5650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5650: --- Assignee: Liang-Chi Hsieh Optional 'FROM' clause in HiveQl Key: SPARK-5650 URL: https://issues.apache.org/jira/browse/SPARK-5650 Project: Spark Issue Type: Bug Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 In Hive, 'FROM' clause should be optional. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5679) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method
[ https://issues.apache.org/jira/browse/SPARK-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5679: --- Priority: Major (was: Blocker) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method -- Key: SPARK-5679 URL: https://issues.apache.org/jira/browse/SPARK-5679 Project: Spark Issue Type: Bug Components: Spark Core, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Kostas Sakellis Labels: flaky-test Please audit these and see if there are any assumptions with respect to File IO that might not hold in all cases. I'm happy to help if you can't find anything. These both failed in the same run: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-SBT/38/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink {code} org.apache.spark.metrics.InputOutputMetricsSuite.input metrics with mixed read method Failing for the past 13 builds (Since Failed#26 ) Took 48 sec. 
Error Message 2030 did not equal 6496 Stacktrace sbt.ForkMain$ForkError: 2030 did not equal 6496 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply$mcV$sp(InputOutputMetricsSuite.scala:135) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfter$$super$runTest(InputOutputMetricsSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.metrics.InputOutputMetricsSuite.runTest(InputOutputMetricsSuite.scala:46) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfterAll$$super$run(InputOutputMetricsSuite.scala:46) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfter
[jira] [Created] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
Patrick Wendell created SPARK-5731: -- Summary: Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming Reporter: Patrick Wendell Assignee: Tathagata Das {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds. Last failure message: 300 did not equal 48 didn't get all messages. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfterAll$$super$run(DirectKafkaStreamSuite.scala:38) at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.run(DirectKafkaStreamSuite.scala:38) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) at sbt.ForkMain$Run$2.call
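For context, the failing assertion sits inside ScalaTest's `Eventually.eventually`, which re-runs the block until it passes or the patience timeout expires — the "Attempted 110 times over 20.070287525 seconds" in the error is this retry loop giving up. A minimal sketch of that retry pattern (the counter here is illustrative, not from the Kafka suite):

```scala
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.{Millis, Seconds, Span}

// Retry the enclosed assertion for up to 20 seconds, re-checking every
// 100 ms. The Kafka suite uses the same pattern while draining messages.
var attempts = 0
eventually(timeout(Span(20, Seconds)), interval(Span(100, Millis))) {
  attempts += 1
  assert(attempts >= 5) // fails on early attempts, passes after retries
}
```

When the condition never becomes true within the timeout — here, the stream never delivering all expected messages — `eventually` rethrows the last assertion failure, which is what the stack trace above records.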
[jira] [Updated] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
[ https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5731: --- Affects Version/s: 1.3.0 Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds. Last failure message: 300 did not equal 48 didn't get all messages. {code}
[jira] [Updated] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
[ https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5731: --- Component/s: Tests Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds. Last failure message: 300 did not equal 48 didn't get all messages. {code}
[jira] [Updated] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5493: --- Assignee: Marcelo Vanzin Support proxy users under kerberos -- Key: SPARK-5493 URL: https://issues.apache.org/jira/browse/SPARK-5493 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Brock Noland Assignee: Marcelo Vanzin Fix For: 1.3.0 When using kerberos, services may want to use spark-submit to submit jobs as a separate user. For example a service like oozie might want to submit jobs as a client user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5493. Resolution: Fixed Fix Version/s: 1.3.0 Target Version/s: 1.3.0 Support proxy users under kerberos -- Key: SPARK-5493 URL: https://issues.apache.org/jira/browse/SPARK-5493 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Brock Noland Assignee: Marcelo Vanzin Fix For: 1.3.0 When using kerberos, services may want to use spark-submit to submit jobs as a separate user. For example a service like oozie might want to submit jobs as a client user.
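The feature builds on Hadoop's standard impersonation mechanism, in which a Kerberos-authenticated service user acts on behalf of a client user. A minimal sketch of that underlying mechanism (the "alice" user name is illustrative; the real user must be whitelisted via `hadoop.proxyuser.*` settings in core-site.xml, and this is not Spark's exact implementation):

```scala
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// The logged-in (kerberos-authenticated) service principal, e.g. oozie.
val realUser  = UserGroupInformation.getLoginUser
// A proxy UGI that acts as "alice" but authenticates as the real user.
val proxyUser = UserGroupInformation.createProxyUser("alice", realUser)

proxyUser.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // Work performed here (e.g. launching the application) runs as "alice".
  }
})
```

The user-facing side of this change is a proxy-user option on spark-submit, so a service such as Oozie can submit as itself while the job executes as the client user.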
Re: Powered by Spark: Concur
Thanks Paolo - I've fixed it. On Mon, Feb 9, 2015 at 11:10 PM, Paolo Platter paolo.plat...@agilelab.it wrote: Hi, I checked the powered by wiki too and Agile Labs should be Agile Lab. The link is wrong too, it should be www.agilelab.it. The description is correct. Thanks a lot Paolo Sent from my Windows Phone From: Denny Lee mailto:denny.g@gmail.com Sent: 10/02/2015 07:41 To: Matei Zaharia mailto:matei.zaha...@gmail.com Cc: dev@spark.apache.org mailto:dev@spark.apache.org Subject: Re: Powered by Spark: Concur Thanks Matei - much appreciated! On Mon Feb 09 2015 at 10:23:57 PM Matei Zaharia matei.zaha...@gmail.com wrote: Thanks Denny; added you. Matei On Feb 9, 2015, at 10:11 PM, Denny Lee denny.g@gmail.com wrote: Forgot to add Concur to the Powered by Spark wiki: Concur https://www.concur.com Spark SQL, MLLib Using Spark for travel and expenses analytics and personalization Thanks! Denny - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[jira] [Commented] (SPARK-5613) YarnClientSchedulerBackend fails to get application report when yarn restarts
[ https://issues.apache.org/jira/browse/SPARK-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14314821#comment-14314821 ] Patrick Wendell commented on SPARK-5613: I have cherry picked it into the 1.3 branch. YarnClientSchedulerBackend fails to get application report when yarn restarts - Key: SPARK-5613 URL: https://issues.apache.org/jira/browse/SPARK-5613 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Kashish Jain Assignee: Kashish Jain Priority: Minor Fix For: 1.3.0, 1.2.2 Original Estimate: 24h Remaining Estimate: 24h Steps to Reproduce 1) Run any spark job 2) Stop yarn while the spark job is running (an application id has been generated by now) 3) Restart yarn now 4) AsyncMonitorApplication thread fails due to ApplicationNotFoundException exception. This leads to termination of thread. Here is the StackTrace 15/02/05 05:22:37 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 15/02/05 05:22:38 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 15/02/05 05:22:39 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 15/02/05 05:22:40 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 5/02/05 05:22:40 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. 
Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) Exception in thread Yarn application state monitor org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1423113179043_0003' doesn't exist in RM. at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Unknown Source) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) at java.lang.reflect.Constructor.newInstance(Unknown Source) at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:166) at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at 
java.lang.reflect.Method.invoke(Unknown Source) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) at com.sun.proxy.$Proxy12.getApplicationReport(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:291) at org.apache.spark.deploy.yarn.Client.getApplicationReport(Client.scala:116) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:120) Caused by: org.apache.hadoop.ipc.RemoteException
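The crash happens because the monitor thread lets `ApplicationNotFoundException` propagate and kill itself when the restarted ResourceManager no longer knows the application id. A hedged sketch of the kind of handling the fix calls for (`fetchState` and `onLost` are hypothetical stand-ins, not the real `YarnClientSchedulerBackend` members):

```scala
import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException

// Illustrative monitor loop that survives a YARN restart: instead of the
// exception terminating the thread, "application not found" is treated as
// the application being gone, and the backend shuts down cleanly.
def monitor(fetchState: () => String, onLost: () => Unit): Unit = {
  try {
    var state = fetchState()            // may throw after an RM restart
    while (state != "FINISHED") {
      Thread.sleep(1000)
      state = fetchState()
    }
  } catch {
    case _: ApplicationNotFoundException =>
      onLost()  // restarted ResourceManager has no record of this app id
  }
}
```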
[jira] [Updated] (SPARK-4382) Add locations parameter to Twitter Stream
[ https://issues.apache.org/jira/browse/SPARK-4382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4382: --- Component/s: Streaming Add locations parameter to Twitter Stream - Key: SPARK-4382 URL: https://issues.apache.org/jira/browse/SPARK-4382 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Liang-Chi Hsieh When we request a Tweet stream, geo-location is one of the most important parameters. In addition to the track parameter, the locations parameter is widely used to ask for the Tweets falling within the requested bounding boxes. This PR adds the locations parameter to existing APIs.
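Under the hood this maps onto the twitter4j streaming filter, where `locations` takes pairs of (longitude, latitude) corners — the south-west corner first, then the north-east — with each pair of points defining one bounding box. A sketch of that underlying call (the coordinates, roughly the San Francisco area, are illustrative):

```scala
import twitter4j.FilterQuery

// Filter the stream by keyword and by a single bounding box:
// (-122.75, 36.8) is the SW corner, (-121.75, 37.8) the NE corner.
val query = new FilterQuery()
  .track("spark")
  .locations(Array(Array(-122.75, 36.8), Array(-121.75, 37.8)))
```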
[jira] [Created] (SPARK-5735) Replace uses of EasyMock with Mockito
Patrick Wendell created SPARK-5735: -- Summary: Replace uses of EasyMock with Mockito Key: SPARK-5735 URL: https://issues.apache.org/jira/browse/SPARK-5735 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell There are a few reasons we should drop EasyMock. First, we should have a single mocking framework in our tests in general to keep things consistent. Second, EasyMock has caused us some dependency pain in our tests due to objenesis. We aren't totally sure, but suspect such conflicts might be causing non-deterministic test failures.
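For reference, the Mockito stub-then-verify style the migrated tests would use looks roughly like this (the `BlockFetcher` trait is a hypothetical example, not taken from the Spark test suite):

```scala
import org.mockito.Mockito._

// A hypothetical collaborator to mock.
trait BlockFetcher {
  def fetch(blockId: String): Array[Byte]
}

// Stub the behavior, exercise it, then verify the interaction.
val fetcher = mock(classOf[BlockFetcher])
when(fetcher.fetch("block-1")).thenReturn(Array[Byte](1, 2, 3))

assert(fetcher.fetch("block-1").sameElements(Array[Byte](1, 2, 3)))
verify(fetcher).fetch("block-1")
```

Unlike EasyMock's record/replay model, Mockito stubs and verifies without an explicit replay phase, which keeps test bodies shorter and is one reason consolidating on a single framework simplifies the suite.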