[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291818#comment-14291818 ]

Robert Stupp commented on SPARK-2389:
-------------------------------------

[~srowen] yes, the problem is that drivers cannot share RDDs. IMHO there are a lot of valid scenarios that can benefit from multiple drivers using shared RDDs.

globally shared SparkContext / shared Spark application
-------------------------------------------------------

    Key: SPARK-2389
    URL: https://issues.apache.org/jira/browse/SPARK-2389
    Project: Spark
    Issue Type: Improvement
    Components: Spark Core
    Reporter: Robert Stupp

The documentation (in Cluster Mode Overview) cites:

bq. Each application gets its own executor processes, which *stay up for the duration of the whole application* and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that *data cannot be shared* across different Spark applications (instances of SparkContext) without writing it to an external storage system.

IMO this is a limitation that should be lifted to support any number of -driver- client processes sharing executors and sharing (persistent / cached) data.

This is especially useful if you have a bunch of frontend servers (dumb web app servers) that want to use Spark as a _big computing machine_. Most important is the fact that Spark is quite good at caching/persisting data in memory / on disk, thus removing load from backend data stores.

Means: it would be really great to let different -driver- client JVMs operate on the same RDDs and benefit from Spark's caching/persistence.

It would however require some administration mechanisms to
* start a shared context
* update the executor configuration (# of worker nodes, # of CPUs, etc.) on the fly
* stop a shared context

Even conventional batch MR applications would benefit if run frequently against the same data set. As an implicit requirement, RDD persistence could get a TTL for its materialized state.

With such a feature the overall performance of today's web applications could then be increased by adding more web app servers, more Spark nodes, more NoSQL nodes, etc.
[jira] [Closed] (SPARK-5303) applySchema returns NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mauro Pirrone closed SPARK-5303.
--------------------------------
    Resolution: Not a Problem

applySchema returns NullPointerException
----------------------------------------

    Key: SPARK-5303
    URL: https://issues.apache.org/jira/browse/SPARK-5303
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Affects Versions: 1.2.0
    Reporter: Mauro Pirrone

The following code snippet returns NullPointerException:

val result = ...
val rows = result.take(10)
val rowRdd = SparkManager.getContext().parallelize(rows, 1)
val schemaRdd = SparkManager.getSQLContext().applySchema(rowRdd, result.schema)

java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:147)
    at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
    at scala.util.hashing.MurmurHash3.listHash(MurmurHash3.scala:168)
    at scala.util.hashing.MurmurHash3$.seqHash(MurmurHash3.scala:216)
    at scala.collection.LinearSeqLike$class.hashCode(LinearSeqLike.scala:53)
    at scala.collection.immutable.List.hashCode(List.scala:84)
    at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
    at scala.util.hashing.MurmurHash3.productHash(MurmurHash3.scala:63)
    at scala.util.hashing.MurmurHash3$.productHash(MurmurHash3.scala:210)
    at scala.runtime.ScalaRunTime$._hashCode(ScalaRunTime.scala:172)
    at org.apache.spark.sql.execution.LogicalRDD.hashCode(ExistingRDD.scala:58)
    at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
    at scala.collection.mutable.HashTable$HashUtils$class.elemHashCode(HashTable.scala:398)
    at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:39)
    at scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:130)
    at scala.collection.mutable.HashMap.findEntry(HashMap.scala:39)
    at scala.collection.mutable.HashMap.get(HashMap.scala:69)
    at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:187)
    at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
    at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:329)
    at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327)
    at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105)
    at org.apache.spark.sql.catalyst.analysis.NewRelationInstances$.apply(MultiInstanceRelation.scala:44)
    at org.apache.spark.sql.catalyst.analysis.NewRelationInstances$.apply(MultiInstanceRelation.scala:40)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
    at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
    at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
    at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
    at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
    at org.apache.spark.sql.SchemaRDD.schema$lzycompute(SchemaRDD.scala:135)
    at org.apache.spark.sql.SchemaRDD.schema(SchemaRDD.scala:135)
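An untested sketch of one way around this, assuming (it is only a guess, given the "Not a Problem" resolution) that the NPE stems from handing applySchema a schema whose attributes are still tied to the old query plan: rebuild the StructType from plain field metadata first. Imports follow the Spark 1.2 SQL programming guide.

{code}
import org.apache.spark.sql._

// Copy (name, type, nullable) into a freshly constructed StructType
// instead of passing result.schema through directly.
val freshSchema = StructType(result.schema.fields.map(f =>
  StructField(f.name, f.dataType, f.nullable)))

val rows = result.take(10)
val rowRdd = SparkManager.getContext().parallelize(rows, 1)
val schemaRdd = SparkManager.getSQLContext().applySchema(rowRdd, freshSchema)
{code}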
[jira] [Created] (SPARK-5409) Broken link in documentation
Mauro Pirrone created SPARK-5409:
------------------------------------

    Summary: Broken link in documentation
    Key: SPARK-5409
    URL: https://issues.apache.org/jira/browse/SPARK-5409
    Project: Spark
    Issue Type: Documentation
    Reporter: Mauro Pirrone
    Priority: Minor

https://spark.apache.org/docs/1.2.0/streaming-kafka-integration.html

"See the API docs and the example." The link to the example is broken.
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291797#comment-14291797 ]

Robert Stupp commented on SPARK-2389:
-------------------------------------

bq. That aside, why doesn't it scale?

Simply because it's just a single Spark client. If that machine's at its limit for whatever reason (VM memory, OS resources, CPU, network, ...), that's it. Sure, you can run multiple drivers, but each has its own, private set of data.

IMO separate preloading is nice for some applications. But data is usually not immutable. For example:
* Imagine an application that provides offers for flights worldwide. It's a huge amount of data and a huge amount of processing. It cannot simply be preloaded: prices for tickets vary from minute to minute based on booking status etc.
* The overall data set is quite big
* The overall load is too big for a single driver to handle; imagine thousands of offer requests per second
* Failure of a single driver is an absolute no-go
* All clients have to access the same set of data
* Preloading is just impossible during runtime (just at initial deployment)

So a suitable approach would be to have:
* a Spark cluster holding all the RDDs and doing all offer- and booking-related operations
* a set of Spark clients to abstract Spark from the rest of the application
* a huge number of non-uniform frontend clients (could be web app servers, rich clients, SOAP / REST frontends)
* everything (except the data) stateless
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291795#comment-14291795 ]

Murat Eken commented on SPARK-2389:
-----------------------------------

[~sowen], I think Robert is talking about fault tolerance when he mentions scalability. Anyway, as I mentioned in my original comment, Tachyon is not an option, at least for us, due to interprocess serialization/deserialization costs. We haven't tried HDFS, but I would be surprised if that performed differently.
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291788#comment-14291788 ]

Sean Owen commented on SPARK-2389:
----------------------------------

Yes, the SPOF problem makes sense. It doesn't seem to be what this JIRA was about though, which seems to be what the jobserver-style approach addresses.

That aside, why doesn't it scale? Because of work that needs to be done on the driver? You can of course still run a bunch of drivers, just not one per client. The preloading cache issue is what off-heap caching in Tachyon or HDFS is supposed to ameliorate.
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291798#comment-14291798 ]

Robert Stupp commented on SPARK-2389:
-------------------------------------

bq. fault tolerance when he mentions scalability

Both play well together in a stateless application ;)
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291804#comment-14291804 ]

Sean Owen commented on SPARK-2389:
----------------------------------

Yes, makes sense. Maxing out one driver isn't an issue since you can have many drivers (or push work into the cluster). The issue is really that each driver then has its own RDDs, and if you need 100s of drivers to keep up, that just won't work. (Although then I'd question how so much work is being done on the Spark driver?)

The redundancy of all those RDDs is, in theory, what HDFS caching and Tachyon could help with, although those help share outside Spark. Whether that works for a particular use case right now is a different question, although I suspect it makes more sense to make those work than to start yet another solution.

What you are describing, mutating lots of shared in-memory state, doesn't sound like a problem Spark helps solve per se. That is, it doesn't sound like work that has to live in a Spark driver program, even if it needs to ask a Spark driver-based service for some results. Naturally you know your problem better than I do, but I am wondering if the answer here isn't just using Spark differently, for what it's for.
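A rough sketch of the jobserver-style setup mentioned in this thread, i.e. what is already expressible today: one long-running driver owns the SparkContext and its cached RDDs, and many frontends call into it instead of each holding their own context. All names and paths here are illustrative, not a definitive design.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// One long-running driver process owns the context and the cached RDDs;
// frontend servers reach it over whatever RPC/HTTP layer you prefer.
object SharedDriverService {
  private val sc = new SparkContext(new SparkConf().setAppName("shared-driver"))

  // Cached once; reused by every request from every frontend.
  private val offers = sc.textFile("hdfs:///data/offers").cache()

  // Example request handler a frontend would invoke remotely.
  def countMatching(keyword: String): Long =
    offers.filter(_.contains(keyword)).count()
}
{code}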
[jira] [Commented] (SPARK-5409) Broken link in documentation
[ https://issues.apache.org/jira/browse/SPARK-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291823#comment-14291823 ]

Sean Owen commented on SPARK-5409:
----------------------------------

Should just be https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala

Open a PR; this probably doesn't even need a JIRA.
[jira] [Closed] (SPARK-5407) No 1.2 AMI available for ec2
[ https://issues.apache.org/jira/browse/SPARK-5407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Håkan Jonsson closed SPARK-5407.
--------------------------------
    Resolution: Invalid

Error on my side.

No 1.2 AMI available for ec2
----------------------------

    Key: SPARK-5407
    URL: https://issues.apache.org/jira/browse/SPARK-5407
    Project: Spark
    Issue Type: Bug
    Components: EC2
    Affects Versions: 1.2.0
    Reporter: Håkan Jonsson

When I try to launch a standalone cluster on EC2 using the scripts in the ec2 directory for Spark 1.2 (./spark-ec2 -k spark -i k.pem launch my12), I get the following error:

Could not resolve AMI at: https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm

It seems there is not yet any AMI available on EC2 for Spark 1.2. This works well for Spark 1.1.
[jira] [Created] (SPARK-5410) Error parsing scientific notation in a select statement
Hugo Ferrira created SPARK-5410:
-----------------------------------

    Summary: Error parsing scientific notation in a select statement
    Key: SPARK-5410
    URL: https://issues.apache.org/jira/browse/SPARK-5410
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.2.0
    Reporter: Hugo Ferrira

I am using the Cassandra DB and am attempting a select through the Spark SQL interface:

SELECT * from key_value WHERE f2 > 2.2E10

And get the following error (no error if I remove the E10):

[info] - should be able to select a subset of applicable features *** FAILED ***
[info] java.lang.RuntimeException: [1.39] failure: ``UNION'' expected but identifier E10 found
[info]
[info] SELECT * from key_value WHERE f2 > 2.2E10
[info]                                       ^
[info] at scala.sys.package$.error(package.scala:27)
[info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
[info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
[info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
[info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
[info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info] ...
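A hedged workaround sketch, untested against 1.2 and with names taken from the report: until the parser accepts exponent literals, either spell the number out in full or route it through a string cast, both of which the parser should handle.

{code}
// Both forms avoid the bare exponent literal the 1.2 parser rejects.
val r1 = sqlContext.sql("SELECT * FROM key_value WHERE f2 > 22000000000")
val r2 = sqlContext.sql("SELECT * FROM key_value WHERE f2 > CAST('2.2E10' AS DOUBLE)")
{code}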
[jira] [Resolved] (SPARK-3852) Document spark.driver.extra* configs
[ https://issues.apache.org/jira/browse/SPARK-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-3852.
------------------------------
    Resolution: Fixed
    Fix Version/s: 1.3.0
    Assignee: Sean Owen
    Target Version/s: (was: 1.2.0)

Document spark.driver.extra* configs
------------------------------------

    Key: SPARK-3852
    URL: https://issues.apache.org/jira/browse/SPARK-3852
    Project: Spark
    Issue Type: Bug
    Components: Documentation
    Affects Versions: 1.1.0
    Reporter: Andrew Or
    Assignee: Sean Owen
    Fix For: 1.3.0

They are not documented...
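For context, these are the three settings in question; the values below are illustrative, not from the ticket. Note the caveat in the comment: they shape how the driver JVM is launched, so in practice they belong in spark-defaults.conf or on the spark-submit command line rather than in application code.

{code}
import org.apache.spark.SparkConf

// Illustrative values only. spark.driver.extraJavaOptions and friends
// affect how the driver JVM is launched, so they must be supplied via
// spark-defaults.conf or spark-submit --conf; setting them on a SparkConf
// inside an already-running driver comes too late.
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-XX:+UseConcMarkSweepGC")
  .set("spark.driver.extraClassPath", "/opt/libs/custom.jar")
  .set("spark.driver.extraLibraryPath", "/opt/native")
{code}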
[jira] [Resolved] (SPARK-4430) Apache RAT Checks fail spuriously on test files
[ https://issues.apache.org/jira/browse/SPARK-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-4430.
------------------------------
    Resolution: Fixed
    Fix Version/s: 1.3.0
    Assignee: Sean Owen

Apache RAT Checks fail spuriously on test files
-----------------------------------------------

    Key: SPARK-4430
    URL: https://issues.apache.org/jira/browse/SPARK-4430
    Project: Spark
    Issue Type: Bug
    Components: Build
    Affects Versions: 1.1.0
    Reporter: Ryan Williams
    Assignee: Sean Owen
    Fix For: 1.3.0

Several of my recent runs of {{./dev/run-tests}} have failed quickly due to Apache RAT checks, e.g.:

{code}
$ ./dev/run-tests
= Running Apache RAT checks =
Could not find Apache license headers in the following files:
 !? /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/28
 !? /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/29
 !? /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/30
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/10
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/11
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/12
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/13
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/14
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/15
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/16
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/17
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/18
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/19
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/20
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/21
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/22
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/23
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/24
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/25
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/26
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/27
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/28
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/29
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/30
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/7
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/8
 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/9
[error] Got a return code of 1 on line 114 of the run-tests script.
{code}

I think it's fair to say that these are not useful errors for {{run-tests}} to crash on. Ideally we could tell the linter which files we care about having it lint and which we don't.
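If memory serves, the RAT wrapper reads exclusion patterns from the top-level .rat-excludes file, so one stopgap along the lines the report asks for would be an entry like the following. The pattern is illustrative, and this may not be the fix that was actually merged.

{code}
# .rat-excludes (repo root): one pattern per line. Illustrative entry that
# would skip the streaming FailureSuite scratch directories above.
FailureSuite
{code}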
[jira] [Commented] (SPARK-595) Document local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292052#comment-14292052 ]

Vladimir Grigor commented on SPARK-595:
---------------------------------------

+1 for reopen

Document local-cluster mode
---------------------------

    Key: SPARK-595
    URL: https://issues.apache.org/jira/browse/SPARK-595
    Project: Spark
    Issue Type: New Feature
    Components: Documentation
    Affects Versions: 0.6.0
    Reporter: Josh Rosen
    Priority: Minor

The 'Spark Standalone Mode' guide describes how to manually launch a standalone cluster, which can be done locally for testing, but it does not mention SparkContext's `local-cluster` option. What are the differences between these approaches? Which one should I prefer for local testing? Can I still use the standalone web interface if I use 'local-cluster' mode? It would be useful to document this.
[jira] [Commented] (SPARK-5324) Results of describe can't be queried
[ https://issues.apache.org/jira/browse/SPARK-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291987#comment-14291987 ]

Yanbo Liang commented on SPARK-5324:
------------------------------------

[~marmbrus] I have opened a pull request for this issue which implements the DESCRIBE [FORMATTED] [db_name.]table_name command for SQLContext. Meanwhile, it needs to have minimal impact on the corresponding command output of HiveContext. I think other metadata commands, like show databases/tables, analyze, and explain, could also take this approach. Can you assign this to me?

Results of describe can't be queried
------------------------------------

    Key: SPARK-5324
    URL: https://issues.apache.org/jira/browse/SPARK-5324
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.2.0
    Reporter: Michael Armbrust

{code}
sql("DESCRIBE TABLE test").registerTempTable("describeTest")
sql("SELECT * FROM describeTest").collect()
{code}
[jira] [Commented] (SPARK-5324) Results of describe can't be queried
[ https://issues.apache.org/jira/browse/SPARK-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292008#comment-14292008 ]

Yanbo Liang commented on SPARK-5324:
------------------------------------

https://github.com/apache/spark/pull/4207
[jira] [Commented] (SPARK-5355) SparkConf is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292048#comment-14292048 ]

Apache Spark commented on SPARK-5355:
-------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4208

SparkConf is not thread-safe
----------------------------

    Key: SPARK-5355
    URL: https://issues.apache.org/jira/browse/SPARK-5355
    Project: Spark
    Issue Type: Bug
    Affects Versions: 1.2.0, 1.3.0
    Reporter: Davies Liu
    Assignee: Davies Liu
    Priority: Blocker
    Fix For: 1.3.0, 1.2.1

SparkConf is not thread-safe, but is accessed by many threads. getAll() could return only part of the configs if another thread is accessing it.
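A minimal sketch of the hazard and of one fix, assuming (without having read the PR) that it swaps the backing map for a concurrent one: with a plain mutable HashMap, getAll can iterate while another thread mutates, yielding a partial snapshot; a ConcurrentHashMap makes each operation safe. SafeConf is a stand-in name, not SparkConf's actual internals.

{code}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Sketch only; SafeConf stands in for SparkConf's internal settings map.
class SafeConf {
  private val settings = new ConcurrentHashMap[String, String]()

  def set(key: String, value: String): SafeConf = {
    settings.put(key, value)
    this
  }

  // Safe to call while other threads are writing: iteration over a
  // ConcurrentHashMap never observes a half-applied update.
  def getAll: Array[(String, String)] =
    settings.entrySet.asScala.map(e => (e.getKey, e.getValue)).toArray
}
{code}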
[jira] [Commented] (SPARK-794) Remove sleep() in ClusterScheduler.stop
[ https://issues.apache.org/jira/browse/SPARK-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292303#comment-14292303 ]

Brennon York commented on SPARK-794:
------------------------------------

[~joshrosen] How is this PR holding up? I haven't seen any issues on the dev board. Think we can close this JIRA ticket? Trying to help prune the JIRA tree :)

Remove sleep() in ClusterScheduler.stop
---------------------------------------

    Key: SPARK-794
    URL: https://issues.apache.org/jira/browse/SPARK-794
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Affects Versions: 0.9.0
    Reporter: Matei Zaharia
    Labels: backport-needed
    Fix For: 1.3.0

This temporary change made a while back slows down the unit tests quite a bit.
[jira] [Reopened] (SPARK-595) Document local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen reopened SPARK-595:
------------------------------

I've re-opened this issue. Folks are using the API in the wild and we're not going to break compatibility for it, so we should document it.
[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion
[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292121#comment-14292121 ]

Mark Khaitman commented on SPARK-5395:
--------------------------------------

Having the same issue in standalone deployment mode. A single spark-submitted job is spawning a ton of pyspark.daemon instances and depleting the cluster memory even though the appropriate environment variables have been set.

Large number of Python workers causing resource depletion
----------------------------------------------------------

    Key: SPARK-5395
    URL: https://issues.apache.org/jira/browse/SPARK-5395
    Project: Spark
    Issue Type: Bug
    Components: PySpark
    Affects Versions: 1.2.0
    Environment: AWS ElasticMapReduce
    Reporter: Sven Krasser

During job execution a large number of Python workers accumulates, eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G of overhead per container). In this instance, 97 pyspark.daemon processes had accumulated at the time the container was killed.

{noformat}
2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_30] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1421692415636_0052_01_30 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon
|- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon
|- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon
[...]
{noformat}

The configuration used uses 64 containers with 2 cores each.

Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c

Mailinglist discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html
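For anyone triaging this, two PySpark settings bear on how many daemon workers exist and how much memory each uses; whether they fully bound the worker count in 1.2 is unclear, so treat this as a starting point rather than a fix.

{code}
import org.apache.spark.SparkConf

// Scala-side conf for illustration; the same keys work via spark-submit --conf.
val conf = new SparkConf()
  .set("spark.python.worker.reuse", "true")   // reuse workers instead of forking fresh ones per task
  .set("spark.python.worker.memory", "512m")  // per-worker aggregation memory before spilling to disk
{code}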
[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture
[ https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292176#comment-14292176 ]

Joseph K. Bradley commented on SPARK-5400:
------------------------------------------

I agree this could be done either way: Algorithm[Model] or Model[Algorithm]. For users, exposing the model type may be easiest; a person who is new to ML and wants to do some clustering will know the name of a clustering model (KMeans, GMM) but may not want to worry about picking an optimization algorithm. So I'd vote for Model[Algorithm].

That said, internally, I agree that Algorithm[Model] would be handy for generalizing. We could do the combination by having an internal LearningState class:

{code}
class GaussianMixture {
  def setOptimizer(...)  // once we have more than 1 optimization method

  def run() = {
    val opt = new EM(new GMMLearningState(this))
    ...
  }
}

private[mllib] class GMMLearningState extends OurModelAbstraction {
  def this(gm: GaussianMixture) = this(...)
}

class EM(model: OurModelAbstraction)
{code}

Rename GaussianMixtureEM to GaussianMixture
-------------------------------------------

    Key: SPARK-5400
    URL: https://issues.apache.org/jira/browse/SPARK-5400
    Project: Spark
    Issue Type: Improvement
    Components: MLlib
    Affects Versions: 1.3.0
    Reporter: Joseph K. Bradley
    Priority: Minor

GaussianMixtureEM is following the old naming convention of including the optimization algorithm name in the class title. We should probably rename it to GaussianMixture so that it can use other optimization algorithms in the future.
[jira] [Commented] (SPARK-5162) Python yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292069#comment-14292069 ]

Vladimir Grigor commented on SPARK-5162:
----------------------------------------

I second [~jared.holmb...@orchestro.com]. [~lianhuiwang] thank you! I'm going to try your PR.

Related issue: even with this PR, there will be a problem using YARN in cluster mode on Amazon EMR. Normally one submits YARN jobs via the API or the aws command line utility, so paths to files are evaluated later at some remote host, and hence the files are not found. Currently Spark does not support non-local files. One idea would be to add support for non-local (Python) files, e.g. if a file is not local, it is downloaded and made available locally. Something similar to the Distributed Cache described at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-input-distributed-cache.html

So the following code would work:

{code}
aws emr add-steps --cluster-id j-XYWIXMD234 \
--steps Name=SparkPi,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://mybucketat.amazonaws.com/tasks/main.py,main.py,param1],ActionOnFailure=CONTINUE
{code}

What do you think? What is your way of running batch Python Spark scripts on YARN in Amazon?

Python yarn-cluster mode
------------------------

    Key: SPARK-5162
    URL: https://issues.apache.org/jira/browse/SPARK-5162
    Project: Spark
    Issue Type: New Feature
    Components: PySpark, YARN
    Reporter: Dana Klassen
    Labels: cluster, python, yarn

Running pyspark in YARN is currently limited to 'yarn-client' mode. It would be great to be able to submit python applications to the cluster and (just like java classes) have the resource manager set up an AM on any node in the cluster. Does anyone know the issues blocking this feature? I was snooping around with enabling python apps.

Removing the logic stopping python and yarn-cluster from SparkSubmit.scala:

// The following modes are not supported or applicable
(clusterManager, deployMode) match {
  ...
  case (_, CLUSTER) if args.isPython =>
    printErrorAndExit("Cluster deploy mode is currently not supported for python applications.")
  ...
}

... and submitting the application via:

HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster --num-executors 2 --py-files {{insert location of egg here}} --executor-cores 1 ../tools/canary.py

Everything looks to run alright: pythonRunner is picked up as main class, resources get set up, the YARN client gets launched, but it falls flat on its face:

2015-01-08 18:48:03,444 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: DEBUG: FAILED { {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, 1420742868009, FILE, null }, Resource {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed on src filesystem (expected 1420742868009, was 1420742869284)

and

2015-01-08 18:48:03,446 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py) transitioned from DOWNLOADING to FAILED

Tracked this down to the Apache Hadoop code (FSDownload.java line 249) related to container localization of files upon downloading. At this point I thought it would be best to raise the issue here and get input.
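To make the "download non-local files" idea concrete, a rough sketch with an illustrative helper that is not part of any PR; a real version would go through Hadoop's FileSystem API so s3:// and hdfs:// URIs work too.

{code}
import java.net.URL
import java.nio.file.{Files, Paths, StandardCopyOption}

// If the URI is already local, hand it back; otherwise fetch it into a
// working directory and return the local path for YARN localization.
def ensureLocal(uri: String, workDir: String): String =
  if (!uri.contains("://") || uri.startsWith("file:")) {
    uri
  } else {
    val name = uri.substring(uri.lastIndexOf('/') + 1)
    val target = Paths.get(workDir, name)
    val in = new URL(uri).openStream()  // http(s) only; s3 needs Hadoop's FileSystem
    try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
    target.toString
  }
{code}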
[jira] [Updated] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
[ https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-5236:
------------------------------
    Description:

{code}
15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 (TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
    at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
    at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
    at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241)
    at org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter.updateInt(ParquetConverter.scala:375)
    at org.apache.spark.sql.parquet.CatalystPrimitiveConverter.addInt(ParquetConverter.scala:434)
    at parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:237)
    at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:353)
    at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402)
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
    ... 27 more
{code}

    was: (the same stack trace, previously without {code} formatting)
[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently
[ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292448#comment-14292448 ]

Xuefu Zhang commented on SPARK-2688:
------------------------------------

Yeah. We don't need syntactic sugar, but a transformation that does just one pass over the input RDD. This has performance implications for Hive's multi-insert use cases.

Need a way to run multiple data pipeline concurrently
-----------------------------------------------------

    Key: SPARK-2688
    URL: https://issues.apache.org/jira/browse/SPARK-2688
    Project: Spark
    Issue Type: New Feature
    Components: Spark Core
    Affects Versions: 1.0.1
    Reporter: Xuefu Zhang

Suppose we want to do the following data processing:

{code}
rdd1 -> rdd2 -> rdd3
          | -> rdd4
          | -> rdd5
          \ -> rdd6
{code}

where -> represents a transformation. rdd3 to rdd6 are all derived from an intermediate rdd2. We use foreach(fn) with a dummy function to trigger the execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be recomputed. This is very inefficient. Ideally, we should be able to trigger the execution of the whole graph and reuse rdd2, but there doesn't seem to be a way of doing so. Tez already realized the importance of this (TEZ-391), so I think Spark should provide this too.

This is required for Hive to support multi-insert queries. HIVE-7292.
[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292370#comment-14292370 ]

Imran Rashid commented on SPARK-3644:
-------------------------------------

[~joshrosen] Hi Josh, I've got time to implement this now. You can assign it to me if you like (or let me know if there is something else in the works ...)

REST API for Spark application info (jobs / stages / tasks / storage info)
---------------------------------------------------------------------------

    Key: SPARK-3644
    URL: https://issues.apache.org/jira/browse/SPARK-3644
    Project: Spark
    Issue Type: Bug
    Components: Spark Core, Web UI
    Reporter: Josh Rosen

This JIRA is a forum to draft a design proposal for a REST interface for accessing information about Spark applications, such as job / stage / task / storage status.

There have been a number of proposals to serve JSON representations of the information displayed in Spark's web UI. Given that we might redesign the pages of the web UI (and possibly re-implement the UI as a client of a REST API), the API endpoints and their responses should be independent of what we choose to display on particular web UI pages / layouts.

Let's start a discussion of what a good REST API would look like from first principles. We can discuss what URLs / endpoints expose access to data, how our JSON responses will be formatted, how fields will be named, how the API will be documented and tested, etc.

Some links for inspiration:
https://developer.github.com/v3/
http://developer.netflix.com/docs/REST_API_Reference
https://helloreverb.com/developers/swagger
[jira] [Resolved] (SPARK-5339) build/mvn doesn't work because of invalid URL for maven's tgz.
[ https://issues.apache.org/jira/browse/SPARK-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell resolved SPARK-5339.
------------------------------------
    Resolution: Fixed
    Fix Version/s: 1.3.0
    Assignee: Kousuke Saruta

build/mvn doesn't work because of invalid URL for maven's tgz.
--------------------------------------------------------------

    Key: SPARK-5339
    URL: https://issues.apache.org/jira/browse/SPARK-5339
    Project: Spark
    Issue Type: Bug
    Components: Build
    Affects Versions: 1.3.0
    Reporter: Kousuke Saruta
    Assignee: Kousuke Saruta
    Priority: Blocker
    Fix For: 1.3.0

build/mvn will automatically download a tarball of Maven, but currently the URL is invalid.
[jira] [Commented] (SPARK-3789) Python bindings for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292477#comment-14292477 ]

Reynold Xin commented on SPARK-3789:
------------------------------------

Unfortunately this is not going to make it into 1.3, given the code freeze deadline is in 1 week. [~kdatta1978] thanks for working on this. Can you write a high-level design document for this change?

Python bindings for GraphX
--------------------------

    Key: SPARK-3789
    URL: https://issues.apache.org/jira/browse/SPARK-3789
    Project: Spark
    Issue Type: New Feature
    Components: GraphX, PySpark
    Reporter: Ameet Talwalkar
    Assignee: Kushal Datta
[jira] [Commented] (SPARK-5411) Allow SparkListeners to be specified in SparkConf and loaded when creating SparkContext
[ https://issues.apache.org/jira/browse/SPARK-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292494#comment-14292494 ]

Apache Spark commented on SPARK-5411:
-------------------------------------

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4111
[jira] [Created] (SPARK-5411) Allow SparkListeners to be specified in SparkConf and loaded when creating SparkContext
Josh Rosen created SPARK-5411:
---------------------------------

    Summary: Allow SparkListeners to be specified in SparkConf and loaded when creating SparkContext
    Key: SPARK-5411
    URL: https://issues.apache.org/jira/browse/SPARK-5411
    Project: Spark
    Issue Type: New Feature
    Components: Spark Core
    Reporter: Josh Rosen
    Assignee: Josh Rosen

It would be nice if there was a mechanism to allow SparkListeners to be registered through SparkConf settings. This would allow monitoring frameworks to be easily injected into Spark programs without having to modify those programs' code.

I propose to introduce a new configuration option, {{spark.extraListeners}}, that allows SparkListeners to be specified in SparkConf and registered before the SparkContext is created. Here is the proposed documentation for the new option:

{quote}
A comma-separated list of classes that implement SparkListener; when initializing SparkContext, instances of these classes will be created and registered with Spark's listener bus. If a class has a single-argument constructor that accepts a SparkConf, that constructor will be called; otherwise, a zero-argument constructor will be called. If no valid constructor can be found, the SparkContext creation will fail with an exception.
{quote}
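A sketch of how the proposed option would be used, going only by the documentation quoted above; the class and app names are illustrative, and a real listener would need to be on the classpath under its fully-qualified name.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// Matches the proposal's single-argument constructor form.
class JobEndLogger(conf: SparkConf) extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"[${conf.get("spark.app.name")}] job ${jobEnd.jobId} ended")
}

// The listener is instantiated and registered during SparkContext creation.
val sc = new SparkContext(new SparkConf()
  .setAppName("listener-demo")
  .set("spark.extraListeners", "com.example.JobEndLogger"))
{code}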
[jira] [Commented] (SPARK-3789) Python bindings for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292499#comment-14292499 ] Kushal Datta commented on SPARK-3789: - Sure, I will write up the design document. @Ameet, do you think you can work from another branch that is not on 1.3? Python bindings for GraphX -- Key: SPARK-3789 URL: https://issues.apache.org/jira/browse/SPARK-3789 Project: Spark Issue Type: New Feature Components: GraphX, PySpark Reporter: Ameet Talwalkar Assignee: Kushal Datta -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently
[ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292431#comment-14292431 ] Sean Owen commented on SPARK-2688: -- As [~irashid] says, #1 is just syntactic sugar on what you can do already in Spark. I'm not clear, then, how something can need this functionality badly. Either it's really not blocking anything, and let's see that, or let's discuss what beyond #1 is actually needed. What I think people want is a miniature push-based evaluation method inside of Spark's pull-based DAG evaluation: force evaluation of N children of 1 parent at once. The outcome of a sidebar I had with Sandy on this was that it's probably a) fraught with gotchas, given the push-vs-pull mismatch, but not impossible, and b) would force the children to be persisted in the general case, with possible optimizations in other special cases. Is that the kind of thing Hive on Spark needs, and if so, can we hear a concrete elaboration of an example of this, so we can compare with what's possible now? I still sense there's a mismatch between the perception and the reality of what's possible with the current API. Hence, there may be some really good news here. Need a way to run multiple data pipeline concurrently - Key: SPARK-2688 URL: https://issues.apache.org/jira/browse/SPARK-2688 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.1 Reporter: Xuefu Zhang Suppose we want to do the following data processing: {code} rdd1 -> rdd2 -> rdd3 | -> rdd4 | -> rdd5 \ -> rdd6 {code} where -> represents a transformation. rdd3 to rdd6 are all derived from an intermediate rdd2. We use foreach(fn) with a dummy function to trigger the execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be recomputed. This is very inefficient. Ideally, we should be able to trigger the execution of the whole graph and reuse rdd2, but there doesn't seem to be a way of doing so. Tez already realized the importance of this (TEZ-391), so I think Spark should provide this too. This is required for Hive to support multi-insert queries. HIVE-7292. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
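For reference, the "#1 is just syntactic sugar" point corresponds to a pattern that is already expressible today: persist the shared parent, then submit the downstream actions from separate threads, since a single SparkContext accepts jobs from multiple threads. A minimal sketch, with {{map(identity)}} standing in for the real transformations from the diagram above:

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def runAllChildren(rdd1: RDD[Int]): Unit = {
  // Cache the shared intermediate so it is computed once, not once per child job.
  val rdd2 = rdd1.map(identity).persist(StorageLevel.MEMORY_AND_DISK)
  val children = Seq.tabulate(4)(_ => rdd2.map(identity)) // stand-ins for rdd3..rdd6

  // The four dummy actions run as concurrent jobs and all reuse the cached
  // rdd2, so rdd2 is not recomputed per child.
  val threads = children.map { rdd =>
    new Thread(new Runnable { override def run(): Unit = rdd.foreach(_ => ()) })
  }
  threads.foreach(_.start())
  threads.foreach(_.join())
}
{code}

This is exactly the kind of thing the comment argues is already possible; the open question is whether an API should do the persist-and-fan-out on the user's behalf.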
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292399#comment-14292399 ] Dmitriy Lyubimov commented on SPARK-5226: - All attempts to parallelize DBSCAN in the recent literature (or similar DeLiClu-type things) that I have read about involve partitioning the task into smaller subtasks, solving each individually, and merging it all back (see the MR.Scan paper for example). Merging is of course the new and tricky part. As far as I understand, they all pretty much limit their scope to Euclidean distances and capitalize on the notions of Euclidean geometry that follow from that, in order to solve the partition and merge problems, which substantially reduces the attractiveness of a general algorithm. However, a naive straightforward port of the simple DBSCAN algorithm is not terribly practical for big data because of the total complexity of the problem (or the impracticality of building something like a huge distributed R-tree index system on shared-nothing programming models). Add DBSCAN Clustering Algorithm to MLlib Key: SPARK-5226 URL: https://issues.apache.org/jira/browse/SPARK-5226 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Muhammad-Ali A'rabi Priority: Minor Labels: DBSCAN MLlib is all k-means now, and I think we should add some new clustering algorithms to it. The first candidate, I think, is DBSCAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently
[ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292415#comment-14292415 ] Xuefu Zhang commented on SPARK-2688: #1 above is exactly what Hive needs badly. Need a way to run multiple data pipeline concurrently - Key: SPARK-2688 URL: https://issues.apache.org/jira/browse/SPARK-2688 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.1 Reporter: Xuefu Zhang Suppose we want to do the following data processing: {code} rdd1 -> rdd2 -> rdd3 | -> rdd4 | -> rdd5 \ -> rdd6 {code} where -> represents a transformation. rdd3 to rdd6 are all derived from an intermediate rdd2. We use foreach(fn) with a dummy function to trigger the execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be recomputed. This is very inefficient. Ideally, we should be able to trigger the execution of the whole graph and reuse rdd2, but there doesn't seem to be a way of doing so. Tez already realized the importance of this (TEZ-391), so I think Spark should provide this too. This is required for Hive to support multi-insert queries. HIVE-7292. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3789) Python bindings for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292468#comment-14292468 ] Kushal Datta commented on SPARK-3789: - Hi Ameet, Sorry for asking this question again. What's the release plan for 1.3? -Kushal. Python bindings for GraphX -- Key: SPARK-3789 URL: https://issues.apache.org/jira/browse/SPARK-3789 Project: Spark Issue Type: New Feature Components: GraphX, PySpark Reporter: Ameet Talwalkar Assignee: Kushal Datta -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5416) Initialize Executor.threadPool before ExecutorSource
[ https://issues.apache.org/jira/browse/SPARK-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292784#comment-14292784 ] Apache Spark commented on SPARK-5416: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4212 Initialize Executor.threadPool before ExecutorSource Key: SPARK-5416 URL: https://issues.apache.org/jira/browse/SPARK-5416 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor I recently saw some NPEs from [{{ExecutorSource:44}}|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/ExecutorSource.scala#L44] in the first couple of seconds of my executors being initialized. I think that {{ExecutorSource}} was trying to report these metrics before its threadpool was initialized; there are a few LoC between the source being registered ([Executor.scala:82|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L82]) and the threadpool being initialized ([Executor.scala:106|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L106]). We should initialize the threadpool before the ExecutorSource is registered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
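The bug class is easy to reproduce in miniature: a Dropwizard gauge registered before the field it reads is initialized. A generic sketch of the fix ordering (not Spark's actual code; the class and metric names here are illustrative):

{code}
import java.util.concurrent.{Executors, ThreadPoolExecutor}
import com.codahale.metrics.{Gauge, MetricRegistry}

class Worker(registry: MetricRegistry) {
  // The fix, in miniature: initialize the pool BEFORE exposing gauges over it,
  // so a metrics poll that fires immediately cannot dereference null.
  val threadPool: ThreadPoolExecutor =
    Executors.newCachedThreadPool().asInstanceOf[ThreadPoolExecutor]

  registry.register("threadpool.activeTasks", new Gauge[Int] {
    override def getValue: Int = threadPool.getActiveCount
  })
}
{code}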
[jira] [Commented] (SPARK-3562) Periodic cleanup event logs
[ https://issues.apache.org/jira/browse/SPARK-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292818#comment-14292818 ] Apache Spark commented on SPARK-3562: - User 'viper-kun' has created a pull request for this issue: https://github.com/apache/spark/pull/4214 Periodic cleanup event logs --- Key: SPARK-3562 URL: https://issues.apache.org/jira/browse/SPARK-3562 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: xukun If we run Spark applications frequently, many event logs will be written into spark.eventLog.dir. After a long time, spark.eventLog.dir will contain many event logs that we no longer care about. Periodic cleanup will ensure that logs older than a given duration are forgotten, so there is no need to clean logs by hand. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan updated SPARK-3880: --- Attachment: SparkSQLOnHBase_v2.0.docx HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan Assignee: Yan Attachments: HBaseOnSpark.docx, SparkSQLOnHBase_v2.0.docx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan updated SPARK-3880: --- Attachment: (was: SparkSQLOnHBase_v2.docx) HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan Assignee: Yan Attachments: HBaseOnSpark.docx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292874#comment-14292874 ] Andrew Or commented on SPARK-5388: -- Hi Dale, thank you for your comments. Yes, in the design doc I used REST roughly interchangeably with HTTP/JSON. But the goal is not to provide a mechanism for other entities to communicate with the Master as you suggested; it is simply to provide a stable mechanism for Spark to work across multiple versions. For instance, you might have a long-running Master that outlives multiple Spark versions, in which case we want to guarantee that newer versions of Spark will still be able to submit to the long-running Master. I think your proposal to make this more REST-like is potentially a great idea. However, I find the alternative of simply putting the action in the JSON itself easier to reason about. This also allows us to add other messages in the future that are not strictly limited to the semantics of GET, POST, and DELETE. That said, my proposal is also not set in stone yet so if there is a reason compelling enough to change it then I will do so. Also, a first-cut implementation of my design is now posted at: https://github.com/apache/spark/pull/4216. Please take a look if you feel inclined. Provide a stable application submission gateway --- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: Stable Spark Standalone Submission.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. The first cut implementation will target standalone cluster mode because there are very few messages exchanged. The design, however, will be general enough to eventually support this for other cluster managers too. Note that this is not necessarily required in YARN because we already use YARN's stable interface to submit applications there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
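For illustration only, the "action in the JSON itself" alternative might look roughly like the following; the field and message names are hypothetical, not taken from the design doc:

{code}
// Hypothetical submission message: the message type travels as a payload
// field, so new message kinds can be added later without having to map each
// one onto the semantics of an HTTP verb (GET/POST/DELETE).
case class SubmissionMessage(
  action: String,             // e.g. "SubmitDriver", "KillDriver", "RequestDriverStatus"
  clientSparkVersion: String, // lets old and new Spark versions interoperate
  appResource: String,        // path to the application jar
  mainClass: String)
{code}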
[jira] [Comment Edited] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292853#comment-14292853 ] Joseph Tang edited comment on SPARK-4846 at 1/27/15 2:46 AM: - Sorry about the procrastination. I just thought you meant there is no need to implement a dynamic strategy. I'm still working on it and I'd like to quickly fix this issue. Regarding your previous comment, should I throw a customized error in Spark or just an OOM besides the hint about minCount and vectorSize? was (Author: josephtang): Sorry about the procrastination. I'm still working on this. Regarding your previous comment, should I throw a customized error in Spark or just an OOM besides the hint about minCount and vectorSize? When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus(sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5422) Support sending to Graphite via UDP
[ https://issues.apache.org/jira/browse/SPARK-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292925#comment-14292925 ] Apache Spark commented on SPARK-5422: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4218 Support sending to Graphite via UDP --- Key: SPARK-5422 URL: https://issues.apache.org/jira/browse/SPARK-5422 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor {{io.dropwizard.metrics-graphite}} version {{3.1.0}} can send metrics to Graphite via UDP or TCP. After upgrading ([SPARK-5413|https://issues.apache.org/jira/browse/SPARK-5413]), we should support using this facility, presumably specified via a protocol field in the metrics config file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
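For context, a sketch of what the upgraded library itself offers, using the Dropwizard API directly (the host and port are placeholders); the exact Spark config syntax is left open by this issue:

{code}
import java.net.InetSocketAddress
import java.util.concurrent.TimeUnit
import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.graphite.{GraphiteReporter, GraphiteUDP}

val registry = new MetricRegistry()
// GraphiteUDP is the 3.1.0 UDP counterpart of the TCP Graphite sender.
val sender = new GraphiteUDP(new InetSocketAddress("graphite.example.com", 2003))
val reporter = GraphiteReporter.forRegistry(registry)
  .convertRatesTo(TimeUnit.SECONDS)
  .convertDurationsTo(TimeUnit.MILLISECONDS)
  .build(sender)
reporter.start(10, TimeUnit.SECONDS) // report every 10 seconds over UDP
{code}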
[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292926#comment-14292926 ] Joseph Tang commented on SPARK-4846: I've added some code at https://github.com/jinntrance/spark/compare/w2v-fix?diff=splitname=w2v-fix If it looks OK, I will send a new PR to the `master` branch. When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292998#comment-14292998 ] Luca Morandini commented on SPARK-1405: --- Indeed, I have a couple of students whose assignments involve Twitter data, and I am considering adding LDA to the mix. I would like to test it on our corpus... provided this feature is usable by a Spark novice: is it? parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5417) Remove redundant executor-ID set() call
Ryan Williams created SPARK-5417: Summary: Remove redundant executor-ID set() call Key: SPARK-5417 URL: https://issues.apache.org/jira/browse/SPARK-5417 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor {{spark.executor.id}} no longer [needs to be set in Executor.scala|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L79], as of [#4194|https://github.com/apache/spark/pull/4194]; it is set upstream in [SparkEnv|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/SparkEnv.scala#L332]. Might as well remove the redundant set() in Executor.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5119) java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model
[ https://issues.apache.org/jira/browse/SPARK-5119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5119: - Assignee: Kai Sasaki java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model --- Key: SPARK-5119 URL: https://issues.apache.org/jira/browse/SPARK-5119 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 1.1.0, 1.2.0 Environment: Linux Ubuntu 14.04 Reporter: Vivek Kulkarni Assignee: Kai Sasaki Fix For: 1.3.0 First I tried to see if there was a bug raised before with a similar trace. I found https://www.mail-archive.com/user@spark.apache.org/msg13708.html but the suggestion to upgrade to the latest code base (I cloned from the master branch) does not fix this issue. Issue: try to train a decision tree classifier on some data. After training, when it begins to collect, it crashes: 15/01/06 22:28:15 INFO BlockManagerMaster: Updated info of block rdd_52_1 15/01/06 22:28:15 ERROR Executor: Exception in task 1.0 in stage 31.0 (TID 1895) java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.mllib.tree.impurity.GiniAggregator.update(Gini.scala:93) at org.apache.spark.mllib.tree.impl.DTStatsAggregator.update(DTStatsAggregator.scala:100) at org.apache.spark.mllib.tree.DecisionTree$.orderedBinSeqOp(DecisionTree.scala:419) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$nodeBinSeqOp$1(DecisionTree.scala:511) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:536) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:533) at scala.collection.immutable.Map$Map1.foreach(Map.scala:109) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:533) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628) at scala.collection.Iterator$class.foreach(Iterator.scala:727) Minimal code: data = MLUtils.loadLibSVMFile(sc, '/scratch1/vivek/datasets/private/a1a').cache() model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={}, maxDepth=5, maxBins=100) Just download the data from: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292853#comment-14292853 ] Joseph Tang commented on SPARK-4846: Sorry about the procrastination. I'm still working on this. Regarding your previous comment, should I throw a customized error in Spark or just an OOM besides the hint about minCount and vectorSize? When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292855#comment-14292855 ] Joseph Tang commented on SPARK-4846: Sorry about the procrastination. I'm still working on this. Regarding your previous comment, should I throw a customized error in Spark or just an OOM besides the hint about minCount and vectorSize? When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4979) Add streaming logistic regression
[ https://issues.apache.org/jira/browse/SPARK-4979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4979: - Assignee: Jeremy Freeman Add streaming logistic regression - Key: SPARK-4979 URL: https://issues.apache.org/jira/browse/SPARK-4979 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Jeremy Freeman Assignee: Jeremy Freeman Priority: Minor We currently support streaming linear regression and k-means clustering. We can add support for streaming logistic regression using a strategy similar to that used in streaming linear regression, applying gradient updates to batches of data from a DStream, and extending the existing mllib methods with minor modifications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
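A sketch of the kind of API this implies, mirroring the existing StreamingLinearRegressionWithSGD; the class name below is the natural analogue and is assumed here, since the feature is not merged yet:

{code}
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.dstream.DStream

def run(training: DStream[LabeledPoint],
        test: DStream[LabeledPoint],
        numFeatures: Int): Unit = {
  val model = new StreamingLogisticRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(numFeatures))
  model.trainOn(training) // one gradient update per incoming batch
  model.predictOnValues(test.map(lp => (lp.label, lp.features))).print()
}
{code}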
[jira] [Updated] (SPARK-5421) SparkSql throw OOM at shuffle
[ https://issues.apache.org/jira/browse/SPARK-5421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-5421: - Description: ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, but SparkSQL's shuffledRDD doesn't define an aggregator, so SparkSQL won't spill at shuffle, and it's very easy to throw OOM at shuffle. One of the executors' logs; here is stderr: 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484] 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 non-empty blocks out of 143 blocks 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote fetches in 72 ms 15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM here is stdout: 2015-01-27T07:44:43.487+0800: [Full GC 3961343K->3959868K(3961344K), 29.8959290 secs] 2015-01-27T07:45:13.460+0800: [Full GC 3961343K->3959992K(3961344K), 27.9218150 secs] 2015-01-27T07:45:41.407+0800: [GC 3960347K(3961344K), 3.0457450 secs] 2015-01-27T07:45:52.950+0800: [Full GC 3961343K->3960113K(3961344K), 29.3894670 secs] 2015-01-27T07:46:22.393+0800: [Full GC 3961118K->3960240K(3961344K), 28.9879600 secs] 2015-01-27T07:46:51.393+0800: [Full GC 3960240K->3960213K(3961344K), 34.1530900 secs] # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=kill %p # Executing /bin/sh -c kill 9050... 2015-01-27T07:47:25.921+0800: [GC 3960214K(3961344K), 3.3959300 secs] was: ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, but SparkSQL's shuffledRDD doesn't define an aggregator, so SparkSQL won't spill at shuffle, and it's very easy to throw OOM at shuffle. SparkSql throw OOM at shuffle - Key: SPARK-5421 URL: https://issues.apache.org/jira/browse/SPARK-5421 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, but SparkSQL's shuffledRDD doesn't define an aggregator, so SparkSQL won't spill at shuffle, and it's very easy to throw OOM at shuffle. 
One of the executors' logs; here is stderr: 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484] 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 non-empty blocks out of 143 blocks 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote fetches in 72 ms 15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM here is stdout: 2015-01-27T07:44:43.487+0800: [Full GC 3961343K->3959868K(3961344K), 29.8959290 secs] 2015-01-27T07:45:13.460+0800: [Full GC 3961343K->3959992K(3961344K), 27.9218150 secs] 2015-01-27T07:45:41.407+0800: [GC 3960347K(3961344K), 3.0457450 secs] 2015-01-27T07:45:52.950+0800: [Full GC 3961343K->3960113K(3961344K), 29.3894670 secs] 2015-01-27T07:46:22.393+0800: [Full GC 3961118K->3960240K(3961344K), 28.9879600 secs] 2015-01-27T07:46:51.393+0800: [Full GC 3960240K->3960213K(3961344K), 34.1530900 secs] # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=kill %p # Executing /bin/sh -c kill 9050... 2015-01-27T07:47:25.921+0800: [GC 3960214K(3961344K), 3.3959300 secs] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5421) SparkSql throw OOM at shuffle
[ https://issues.apache.org/jira/browse/SPARK-5421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-5421: - Description: ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, but SparkSQL's shuffledRDD doesn't define an aggregator, so SparkSQL won't spill at shuffle, and it's very easy to throw OOM at shuffle. I think SparkSQL also needs to spill at shuffle. One of the executors' logs; here is stderr: 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484] 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 non-empty blocks out of 143 blocks 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote fetches in 72 ms 15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM here is stdout: 2015-01-27T07:44:43.487+0800: [Full GC 3961343K->3959868K(3961344K), 29.8959290 secs] 2015-01-27T07:45:13.460+0800: [Full GC 3961343K->3959992K(3961344K), 27.9218150 secs] 2015-01-27T07:45:41.407+0800: [GC 3960347K(3961344K), 3.0457450 secs] 2015-01-27T07:45:52.950+0800: [Full GC 3961343K->3960113K(3961344K), 29.3894670 secs] 2015-01-27T07:46:22.393+0800: [Full GC 3961118K->3960240K(3961344K), 28.9879600 secs] 2015-01-27T07:46:51.393+0800: [Full GC 3960240K->3960213K(3961344K), 34.1530900 secs] # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=kill %p # Executing /bin/sh -c kill 9050... 2015-01-27T07:47:25.921+0800: [GC 3960214K(3961344K), 3.3959300 secs] was: ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, but SparkSQL's shuffledRDD doesn't define an aggregator, so SparkSQL won't spill at shuffle, and it's very easy to throw OOM at shuffle. One of the executors' logs; here is stderr: 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484] 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 non-empty blocks out of 143 blocks 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote fetches in 72 ms 15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM here is stdout: 2015-01-27T07:44:43.487+0800: [Full GC 3961343K->3959868K(3961344K), 29.8959290 secs] 2015-01-27T07:45:13.460+0800: [Full GC 3961343K->3959992K(3961344K), 27.9218150 secs] 2015-01-27T07:45:41.407+0800: [GC 3960347K(3961344K), 3.0457450 secs] 2015-01-27T07:45:52.950+0800: [Full GC 3961343K->3960113K(3961344K), 29.3894670 secs] 2015-01-27T07:46:22.393+0800: [Full GC 3961118K->3960240K(3961344K), 28.9879600 secs] 2015-01-27T07:46:51.393+0800: [Full GC 3960240K->3960213K(3961344K), 34.1530900 secs] # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=kill %p # Executing /bin/sh -c kill 9050... 
2015-01-27T07:47:25.921+0800: [GC 3960214K(3961344K), 3.3959300 secs] SparkSql throw OOM at shuffle - Key: SPARK-5421 URL: https://issues.apache.org/jira/browse/SPARK-5421 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap is only used for Spark jobs whose aggregator isDefined, but SparkSQL's shuffledRDD doesn't define an aggregator, so SparkSQL won't spill at shuffle, and it's very easy to throw OOM at shuffle. I think SparkSQL also needs to spill at shuffle. One of the executors' logs; here is stderr: 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484] 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 non-empty blocks out of 143 blocks 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote fetches in 72 ms 15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM here is stdout: 2015-01-27T07:44:43.487+0800: [Full GC 3961343K->3959868K(3961344K), 29.8959290 secs] 2015-01-27T07:45:13.460+0800: [Full GC 3961343K->3959992K(3961344K), 27.9218150 secs] 2015-01-27T07:45:41.407+0800: [GC
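A simplified sketch of the distinction being reported (not the actual shuffle-reader code): only the aggregating path runs through a spillable external map, so a plain pass-through shuffle keeps everything on the heap.

{code}
// combine is Some(...) when the shuffle dependency defines an aggregator.
def readShuffle[K, V](fetched: Iterator[(K, V)],
                      combine: Option[(V, V) => V]): Iterator[(K, V)] =
  combine match {
    case Some(merge) =>
      // Stand-in for ExternalAppendOnlyMap: values are merged per key, and the
      // real map can spill its contents to disk under memory pressure.
      fetched.foldLeft(Map.empty[K, V]) { case (m, (k, v)) =>
        m.updated(k, m.get(k).map(merge(_, v)).getOrElse(v))
      }.iterator
    case None =>
      fetched // pass-through: nothing can spill, hence the OOM risk described
  }
{code}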
[jira] [Updated] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3726: - Target Version/s: 1.3.0 RandomForest: Support for bootstrap options --- Key: SPARK-3726 URL: https://issues.apache.org/jira/browse/SPARK-3726 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Manoj Kumar Priority: Minor Fix For: 1.3.0 RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. The expected size of each sample is the same as the original data (sampling rate = 1.0), and sampling is done with replacement. Adding support for other sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
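For reference, BaggedPoint-style bootstrapping is typically simulated with per-point Poisson counts, so the requested options reduce to changing the Poisson mean (for other sampling rates) or swapping in a Bernoulli draw (for sampling without replacement). A sketch, assuming commons-math3 is on the classpath:

{code}
import org.apache.commons.math3.distribution.PoissonDistribution

// With replacement: each point's multiplicity in a sample is ~ Poisson(rate);
// rate = 1.0 gives the classic same-expected-size bootstrap described above.
def bootstrapCounts(numPoints: Int, rate: Double, seed: Long): Array[Int] = {
  val poisson = new PoissonDistribution(rate)
  poisson.reseedRandomGenerator(seed)
  Array.fill(numPoints)(poisson.sample())
}

// Without replacement: a Bernoulli draw yields 0/1 multiplicities instead.
def subsampleCounts(numPoints: Int, rate: Double, rng: scala.util.Random): Array[Int] =
  Array.fill(numPoints)(if (rng.nextDouble() < rate) 1 else 0)
{code}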
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293001#comment-14293001 ] Joseph K. Bradley commented on SPARK-1405: -- It has not yet been merged into Spark master, but hopefully will be soon. The initial version should be usable. We will continue to add improvements to the API, especially helper functionality such as prediction and summaries about the learned model. parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5416) Initialize Executor.threadPool before ExecutorSource
Ryan Williams created SPARK-5416: Summary: Initialize Executor.threadPool before ExecutorSource Key: SPARK-5416 URL: https://issues.apache.org/jira/browse/SPARK-5416 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor I recently saw some NPEs from [{{ExecutorSource:44}}|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/ExecutorSource.scala#L44] in the first couple of seconds of my executors being initialized. I think that {{ExecutorSource}} was trying to report these metrics before its threadpool was initialized; there are a few LoC between the source being registered ([Executor.scala:82|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L82]) and the threadpool being initialized ([Executor.scala:106|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L106]). We should initialize the threadpool before the ExecutorSource is registered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5341) Support maven coordinates in spark-shell and spark-submit
[ https://issues.apache.org/jira/browse/SPARK-5341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292826#comment-14292826 ] Apache Spark commented on SPARK-5341: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/4215 Support maven coordinates in spark-shell and spark-submit - Key: SPARK-5341 URL: https://issues.apache.org/jira/browse/SPARK-5341 Project: Spark Issue Type: New Feature Components: Deploy, Spark Shell Reporter: Burak Yavuz This feature will allow users to provide the Maven coordinates of jars they wish to use in their Spark application. Coordinates can be a comma-delimited list, supplied like: ```spark-submit --maven org.apache.example.a,org.apache.example.b``` This feature will also be added to spark-shell (where it is even more critical). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5119) java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model
[ https://issues.apache.org/jira/browse/SPARK-5119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5119: - Target Version/s: 1.3.0 java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model --- Key: SPARK-5119 URL: https://issues.apache.org/jira/browse/SPARK-5119 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 1.1.0, 1.2.0 Environment: Linux Ubuntu 14.04 Reporter: Vivek Kulkarni Assignee: Kai Sasaki Fix For: 1.3.0 First I tried to see if there was a bug raised before with a similar trace. I found https://www.mail-archive.com/user@spark.apache.org/msg13708.html but the suggestion to upgrade to the latest code base (I cloned from the master branch) does not fix this issue. Issue: try to train a decision tree classifier on some data. After training, when it begins to collect, it crashes: 15/01/06 22:28:15 INFO BlockManagerMaster: Updated info of block rdd_52_1 15/01/06 22:28:15 ERROR Executor: Exception in task 1.0 in stage 31.0 (TID 1895) java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.mllib.tree.impurity.GiniAggregator.update(Gini.scala:93) at org.apache.spark.mllib.tree.impl.DTStatsAggregator.update(DTStatsAggregator.scala:100) at org.apache.spark.mllib.tree.DecisionTree$.orderedBinSeqOp(DecisionTree.scala:419) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$nodeBinSeqOp$1(DecisionTree.scala:511) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:536) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:533) at scala.collection.immutable.Map$Map1.foreach(Map.scala:109) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:533) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628) at scala.collection.Iterator$class.foreach(Iterator.scala:727) Minimal code: data = MLUtils.loadLibSVMFile(sc, '/scratch1/vivek/datasets/private/a1a').cache() model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={}, maxDepth=5, maxBins=100) Just download the data from: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5418) Output directory for shuffle should consider left space of each directory set in conf
ding created SPARK-5418: --- Summary: Output directory for shuffle should consider left space of each directory set in conf Key: SPARK-5418 URL: https://issues.apache.org/jira/browse/SPARK-5418 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Environment: Ubuntu, others should be similar Reporter: ding Priority: Minor I set multiple directories in conf spark.local.dir as scratch space; one of them (e.g. /mnt/disk1) has 30G of free space while the others (e.g. /mnt/disk2) have 100G. In the current version, Spark uses a hash to figure out which directory is used for scratch space, which means each directory has the same chance of being chosen. After hundreds of iterations of PageRank, there is a No space left exception and the driver crashes. This does not make sense, since there is still 70G+ of free space in the other directories. We should take the free space of each directory into consideration when figuring out which directory should be the map output dir. I will send a PR for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
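A hypothetical sketch of the idea (not the actual PR): weight the directory choice by reported usable space instead of hashing uniformly, so a nearly full disk is picked rarely rather than 1/N of the time.

{code}
import java.io.File
import scala.util.Random

// Choose a scratch directory with probability proportional to its free space.
def chooseLocalDir(dirs: Seq[File], rng: Random = new Random): File = {
  val free = dirs.map(_.getUsableSpace.max(1L))
  var pick = (rng.nextDouble() * free.sum).toLong
  dirs.zip(free)
    .find { case (_, space) => pick -= space; pick < 0 }
    .map(_._1)
    .getOrElse(dirs.last)
}
{code}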
[jira] [Resolved] (SPARK-5119) java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model
[ https://issues.apache.org/jira/browse/SPARK-5119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5119. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3975 [https://github.com/apache/spark/pull/3975] java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model --- Key: SPARK-5119 URL: https://issues.apache.org/jira/browse/SPARK-5119 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 1.1.0, 1.2.0 Environment: Linux Ubuntu 14.04 Reporter: Vivek Kulkarni Fix For: 1.3.0 First I tried to see if there was a bug raised before with a similar trace. I found https://www.mail-archive.com/user@spark.apache.org/msg13708.html but the suggestion to upgrade to the latest code base (I cloned from the master branch) does not fix this issue. Issue: try to train a decision tree classifier on some data. After training, when it begins to collect, it crashes: 15/01/06 22:28:15 INFO BlockManagerMaster: Updated info of block rdd_52_1 15/01/06 22:28:15 ERROR Executor: Exception in task 1.0 in stage 31.0 (TID 1895) java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.mllib.tree.impurity.GiniAggregator.update(Gini.scala:93) at org.apache.spark.mllib.tree.impl.DTStatsAggregator.update(DTStatsAggregator.scala:100) at org.apache.spark.mllib.tree.DecisionTree$.orderedBinSeqOp(DecisionTree.scala:419) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$nodeBinSeqOp$1(DecisionTree.scala:511) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:536) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:533) at scala.collection.immutable.Map$Map1.foreach(Map.scala:109) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:533) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628) at scala.collection.Iterator$class.foreach(Iterator.scala:727) Minimal code: data = MLUtils.loadLibSVMFile(sc, '/scratch1/vivek/datasets/private/a1a').cache() model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={}, maxDepth=5, maxBins=100) Just download the data from: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292866#comment-14292866 ] Apache Spark commented on SPARK-5388: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/4216 Provide a stable application submission gateway --- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: Stable Spark Standalone Submission.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. The first cut implementation will target standalone cluster mode because there are very few messages exchanged. The design, however, will be general enough to eventually support this for other cluster managers too. Note that this is not necessarily required in YARN because we already use YARN's stable interface to submit applications there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293019#comment-14293019 ] Aniket Bhatnagar commented on SPARK-2243: - I am also interested in having this fixed. Can someone please outline what are specific things that need to be fixed to make this work so that interested people can contribute? Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 0.7.0, 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at 
java.lang.Thread.run(Thread.java:745) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at
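The failure above is characteristic of two drivers sharing one JVM: the second context resolves broadcast data against state owned by the first. A minimal sketch of the two-context lifecycle under discussion (the local master and job bodies are illustrative placeholders; the key point is stopping the first context before creating the second):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoContexts {
  def main(args: Array[String]): Unit = {
    // First context: a plain Spark calculation.
    val batchSc = new SparkContext(new SparkConf().setAppName("calc").setMaster("local[2]"))
    println(batchSc.parallelize(1 to 100).sum())
    // Stop the first context before building the second; running two live
    // contexts in one JVM is what produces errors like the broadcast
    // FileNotFoundException in the trace above.
    batchSc.stop()

    // Second context: a streaming job, created only after the first stopped.
    val ssc = new StreamingContext(
      new SparkConf().setAppName("stream").setMaster("local[2]"), Seconds(10))
    // ... attach input streams and output operations here ...
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}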
[jira] [Updated] (SPARK-5420) Cross-language load/store functions for creating and saving DataFrames
[ https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5420: --- Summary: Cross-language load/store functions for creating and saving DataFrames (was: Create cross-language load/store functions for creating and saving DataFrames) Cross-language load/store functions for creating and saving DataFrames -- Key: SPARK-5420 URL: https://issues.apache.org/jira/browse/SPARK-5420 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Patrick Wendell We should have standard APIs for loading or saving a table from a data store. One idea: {code} df = sc.loadTable(path.to.DataSource, {a: b, c: d}) sc.storeTable(path.to.DataSource, {a: b, c: d}) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
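The snippet in the ticket is pseudocode. A rough Scala sketch of the call shape being proposed (the `DataFrame` stand-in and the `TableIO` trait below are illustrative, not actual Spark classes):

{code}
// Illustrative stand-in; not the real Spark SQL DataFrame.
case class DataFrame(rows: Seq[Map[String, Any]])

// Sketch of the proposed cross-language entry points: the data source is
// named by a plain string (resolvable from any language binding) and the
// options travel as a string-to-string map, so Python dicts and Java/Scala
// maps present the same shape.
trait TableIO {
  def loadTable(source: String, options: Map[String, String]): DataFrame
  def storeTable(df: DataFrame, source: String, options: Map[String, String]): Unit
}
{code}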
[jira] [Issue Comment Deleted] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Tang updated SPARK-4846: --- Comment: was deleted (was: Sorry about the procrastination. I'm still working on this. Regarding your previous comment, should I throw an customized error in Spark or just OOM besides the hint about minCount and vectorSize? ) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus(sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
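To see why a 10-million-word vocabulary trips the VM array limit, a quick back-of-envelope check (the assumption that Word2Vec keeps Float weight arrays of length vocabSize * vectorSize is mine, inferred from the trace above):

{code}
// Back-of-envelope check for the sizes in this report (Scala REPL snippet).
val vocabSize = 10000000L // ~10 million words, as in the environment above
val vectorSize = 100
val elements = vocabSize * vectorSize // 1.0e9 floats per weight array
println(s"$elements elements, ~${elements * 4 / (1L << 30)} GiB per Float array")
// A single JVM array holds at most ~Integer.MAX_VALUE (~2.1e9) elements.
// Serializing such an array through ByteArrayOutputStream (as in the stack
// trace) needs one contiguous byte[] of the full serialized size, which is
// exactly where "Requested array size exceeds VM limit" is thrown.
{code}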
[jira] [Comment Edited] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292853#comment-14292853 ] Joseph Tang edited comment on SPARK-4846 at 1/27/15 2:44 AM: - Sorry about the procrastination. I'm still working on this. Regarding your previous comment, should I throw a customized error in Spark or just an OOM besides the hint about minCount and vectorSize? was (Author: josephtang): Sorry about the procrastination. I'm still working on this. Regarding your previous comment, should I throw an customized error in Spark or just OOM besides the hint about minCount and vectorSize? When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus(sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292886#comment-14292886 ] Joseph Tang commented on SPARK-4846: Hi Xiangrui, there is a problem: PR #3693, which added `setMinCount`, was merged to the `master` branch, while my PR #3697 was sent to `branch-1.1`. Would it be better to close PR #3697 and send a new PR based on PR #3693? When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5420) Create cross-language load/store functions for creating and saving DataFrames
Patrick Wendell created SPARK-5420: -- Summary: Create cross-language load/store functions for creating and saving DataFrames Key: SPARK-5420 URL: https://issues.apache.org/jira/browse/SPARK-5420 Project: Spark Issue Type: Sub-task Reporter: Patrick Wendell We should have standard APIs for loading or saving a table from a data store. One idea: {code} df = sc.loadTable(path.to.DataSource, {a: b, c: d}) sc.storeTable(path.to.DataSource, {a: b, c: d}) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion
[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292913#comment-14292913 ] Sven Krasser commented on SPARK-5395: - Some additional findings from my side: I've managed to trigger the problem using a simpler job on production data that basically does a reduceByKey followed by a count action. I get 20 workers (2 cores per executor) before any tasks in the first stage (reduceByKey) complete (i.e., different from the stage-transition behavior you noticed). However, this doesn't occur if I run over a smaller data set, i.e. fewer production data files. Before calling reduceByKey I have a coalesce call; without it, the error does not occur (at least in this smaller script). At first glance this looked potentially spilling-related (more data per task), but attempting to force spills by setting the worker memory very low did not help my attempts to reproduce it on test data. Large number of Python workers causing resource depletion - Key: SPARK-5395 URL: https://issues.apache.org/jira/browse/SPARK-5395 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: AWS ElasticMapReduce Reporter: Sven Krasser During job execution a large number of Python workers accumulates, eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G for overhead per container). In this instance, 97 pyspark.daemon processes had accumulated at the time the container was killed. {noformat} 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_30] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container. Dump of the process-tree for container_1421692415636_0052_01_30 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon [...] {noformat} The configuration uses 64 containers with 2 cores each. Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c Mailing list discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
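For reference, the job shape described in this comment looks roughly like the following (the original job is PySpark; this is a hedged Scala rendering with placeholder names, paths, and partition counts):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object WorkerBuildupRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repro"))
    val lines = sc.textFile("hdfs:///path/to/production/files")
    val counts = lines
      .coalesce(64)                     // the coalesce implicated above
      .map(line => (line.take(16), 1L)) // placeholder keying
      .reduceByKey(_ + _)               // first stage boundary
    println(counts.count())             // the count action
    sc.stop()
  }
}
{code}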
[jira] [Resolved] (SPARK-5052) com.google.common.base.Optional binary has a wrong method signature
[ https://issues.apache.org/jira/browse/SPARK-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5052. Resolution: Fixed Fix Version/s: 1.3.0 com.google.common.base.Optional binary has a wrong method signature Key: SPARK-5052 URL: https://issues.apache.org/jira/browse/SPARK-5052 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Elmer Garduno Fix For: 1.3.0 PR https://github.com/apache/spark/pull/1813 shaded the Guava jar file and moved Guava classes to the package org.spark-project.guava when Spark is built by Maven. When a user jar uses the actual com.google.common.base.Optional<V> transform(com.google.common.base.Function<? super T, V>) method from Guava, a java.lang.NoSuchMethodError: com.google.common.base.Optional.transform(Lcom/google/common/base/Function;)Lcom/google/common/base/Optional; is thrown. The reason seems to be that the Optional class included in spark-assembly-1.2.0-hadoop1.0.4.jar has an incorrect method signature that includes the shaded class as an argument: Expected: javap -classpath target/scala-2.10/googlegenomics-spark-examples-assembly-1.0.jar com.google.common.base.Optional public abstract <V extends java/lang/Object> com.google.common.base.Optional<V> transform(com.google.common.base.Function<? super T, V>); Found: javap -classpath lib/spark-assembly-1.2.0-hadoop1.0.4.jar com.google.common.base.Optional public abstract <V extends java/lang/Object> com.google.common.base.Optional<V> transform(org.spark-project.guava.common.base.Function<? super T, V>); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
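A hedged sketch of user code that hits this (the jar names above are from the report; the snippet compiles against real Guava and then fails at runtime against the assembly's shaded signature):

{code}
import com.google.common.base.{Function, Optional}

object GuavaOptionalRepro {
  def main(args: Array[String]): Unit = {
    val opt: Optional[Integer] = Optional.of(21)
    // Compiled against real Guava this links to
    // transform(com.google.common.base.Function), but the assembly's
    // Optional expects the shaded org.spark-project.guava Function,
    // so this call throws NoSuchMethodError at runtime.
    val doubled = opt.transform(new Function[Integer, Integer] {
      override def apply(x: Integer): Integer = x * 2
    })
    println(doubled.get()) // 42 when the signatures actually match
  }
}
{code}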
[jira] [Updated] (SPARK-5052) com.google.common.base.Optional binary has a wrong method signature
[ https://issues.apache.org/jira/browse/SPARK-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5052: --- Assignee: Elmer Garduno com.google.common.base.Optional binary has a wrong method signature Key: SPARK-5052 URL: https://issues.apache.org/jira/browse/SPARK-5052 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Elmer Garduno Assignee: Elmer Garduno Fix For: 1.3.0 PR https://github.com/apache/spark/pull/1813 shaded the Guava jar file and moved Guava classes to the package org.spark-project.guava when Spark is built by Maven. When a user jar uses the actual com.google.common.base.Optional<V> transform(com.google.common.base.Function<? super T, V>) method from Guava, a java.lang.NoSuchMethodError: com.google.common.base.Optional.transform(Lcom/google/common/base/Function;)Lcom/google/common/base/Optional; is thrown. The reason seems to be that the Optional class included in spark-assembly-1.2.0-hadoop1.0.4.jar has an incorrect method signature that includes the shaded class as an argument: Expected: javap -classpath target/scala-2.10/googlegenomics-spark-examples-assembly-1.0.jar com.google.common.base.Optional public abstract <V extends java/lang/Object> com.google.common.base.Optional<V> transform(com.google.common.base.Function<? super T, V>); Found: javap -classpath lib/spark-assembly-1.2.0-hadoop1.0.4.jar com.google.common.base.Optional public abstract <V extends java/lang/Object> com.google.common.base.Optional<V> transform(org.spark-project.guava.common.base.Function<? super T, V>); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5261) In some cases, the value of a word's vector representation is too big
[ https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292823#comment-14292823 ] Guoqiang Li commented on SPARK-5261: [~lewuathe] {code}
normalize_text() {
  awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
  -e 's/"/ " /g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9 " "
}

wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text < news.2013.en.shuffled > data.txt
{code} In some cases, the value of a word's vector representation is too big --- Key: SPARK-5261 URL: https://issues.apache.org/jira/browse/SPARK-5261 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Guoqiang Li {code} val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36) {code} The average absolute value of the word's vector representation is 60731.8 {code} val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(1) {code} The average absolute value of the word's vector representation is 0.13889 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
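For reproduction purposes, one plausible way to compute the "average absolute value" metric quoted above (getVectors is the MLlib Word2VecModel accessor returning Map[String, Array[Float]]; the metric definition itself is my reading of the report):

{code}
import org.apache.spark.mllib.feature.Word2VecModel

// Mean of |component| over every component of every word vector.
def meanAbsValue(model: Word2VecModel): Double = {
  val components = model.getVectors.values.flatten.map(v => math.abs(v.toDouble))
  components.sum / components.size
}
{code}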
[jira] [Commented] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292896#comment-14292896 ] Saisai Shao commented on SPARK-5206: IMHO this is a general problem in Spark Streaming: any variable that must be registered on both the driver and executor sides will lead to an error when recovering from failure if its readObject does not re-register it on the driver. Objects like broadcast variables will also hit exceptions when recovering from a checkpoint, since the actual data is lost on the executor side, and recovery from the driver side is not possible if I understand correctly. Accumulators are not re-registered during recovering from checkpoint Key: SPARK-5206 URL: https://issues.apache.org/jira/browse/SPARK-5206 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: vincent ye I got the following exception while my streaming application restarts from a crash using a checkpoint: 15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, 4) java.util.NoSuchElementException: key not found: 1 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) I guess that an Accumulator is registered to the singleton Accumulators in line 58 of org.apache.spark.Accumulable: Accumulators.register(this, true). This code needs to be executed in the driver once, but when the application is recovered from a checkpoint it won't be executed in the driver. So when the driver processes it at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938), it can't find the Accumulator because it wasn't re-registered during the recovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
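A hedged sketch of the failure pattern being described (standard streaming-checkpoint APIs; the accumulator usage and paths are illustrative):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object AccumulatorRecovery {
  def createContext(dir: String): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("acc"), Seconds(10))
    ssc.checkpoint(dir)
    // The accumulator registers itself with the driver-side singleton
    // here, and only here.
    val counter = ssc.sparkContext.accumulator(0L, "records")
    // ... build DStreams whose tasks do counter += 1 ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // After a crash, getOrCreate restores the graph from the checkpoint and
    // never calls createContext, so tasks deserialize the accumulator but
    // the driver never re-registers it -- the mismatch described above.
    val ssc = StreamingContext.getOrCreate("/tmp/ckpt", () => createContext("/tmp/ckpt"))
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}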
[jira] [Comment Edited] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292926#comment-14292926 ] Joseph Tang edited comment on SPARK-4846 at 1/27/15 3:42 AM: - I've added some code at https://github.com/jinntrance/spark/compare/w2v-fix?diff=split&name=w2v-fix If it's OK, I will send a new PR to the `master` branch. BTW, sorry for the poor readability of the diff caused by the space indentation. was (Author: josephtang): I've added some code at https://github.com/jinntrance/spark/compare/w2v-fix?diff=split&name=w2v-fix If it's OK, I would send a new PR to the branch `master`. When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5417) Remove redundant executor-ID set() call
[ https://issues.apache.org/jira/browse/SPARK-5417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292790#comment-14292790 ] Apache Spark commented on SPARK-5417: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4213 Remove redundant executor-ID set() call --- Key: SPARK-5417 URL: https://issues.apache.org/jira/browse/SPARK-5417 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor {{spark.executor.id}} no longer [needs to be set in Executor.scala|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L79], as of [#4194|https://github.com/apache/spark/pull/4194]; it is set upstream in [SparkEnv|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/SparkEnv.scala#L332]. Might as well remove the redundant set() in Executor.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5419) Fix the logic in Vectors.sqdist
[ https://issues.apache.org/jira/browse/SPARK-5419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292898#comment-14292898 ] Apache Spark commented on SPARK-5419: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/4217 Fix the logic in Vectors.sqdist --- Key: SPARK-5419 URL: https://issues.apache.org/jira/browse/SPARK-5419 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Liang-Chi Hsieh The current implementation of sqdist tries to convert sparse vectors to dense if they are close to dense. This is not efficient because we need to allocate temp arrays. We should simply implement sqdist without allocating new memory. The current implementation also contains a bug in deciding whether to convert a sparse vector to dense: {code} v1.indices.length / v1.size < 0.5 {code} which should get removed with the changes described above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
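For intuition, a minimal allocation-free squared distance between a sparse and a dense vector of the same length (illustrative only, not the MLlib implementation; assumes the sparse indices are sorted ascending):

{code}
// Squared Euclidean distance without materializing a dense copy of the
// sparse side: walk the dense coordinates once, advancing a cursor into
// the sparse representation as its indices come up.
def sqdist(indices: Array[Int], values: Array[Double], dense: Array[Double]): Double = {
  var sum = 0.0
  var k = 0 // cursor into the sparse (indices, values) arrays
  var i = 0
  while (i < dense.length) {
    val sparseVal =
      if (k < indices.length && indices(k) == i) { val v = values(k); k += 1; v }
      else 0.0
    val diff = sparseVal - dense(i)
    sum += diff * diff
    i += 1
  }
  sum
}
{code}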
[jira] [Created] (SPARK-5422) Support sending to Graphite via UDP
Ryan Williams created SPARK-5422: Summary: Support sending to Graphite via UDP Key: SPARK-5422 URL: https://issues.apache.org/jira/browse/SPARK-5422 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor {{io.dropwizard.metrics-graphite}} version {{3.1.0}} can send metrics to Graphite via UDP or TCP. After upgrading ([SPARK-5413|https://issues.apache.org/jira/browse/SPARK-5413]), we should support using this facility, presumably specified via a protocol field in the metrics config file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
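For context, the library-level facility in question (class names are from the metrics-graphite 3.1.0 API linked above; how Spark's GraphiteSink would surface the protocol choice is exactly what this ticket leaves open):

{code}
import java.net.InetSocketAddress
import java.util.concurrent.TimeUnit
import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.graphite.{GraphiteReporter, GraphiteUDP}

object UdpMetricsSketch {
  def main(args: Array[String]): Unit = {
    val registry = new MetricRegistry()
    // UDP sender: fire-and-forget datagrams instead of a TCP connection.
    val graphite = new GraphiteUDP(new InetSocketAddress("graphite.example.com", 2003))
    val reporter = GraphiteReporter.forRegistry(registry)
      .convertRatesTo(TimeUnit.SECONDS)
      .convertDurationsTo(TimeUnit.MILLISECONDS)
      .build(graphite)
    reporter.start(10, TimeUnit.SECONDS)
  }
}
{code}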
[jira] [Updated] (SPARK-5384) Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths
[ https://issues.apache.org/jira/browse/SPARK-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5384: - Priority: Minor (was: Critical) Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths -- Key: SPARK-5384 URL: https://issues.apache.org/jira/browse/SPARK-5384 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.1 Environment: centos, others should be similar Reporter: yuhao yang Assignee: yuhao yang Priority: Minor Fix For: 1.3.0 Original Estimate: 24h Remaining Estimate: 24h For two vectors of different lengths, Vectors.sqdist returns different results depending on whether the vectors are represented as sparse or dense. Sample: {code} val s1 = new SparseVector(4, Array(0,1,2,3), Array(1.0, 2.0, 3.0, 4.0)) val s2 = new SparseVector(1, Array(0), Array(9.0)) val d1 = new DenseVector(Array(1.0, 2.0, 3.0, 4.0)) val d2 = new DenseVector(Array(9.0)) println(s1 == d1 && s2 == d2) println(Vectors.sqdist(s1, s2)) println(Vectors.sqdist(d1, d2)) {code} result: true 93.0 64.0 More precisely, for the extra part, Vectors.sqdist includes it for sparse vectors and excludes it for dense vectors. I'll send a PR and we can have more detailed discussion there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3439) Add Canopy Clustering Algorithm
[ https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292688#comment-14292688 ] Xiangrui Meng commented on SPARK-3439: -- [~angellandros] The public API and the complexity analysis are more important than implementation details. What is the expected input and output of the algorithm and how does it scale? Add Canopy Clustering Algorithm --- Key: SPARK-3439 URL: https://issues.apache.org/jira/browse/SPARK-3439 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yu Ishikawa Assignee: Muhammad-Ali A'rabi Priority: Minor The canopy clustering algorithm is an unsupervised pre-clustering algorithm. It is often used as a preprocessing step for the K-means algorithm or the Hierarchical clustering algorithm. It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical due to the size of the data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
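As a strawman answer to the input/output question, a possible API shape in MLlib's idiom (every name and parameter below is hypothetical, not a committed design):

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical model: canopy centers plus membership lookup.
class CanopyModel(val centers: Array[Vector]) extends Serializable {
  /** Indices of canopies whose center lies within distance t1 of the point. */
  def predict(point: Vector, t1: Double): Seq[Int] = ???
}

// Hypothetical estimator: input RDD[Vector], output CanopyModel.
class Canopy(var t1: Double, var t2: Double) {
  require(t1 > t2, "loose threshold t1 must exceed tight threshold t2")
  def run(data: RDD[Vector]): CanopyModel = ???
}
{code}

A single pass with distance checks against the current canopy set scales as O(N x #canopies), which is presumably the kind of complexity story the comment asks for.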
[jira] [Updated] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4587: - Assignee: Joseph K. Bradley Model export/import --- Key: SPARK-4587 URL: https://issues.apache.org/jira/browse/SPARK-4587 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Joseph K. Bradley Priority: Critical This is an umbrella JIRA for one of the most requested features on the user mailing list. Model export/import can be done via Java serialization, but that doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we should provide save/load methods for every model. PMML is an option, but it has its limitations. There are a couple of things we need to discuss: 1) data format, 2) how to preserve partitioning, 3) data compatibility between versions and language APIs, etc. UPDATE: [Design doc for model import/export | https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing] This document sketches machine learning model import/export plans, including goals, an API, and development plans. The design doc proposes: * Support our own Spark-specific format. ** This is needed to (a) support distributed models and (b) get model import/export support into Spark quickly (while avoiding new dependencies). * Also support PMML. ** This is needed since it is the only thing approaching an industry standard. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
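A minimal sketch of the Spark-specific save/load surface the design doc argues for (trait and method names are my assumption, not a committed API):

{code}
import org.apache.spark.SparkContext

trait Saveable {
  /** Persist this model under `path`, including any distributed parts. */
  def save(sc: SparkContext, path: String): Unit
}

trait Loader[M] {
  /** Reload a model previously written by `Saveable.save`. */
  def load(sc: SparkContext, path: String): M
}
{code}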
[jira] [Updated] (SPARK-1856) Standardize MLlib interfaces
[ https://issues.apache.org/jira/browse/SPARK-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1856: - Target Version/s: (was: 1.3.0) Standardize MLlib interfaces Key: SPARK-1856 URL: https://issues.apache.org/jira/browse/SPARK-1856 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Blocker Instead of expanding MLlib based on the current class naming scheme (ProblemWithAlgorithm), we should standardize MLlib's interfaces that clearly separate datasets, formulations, algorithms, parameter sets, and models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1856) Standardize MLlib interfaces
[ https://issues.apache.org/jira/browse/SPARK-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1856: - Priority: Critical (was: Blocker) Standardize MLlib interfaces Key: SPARK-1856 URL: https://issues.apache.org/jira/browse/SPARK-1856 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Instead of expanding MLlib based on the current class naming scheme (ProblemWithAlgorithm), we should standardize MLlib's interfaces that clearly separate datasets, formulations, algorithms, parameter sets, and models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1486) Support multi-model training in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1486: - Target Version/s: (was: 1.3.0) Support multi-model training in MLlib - Key: SPARK-1486 URL: https://issues.apache.org/jira/browse/SPARK-1486 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Burak Yavuz Priority: Critical It is rare in practice to train just one model with a given set of parameters. Usually, this is done by training multiple models with different sets of parameters and then select the best based on their performance on the validation set. MLlib should provide native support for multi-model training/scoring. It requires decoupling of concepts like problem, formulation, algorithm, parameter set, and model, which are missing in MLlib now. MLI implements similar concepts, which we can borrow. There are different approaches for multi-model training: 0) Keep one copy of the data, and train models one after another (or maybe in parallel, depending on the scheduler). 1) Keep one copy of the data, and train multiple models at the same time (similar to `runs` in KMeans). 2) Make multiple copies of the data (still stored distributively), and use more cores to distribute the work. 3) Collect the data, make the entire dataset available on workers, and train one or more models on each worker. Users should be able to choose which execution mode they want to use. Note that 3) could cover many use cases in practice when the training data is not huge, e.g., 1GB. This task will be divided into sub-tasks and this JIRA is created to discuss the design and track the overall progress. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3717: - Target Version/s: (was: 1.3.0) DecisionTree, RandomForest: Partition by feature Key: SPARK-3717 URL: https://issues.apache.org/jira/browse/SPARK-3717 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Each worker stores: ** a subset of columns (i.e., a subset of features). If a worker stores feature j, then the worker stores the feature value for all instances (i.e., the whole column). ** all labels * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: * Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification. 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle data and train subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. Comparing Partitioning Methods Partitioning features cost < partitioning instances cost when: * D * (M * 8 + N) < 2^D * M * B * C * 8 * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the right-hand side) * N < [ 2^D * M * B * C * 8 ] / D Example: many instances: * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5) * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
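The example's arithmetic, reproduced as a checkable snippet (8 is bytes per aggregated value, as in the ticket; paste into a Scala REPL):

{code}
def featureCost(levels: Int, m: Long, n: Long): Double =
  levels.toDouble * (m * 8 + n)        // D * (M * 8 + N)

def instanceCost(depth: Int, m: Long, b: Long, c: Long): Double =
  math.pow(2, depth) * m * b * c * 8   // 2^D * M * B * C * 8

// 2M instances, 3500 features, 100 bins, 5 classes, 6 levels (depth 5):
println(featureCost(6, 3500L, 2000000L))   // ~1.2e7
println(instanceCost(5, 3500L, 100L, 5L))  // ~4.5e8
{code}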
[jira] [Updated] (SPARK-5321) Add transpose() method to Matrix
[ https://issues.apache.org/jira/browse/SPARK-5321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5321: - Assignee: Burak Yavuz Add transpose() method to Matrix Key: SPARK-5321 URL: https://issues.apache.org/jira/browse/SPARK-5321 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Burak Yavuz Assignee: Burak Yavuz While we are working on BlockMatrix, it will be nice to add support for transposing matrices. .transpose() will just modify a private flag in local matrices. Operations that follow will be performed based on this flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
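A toy illustration of the flag-based transpose being described, using a hypothetical column-major matrix class (not the MLlib code): reads swap the row/column roles when the flag is set, so no data is ever copied.

{code}
// Column-major storage; a transposed view simply reads it as row-major.
class DenseMat(val numRows: Int, val numCols: Int, val values: Array[Double],
               val isTransposed: Boolean = false) {
  def apply(i: Int, j: Int): Double =
    if (!isTransposed) values(j * numRows + i)
    else values(i * numCols + j) // here numCols is the pre-transpose numRows

  def transpose: DenseMat =
    new DenseMat(numCols, numRows, values, !isTransposed) // O(1), no copy
}
{code}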
[jira] [Updated] (SPARK-5114) Should Evaluator be a PipelineStage
[ https://issues.apache.org/jira/browse/SPARK-5114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5114: - Summary: Should Evaluator be a PipelineStage (was: Should Evaluator by a PipelineStage) Should Evaluator be a PipelineStage --- Key: SPARK-5114 URL: https://issues.apache.org/jira/browse/SPARK-5114 Project: Spark Issue Type: Question Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Pipelines can currently contain Estimators and Transformers. Question for debate: Should Pipelines be able to contain Evaluators? Pros: * Evaluators take input datasets with particular schema, which should perhaps be checked before running a Pipeline. Cons: * Evaluators do not transform datasets. They produce a scalar (or a few values), which makes it hard to say how they fit into a Pipeline or a PipelineModel. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5094) Python API for gradient-boosted trees
[ https://issues.apache.org/jira/browse/SPARK-5094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5094: - Assignee: Kazuki Taniguchi Python API for gradient-boosted trees - Key: SPARK-5094 URL: https://issues.apache.org/jira/browse/SPARK-5094 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Kazuki Taniguchi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5413) Upgrade metrics dependency to 3.1.0
[ https://issues.apache.org/jira/browse/SPARK-5413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams updated SPARK-5413: - Description: Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like [batching metrics in TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and [supporting Graphite's UDP interface|https://github.com/dropwizard/metrics/blob/v3.1.0/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteUDP.java]. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. was: Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like [batching metrics in TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and supporting Graphite's UDP interface. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. Upgrade metrics dependency to 3.1.0 - Key: SPARK-5413 URL: https://issues.apache.org/jira/browse/SPARK-5413 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like [batching metrics in TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and [supporting Graphite's UDP interface|https://github.com/dropwizard/metrics/blob/v3.1.0/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteUDP.java]. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5413) Upgrade metrics dependency to 3.1.0
[ https://issues.apache.org/jira/browse/SPARK-5413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams updated SPARK-5413: - Description: Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like [batching metrics in TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and supporting Graphite's UDP interface. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. was: Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like batching metrics in TCP and supporting Graphite's UDP interface. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. Upgrade metrics dependency to 3.1.0 - Key: SPARK-5413 URL: https://issues.apache.org/jira/browse/SPARK-5413 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like [batching metrics in TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and supporting Graphite's UDP interface. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5413) Upgrade metrics dependency to 3.1.0
Ryan Williams created SPARK-5413: Summary: Upgrade metrics dependency to 3.1.0 Key: SPARK-5413 URL: https://issues.apache.org/jira/browse/SPARK-5413 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like batching metrics in TCP and supporting Graphite's UDP interface. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292718#comment-14292718 ] Xiangrui Meng commented on SPARK-4846: -- [~josephtang] Are you working on this issue? If not, do you mind me sending a PR that throws an exception if vectorSize is beyond the limit? When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1, 1.2.0 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Assignee: Joseph Tang Priority: Minor Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5414) Add SparkListener implementation that allows users to receive all listener events in one method
[ https://issues.apache.org/jira/browse/SPARK-5414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5414: -- Component/s: Spark Core Add SparkListener implementation that allows users to receive all listener events in one method --- Key: SPARK-5414 URL: https://issues.apache.org/jira/browse/SPARK-5414 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Currently, users don't have a very good way to write a SparkListener that receives all SparkListener events and which will be future-compatible (e.g. it will receive events introduced in newer versions of Spark without having to override new methods to process those events). Therefore, I think Spark should include a concrete SparkListener implementation that implements all of the message-handling methods and dispatches all of them to a single {{onEvent}} method. By putting this code in Spark, we isolate users from changes to the listener API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
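A hedged sketch of such an adapter (a few representative callbacks only; the real SparkListener has many more, and the concrete class Spark eventually ships may differ):

{code}
import org.apache.spark.scheduler._

// Every callback funnels into one method, so user code written against
// onEvent keeps working when new event types (and callbacks) appear.
abstract class FirehoseListener extends SparkListener {
  def onEvent(event: Any): Unit

  override def onStageCompleted(e: SparkListenerStageCompleted): Unit = onEvent(e)
  override def onTaskEnd(e: SparkListenerTaskEnd): Unit = onEvent(e)
  override def onJobStart(e: SparkListenerJobStart): Unit = onEvent(e)
  // ... one forwarding override per SparkListener callback ...
}
{code}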
[jira] [Created] (SPARK-5415) Upgrade sbt to 0.13.7
Ryan Williams created SPARK-5415: Summary: Upgrade sbt to 0.13.7 Key: SPARK-5415 URL: https://issues.apache.org/jira/browse/SPARK-5415 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor Spark currently uses sbt {{0.13.6}}, which has a regression related to processing parent POMs in Maven projects. {{0.13.7}} does not have this issue (though it's unclear whether it was fixed intentionally), so I'd like to bump up one version. I ran into this while locally building a Spark assembly against a locally built metrics JAR dependency; {{0.13.6}} could not build Spark but {{0.13.7}} worked fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5261) In some cases, the value of a word's vector representation is too big
[ https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292750#comment-14292750 ] Kai Sasaki commented on SPARK-5261: --- [~gq] Could you provide us with the data set? I tried several numbers of partitions but could not reproduce it. In some cases, the value of a word's vector representation is too big --- Key: SPARK-5261 URL: https://issues.apache.org/jira/browse/SPARK-5261 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Guoqiang Li {code} val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36) {code} The average absolute value of the word's vector representation is 60731.8 {code} val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(1) {code} The average absolute value of the word's vector representation is 0.13889 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5409) Broken link in documentation
[ https://issues.apache.org/jira/browse/SPARK-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5409. -- Resolution: Duplicate Actually this was already fixed Broken link in documentation Key: SPARK-5409 URL: https://issues.apache.org/jira/browse/SPARK-5409 Project: Spark Issue Type: Documentation Reporter: Mauro Pirrone Priority: Minor https://spark.apache.org/docs/1.2.0/streaming-kafka-integration.html See the API docs and the example. Link to example is broken. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion
[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292769#comment-14292769 ] Mark Khaitman commented on SPARK-5395: -- This may prove to be useful... I'm watching a currently running spark-submitted job while watching the pyspark.daemon processes. The framework is permitted to use only 8 cores on each node, with the default Python worker memory of 512 MB per node (not the executor memory, which is set higher than this). Ignoring the exact RDD actions for a moment, it looks like, while transitioning from Stage 1 to Stage 2, it spawned 8-10 additional pyspark.daemon processes, making the box use more cores than it was even allowed to... A few seconds after that, the other 8 processes entered a sleeping state while still holding onto the physical memory they ate up in Stage 1. As soon as Stage 2 finished, practically all of the pyspark.daemons vanished and freed up the memory. I was keeping an eye on 2 random nodes and the exact same thing occurred on both. It was also the only executing job at the time, so there was really no other interference/contention for resources. I will try to provide a bit more detail on the exact transformations/actions occurring between the 2 stages, although I know at the very least a partitionBy and a cogroup are occurring, without having inspected the spark-submitted code directly. Large number of Python workers causing resource depletion - Key: SPARK-5395 URL: https://issues.apache.org/jira/browse/SPARK-5395 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: AWS ElasticMapReduce Reporter: Sven Krasser During job execution a large number of Python workers accumulates, eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G for overhead per container). In this instance, 97 pyspark.daemon processes had accumulated at the time the container was killed. {noformat} 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_30] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container. Dump of the process-tree for container_1421692415636_0052_01_30 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon [...] {noformat} The configuration uses 64 containers with 2 cores each. Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c Mailing list discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5419) Fix the logic in Vectors.sqdist
Xiangrui Meng created SPARK-5419: Summary: Fix the logic in Vectors.sqdist Key: SPARK-5419 URL: https://issues.apache.org/jira/browse/SPARK-5419 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Liang-Chi Hsieh The current implementation of sqdist tries to convert sparse vectors to dense if they are close to dense. This is not efficient because we need to allocate temp arrays. We should simply implement sqdist without allocating new memory. The current implementation also contains a bug in deciding whether to convert a sparse vector to dense: {code} v1.indices.length / v1.size < 0.5 {code} which should get removed with the changes described above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
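(Note that with Int operands the quotient above is 0 whenever the vector has any inactive entries, so the guard is effectively always true; that appears to be the bug being referenced.) For illustration, a minimal sketch of the merge-pass approach the description calls for, computing the squared distance directly over the two sorted index arrays without allocating dense copies. This is a sketch of the idea, not the eventual MLlib patch:

{code}
// Assumes sorted, deduplicated index arrays, as in MLlib's SparseVector.
def sqdistSparse(i1: Array[Int], v1: Array[Double],
                 i2: Array[Int], v2: Array[Double]): Double = {
  var p1 = 0
  var p2 = 0
  var sum = 0.0
  while (p1 < i1.length && p2 < i2.length) {
    if (i1(p1) == i2(p2)) {         // index active in both vectors
      val d = v1(p1) - v2(p2); sum += d * d; p1 += 1; p2 += 1
    } else if (i1(p1) < i2(p2)) {   // active only in the first vector
      sum += v1(p1) * v1(p1); p1 += 1
    } else {                        // active only in the second vector
      sum += v2(p2) * v2(p2); p2 += 1
    }
  }
  while (p1 < i1.length) { sum += v1(p1) * v1(p1); p1 += 1 }
  while (p2 < i2.length) { sum += v2(p2) * v2(p2); p2 += 1 }
  sum
}
{code}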
[jira] [Created] (SPARK-5421) SparkSql throws OOM at shuffle
Hong Shen created SPARK-5421: Summary: SparkSql throws OOM at shuffle Key: SPARK-5421 URL: https://issues.apache.org/jira/browse/SPARK-5421 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap is used only for Spark jobs whose aggregator isDefined, but Spark SQL's ShuffledRDD doesn't define an aggregator, so Spark SQL won't spill at shuffle, making it very easy to throw an OOM at shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
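Until spilling is wired into that path, one possible workaround (a sketch, not a fix for the underlying issue) is to shrink each reduce partition so it fits in memory by raising the shuffle partition count; the value below is illustrative and the right number depends on data volume:

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`
// More partitions means less shuffled data per partition, lowering OOM risk
// on the non-spilling SQL shuffle path described above. The default is 200.
sqlContext.setConf("spark.sql.shuffle.partitions", "800")
{code}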
[jira] [Resolved] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3726. -- Resolution: Fixed Fix Version/s: (was: 1.2.0) 1.3.0 Issue resolved by pull request 4073 [https://github.com/apache/spark/pull/4073] RandomForest: Support for bootstrap options --- Key: SPARK-3726 URL: https://issues.apache.org/jira/browse/SPARK-3726 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Manoj Kumar Priority: Minor Fix For: 1.3.0 RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. The expected size of each sample is the same as the original data (sampling rate = 1.0), and sampling is done with replacement. Adding support for other sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
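A sketch of what generalized bagging weights could look like: with replacement, a point's multiplicity in an expected-size rate*n bootstrap sample can be drawn from Poisson(rate); without replacement, from Bernoulli(rate). This illustrates the sampling math under those assumptions, not the merged patch itself:

{code}
import org.apache.commons.math3.distribution.PoissonDistribution
import scala.util.Random

// Per-point sample weights for one bagged sample over `numPoints` points.
def sampleWeights(numPoints: Int, subsamplingRate: Double,
                  withReplacement: Boolean, seed: Long): Array[Double] = {
  if (withReplacement) {
    val poisson = new PoissonDistribution(subsamplingRate)
    poisson.reseedRandomGenerator(seed)
    Array.fill(numPoints)(poisson.sample().toDouble)  // multiplicity of each point
  } else {
    val rng = new Random(seed)
    Array.fill(numPoints)(if (rng.nextDouble() < subsamplingRate) 1.0 else 0.0)
  }
}
{code}

The original bootstrap behavior falls out as the special case subsamplingRate = 1.0 with replacement.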
[jira] [Commented] (SPARK-5267) Add a streaming module to ingest Apache Camel Messages from configured endpoints
[ https://issues.apache.org/jira/browse/SPARK-5267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293043#comment-14293043 ] Tathagata Das commented on SPARK-5267: -- Hey, this is a great initiative! However, we are trying to limit the number of external dependencies in the Spark umbrella project for better manageability. So while we want to add more functionality and integration with other systems, we are very cautious about adding these dependencies into the Spark project. A better place for such contributions is http://spark-packages.org/, where people contribute such functionality and maintain it on their own. I strongly encourage you to add your Camel integration to spark-packages. Add a streaming module to ingest Apache Camel Messages from configured endpoints -- Key: SPARK-5267 URL: https://issues.apache.org/jira/browse/SPARK-5267 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Steve Brewin Labels: features Original Estimate: 120h Remaining Estimate: 120h The number of input stream protocols supported by Spark Streaming is quite limited, which constrains the number of systems with which it can be integrated. This proposal solves the problem by adding an optional module that integrates Apache Camel, which supports many additional input protocols. Our tried and tested implementation of this proposal is spark-streaming-camel. An Apache Camel service is run on a separate thread, consuming each Message (http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/Message.html) and storing it into Spark's memory. The provider of the Message is specified by any consuming component URI documented at http://camel.apache.org/components.html, making all of these protocols available to Spark Streaming. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
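For readers unfamiliar with the proposal's shape, here is a hypothetical sketch of such an integration using Spark Streaming's public Receiver API and Camel's core API. The class name and endpoint URI are illustrative; the actual spark-streaming-camel implementation is not shown in this ticket:

{code}
import org.apache.camel.{Exchange, Processor}
import org.apache.camel.builder.RouteBuilder
import org.apache.camel.impl.DefaultCamelContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Runs a Camel context alongside the receiver and stores each message body
// into Spark's memory, as the proposal describes.
class CamelReceiver(endpointUri: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  @transient private var camel: DefaultCamelContext = _

  def onStart(): Unit = {
    camel = new DefaultCamelContext()
    camel.addRoutes(new RouteBuilder {
      def configure(): Unit =
        from(endpointUri).process(new Processor {
          def process(exchange: Exchange): Unit =
            store(exchange.getIn.getBody(classOf[String]))  // hand message to Spark
        })
    })
    camel.start()  // Camel manages its own consumer threads
  }

  def onStop(): Unit = if (camel != null) camel.stop()
}
{code}

Wired up with something like ssc.receiverStream(new CamelReceiver("jetty:http://0.0.0.0:8080/inbox")), any Camel consumer component URI would become a DStream source.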
[jira] [Commented] (SPARK-4964) Exactly-once + WAL-free Kafka Support in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293140#comment-14293140 ] Tathagata Das commented on SPARK-4964: -- [~dibbhatt][~jerryshao][~hshreedharan][~c...@koeninger.org] Please take a look at the design doc and comment on it. Thank you very much! Exactly-once + WAL-free Kafka Support in Spark Streaming Key: SPARK-4964 URL: https://issues.apache.org/jira/browse/SPARK-4964 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Cody Koeninger There are two issues with the current Kafka support - Use of Write Ahead Logs in Spark Streaming to ensure no data is lost - Causes data replication in both Kafka AND Spark Streaming. - Lack of exactly-once semantics - For background, see http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html We want to solve both these problems in this JIRA. Please see the following design doc for the solution. https://docs.google.com/a/databricks.com/document/d/1IuvZhg9cOueTf1mq4qwc1fhPb5FVcaRLcyjrtG4XU1k/edit#heading=h.itproy77j3p -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4964) Exactly-once + WAL-free Kafka Support in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-4964: - Summary: Exactly-once + WAL-free Kafka Support in Spark Streaming (was: Exactly-once semantics for Kafka) Exactly-once + WAL-free Kafka Support in Spark Streaming Key: SPARK-4964 URL: https://issues.apache.org/jira/browse/SPARK-4964 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Cody Koeninger For background, see http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html Requirements: - allow client code to implement exactly-once end-to-end semantics for Kafka messages, in cases where their output storage is either idempotent or transactional - allow client code access to Kafka offsets, rather than automatically committing them - do not assume Zookeeper as a repository for offsets (for the transactional case, offsets need to be stored in the same store as the data) - allow failure recovery without lost or duplicated messages, even in cases where a checkpoint cannot be restored (for instance, because code must be updated) Design: The basic idea is to make an RDD where each partition corresponds to a given Kafka topic, partition, starting offset, and ending offset. That allows for deterministic replay of data from Kafka (as long as there is enough log retention). Client code is responsible for committing offsets, either transactionally to the same store that data is being written to, or in the case of idempotent data, after data has been written. A PR of a sample implementation for both the batch and dstream cases is forthcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
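A conceptual sketch of that design (the names below are hypothetical, not the eventual Spark API): pin each RDD partition to an exact offset range so its contents are deterministic and replayable, then commit the partition's results and its ending offset in one transaction:

{code}
case class OffsetRange(topic: String, partition: Int,
                       fromOffset: Long, untilOffset: Long)

// Transactional output pattern: results and offsets land atomically, so a
// replayed partition either finds its offsets already committed (skip) or
// redoes both writes together. Table and column names are illustrative.
def saveExactlyOnce(results: Iterator[(String, Int)], range: OffsetRange,
                    db: java.sql.Connection): Unit = {
  db.setAutoCommit(false)
  try {
    val insert = db.prepareStatement("INSERT INTO word_counts(word, cnt) VALUES (?, ?)")
    results.foreach { case (word, cnt) =>
      insert.setString(1, word); insert.setInt(2, cnt); insert.executeUpdate()
    }
    val offsets = db.prepareStatement(
      "UPDATE kafka_offsets SET until_offset = ? WHERE topic = ? AND part = ?")
    offsets.setLong(1, range.untilOffset)
    offsets.setString(2, range.topic)
    offsets.setInt(3, range.partition)
    offsets.executeUpdate()
    db.commit()  // data and offsets become visible atomically
  } catch {
    case e: Exception => db.rollback(); throw e
  }
}
{code}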
[jira] [Commented] (SPARK-4964) Exactly-once semantics for Kafka
[ https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293129#comment-14293129 ] Tathagata Das commented on SPARK-4964: -- I am renaming this JIRA to Native Kafka Support because there are two problems that we are trying to solve, which get solved by the associated PR. Exactly-once semantics for Kafka Key: SPARK-4964 URL: https://issues.apache.org/jira/browse/SPARK-4964 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Cody Koeninger For background, see http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html Requirements: - allow client code to implement exactly-once end-to-end semantics for Kafka messages, in cases where their output storage is either idempotent or transactional - allow client code access to Kafka offsets, rather than automatically committing them - do not assume Zookeeper as a repository for offsets (for the transactional case, offsets need to be stored in the same store as the data) - allow failure recovery without lost or duplicated messages, even in cases where a checkpoint cannot be restored (for instance, because code must be updated) Design: The basic idea is to make an RDD where each partition corresponds to a given Kafka topic, partition, starting offset, and ending offset. That allows for deterministic replay of data from Kafka (as long as there is enough log retention). Client code is responsible for committing offsets, either transactionally to the same store that data is being written to, or in the case of idempotent data, after data has been written. A PR of a sample implementation for both the batch and dstream cases is forthcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4964) Exactly-once + WAL-free Kafka Support in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-4964: - Description: There are two issues with the current Kafka support - Use of Write Ahead Logs in Spark Streaming to ensure no data is lost - Causes data replication in both Kafka AND Spark Streaming. - Lack of exactly-once semantics - For background, see http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html We want to solve both these problems in this JIRA. Please see the following design doc for the solution. https://docs.google.com/a/databricks.com/document/d/1IuvZhg9cOueTf1mq4qwc1fhPb5FVcaRLcyjrtG4XU1k/edit#heading=h.itproy77j3p was: for background, see http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html There ar Exactly-once + WAL-free Kafka Support in Spark Streaming Key: SPARK-4964 URL: https://issues.apache.org/jira/browse/SPARK-4964 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Cody Koeninger There are two issues with the current Kafka support - Use of Write Ahead Logs in Spark Streaming to ensure no data is lost - Causes data replication in both Kafka AND Spark Streaming. - Lack of exactly-once semantics - For background, see http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html We want to solve both these problems in this JIRA. Please see the following design doc for the solution. https://docs.google.com/a/databricks.com/document/d/1IuvZhg9cOueTf1mq4qwc1fhPb5FVcaRLcyjrtG4XU1k/edit#heading=h.itproy77j3p -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org