[jira] [Commented] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341005#comment-14341005 ]

Corey J. Nolet commented on SPARK-5140:
---------------------------------------

Wanted to mention: I did manage to get concurrency working by walking down the RDD DAG and firing off futures for the concurrent work. It required some sleeping so that threads could wait while parents might still be doing their work. At each node I check for multiple children and, if multiple children exist, I cache that node automatically.

> Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-5140
>                 URL: https://issues.apache.org/jira/browse/SPARK-5140
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Corey J. Nolet
>              Labels: features
>
> Not sure if this would change too much of the internals to be included in 1.2.1, but it would be very helpful if it could be. This ticket is from a discussion between myself and [~ilikerps]. Here's the result of some testing that [~ilikerps] did:
> bq. I did some testing as well, and it turns out the "wait for the other guy to finish caching" logic is on a per-task basis, and it only works on tasks that happen to be executing on the same machine.
> bq. Once a partition is cached, we will schedule tasks that touch that partition on that executor. The problem here, though, is that the cache is in progress, and so the tasks are still scheduled randomly (or with whatever locality the data source has), so tasks which end up on different machines will not see that the cache is already in progress.
> Here was my test, by the way:
> {code}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent._
> import scala.concurrent.duration._
>
> val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache()
> val futures = (0 until 4).map { _ => Future { rdd.count } }
> Await.result(Future.sequence(futures), 120.seconds)
> {code}
> bq. Note that I ran the future 4 times in parallel. I found that the first run has all tasks take 10 seconds. The second has about 50% of its tasks take 10 seconds, and the rest just wait for the first stage to finish. The last two runs have no tasks that take 10 seconds; all wait for the first two stages to finish.
> What we want is the ability to fire off a job and have the DAG figure out that two RDDs depend on the same parent, so that when the children are scheduled concurrently, the first one to start will activate the parent and both will wait on the parent. When the parent is done, they will both be able to finish their work concurrently. We are trying to use this pattern by having the parent cache results.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
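The DAG-walking workaround described in the comment above can be sketched with a toy DAG. This is illustrative standalone Scala, not Spark API; `Node` and `sharedParents` are hypothetical names. The walk starts at the sinks, records each parent's distinct children, and flags any node with more than one child as a caching candidate:

```scala
import scala.collection.mutable

// Toy stand-in for an RDD lineage node: an id plus its parent nodes.
final case class Node(id: String, parents: List[Node])

// Walk the DAG upward from the sinks, recording the distinct children of
// each node; any node with more than one child is a caching candidate.
def sharedParents(sinks: List[Node]): Set[String] = {
  val children = mutable.Map.empty[String, mutable.Set[String]]
  def visit(n: Node): Unit = n.parents.foreach { p =>
    children.getOrElseUpdate(p.id, mutable.Set.empty[String]) += n.id
    visit(p)
  }
  sinks.foreach(visit)
  children.collect { case (id, cs) if cs.size > 1 => id }.toSet
}
```

In a diamond-shaped lineage (two sinks sharing one parent), only the shared parent is flagged; a node's grandparents with a single child each are not, because the child sets deduplicate repeated traversals.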
[jira] [Commented] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object
[ https://issues.apache.org/jira/browse/SPARK-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337540#comment-14337540 ]

Corey J. Nolet commented on SPARK-4320:
---------------------------------------

Sorry, this ticket should have been closed a while ago. I'll go ahead and close it now.

> JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-4320
>                 URL: https://issues.apache.org/jira/browse/SPARK-4320
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output, Spark Core
>            Reporter: Corey J. Nolet
>
> I am outputting data to Accumulo using a custom OutputFormat. I have tried using saveAsNewHadoopFile(), and that works, though passing an empty path is a bit weird. Given that it isn't really a file I'm storing, but rather a generic Pair dataset, I'd be inclined to use the saveAsHadoopDataset() method, though I'm not at all interested in using the legacy mapred API. Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think there should be two ways of calling into this method. Instead of forcing the user to always set up the Job object explicitly, I'm in the camp of having the following method signature: saveAsNewHadoopDataset(keyClass: Class[K], valueClass: Class[V], ofclass: Class[_ <: OutputFormat], conf: Configuration). This way, if I'm writing Spark jobs that go from Hadoop back into Hadoop, I can construct my Configuration once. Perhaps an overloaded method signature could be: saveAsNewHadoopDataset(job: Job)
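The two overloads proposed in the ticket can be sketched as a trait. This is a hypothetical API, not actual Spark code; the `Job`, `Configuration`, and `OutputFormat` types are stubbed so the sketch is self-contained:

```scala
// Stub types standing in for the Hadoop mapreduce classes; illustrative only.
trait Job
trait Configuration
trait OutputFormat[K, V]

// Hypothetical trait sketching the two proposed saveAsNewHadoopDataset overloads.
trait NewHadoopDatasetSupport[K, V] {
  // Variant 1: classes plus a Configuration the caller builds once and
  // reuses across jobs that read from and write back to Hadoop.
  def saveAsNewHadoopDataset(
      keyClass: Class[K],
      valueClass: Class[V],
      outputFormatClass: Class[_ <: OutputFormat[K, V]],
      conf: Configuration): Unit

  // Variant 2: a fully configured Job object.
  def saveAsNewHadoopDataset(job: Job): Unit
}
```

The first variant avoids constructing a `Job` just to carry a `Configuration`; the second mirrors the legacy `saveAsHadoopDataset(JobConf)` shape for the new API.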
[jira] [Closed] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object
[ https://issues.apache.org/jira/browse/SPARK-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Corey J. Nolet closed SPARK-4320.
---------------------------------
          Resolution: Won't Fix
    Target Version/s: 1.2.1, 1.1.2  (was: 1.1.2, 1.2.1)

> JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-4320
>                 URL: https://issues.apache.org/jira/browse/SPARK-4320
[jira] [Comment Edited] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335518#comment-14335518 ]

Corey J. Nolet edited comment on SPARK-5140 at 2/24/15 9:50 PM:
----------------------------------------------------------------

I have a framework (similar to Cascading) where users wire together their RDD transformations in reusable components, and the framework wires them up based on the dependencies that they define. There is a notion of sinks, or data outputs, which are also encapsulated in reusable componentry. A sink depends on sources; the sinks are actually what get executed. I wanted to be able to execute them in parallel and just have Spark be smart enough to figure out what needs to block (whenever there's a cached RDD) in order to make that possible. We're comparing against the same data transformations written as MapReduce in JAQL, and the single-threaded execution of the sinks is causing absolutely wretched run times.

I'm not saying an internal requirement from my framework should necessarily be levied against Spark's features; however, this change (which seems like it would only affect driver-side scheduling) would allow the SparkContext to be truly thread-safe. I've tried running jobs that over-utilize the resources in parallel until one of them fully caches, and it really just slows things down and uses too many resources. I can't see why anyone submitting RDDs in different threads that depend on cached RDDs wouldn't want those threads to block for the parents, but maybe I'm not thinking abstractly enough. For now, I'm going to propagate down my data source tree breadth-first and no-op those sources that return cached RDDs so that their children can be scheduled in different threads.

> Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-5140
>                 URL: https://issues.apache.org/jira/browse/SPARK-5140
[jira] [Commented] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335518#comment-14335518 ]

Corey J. Nolet commented on SPARK-5140:
---------------------------------------

I have a framework (similar to Cascading) where users wire together their RDD transformations in reusable components, and the framework wires them up based on the dependencies that they define. There is a notion of sinks, or data outputs, which are also encapsulated in reusable componentry. A sink depends on sources; the sinks are actually what get executed. I wanted to be able to execute them in parallel and just have Spark be smart enough to figure out what needs to block (whenever there's a cached RDD) in order to make that possible. We're comparing against the same data transformations written as MapReduce in JAQL, and the single-threaded execution of the sinks is causing absolutely wretched run times.

I'm not saying an internal requirement from my framework should necessarily be levied against Spark's features; however, this change (which seems like it would only affect driver-side scheduling) would allow the SparkContext to be truly thread-safe. I've tried running jobs that over-utilize the resources in parallel until one of them fully caches, and it really just slows things down and uses too many resources. I can't see why anyone submitting RDDs in different threads that depend on cached RDDs wouldn't want those threads to block for the parents, but maybe I'm not thinking abstractly enough. For now, I'm going to propagate down my data source tree breadth-first and no-op those sources that return cached RDDs so that their children can be scheduled in different threads.

> Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-5140
>                 URL: https://issues.apache.org/jira/browse/SPARK-5140
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Corey J. Nolet
>              Labels: features
[jira] [Commented] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304510#comment-14304510 ]

Corey J. Nolet commented on SPARK-5140:
---------------------------------------

I think the problem is that when actions are performed on RDDs in multiple threads, the SparkContext on the driver that's scheduling the DAG should be able to see that the two RDDs depend on the same parents and synchronize them so that only one will run at a time, whether it's being cached or not (you'd assume the parent would be getting cached, but I think this change would still affect cases where it hasn't been). The fact that I did:

{code}
val rdd1 = input data -> transform data -> groupBy -> etc... -> cache
val rdd2 = future { rdd1.transform.groupBy.saveAsSequenceFile() }
val rdd3 = future { rdd1.transform.groupBy.saveAsSequenceFile() }
{code}

has unexpected results, in that rdd1 was assigned an id and then run completely separately for rdd2 and rdd3. I would have expected that, whether cached or not, when run in separate threads, rdd1 would have been assigned an id, then rdd2 would have caused it to begin running through its stages, and rdd3 would have paused because it's waiting on rdd1's id to complete its stages. What I see instead is that, after rdd2 and rdd3 both run concurrently calculating rdd1, the storage for rdd1 shows 200% cached. This causes issues when I have 50 or so RDDs calling saveAsSequenceFile() that all have different shared dependencies on parent RDDs (which may not always be known at creation time without introspecting them in my own tree). Now I've basically got to do the scheduling myself: I've got to determine what depends on what and run things concurrently myself. It seems like the DAG should know this already and be able to make use of it.

> Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-5140
>                 URL: https://issues.apache.org/jira/browse/SPARK-5140
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Corey J. Nolet
>              Labels: features
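The behavior the comment asks for (concurrent children blocking on a single run of a shared parent instead of each recomputing it) can be sketched without Spark by memoizing the parent computation in a lazily initialized Future. Everything below is an illustrative stand-in, not Spark API:

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

val parentRuns = new AtomicInteger(0)

// The shared parent computation; the lazy val guarantees it is started at
// most once, no matter how many children reference it concurrently.
lazy val parent: Future[Vector[Int]] = Future {
  parentRuns.incrementAndGet()
  Thread.sleep(100) // stands in for the expensive parent stage
  (0 until 8).toVector
}

// A child job just maps over the parent's (single) result.
def child(f: Vector[Int] => Int): Future[Int] = parent.map(f)

// Two children scheduled concurrently share one parent run:
// parentRuns ends up at 1 rather than 2.
val results = Await.result(Future.sequence(Seq(child(_.sum), child(_.max))), 10.seconds)
```

With this structure the second child never triggers a duplicate parent run, which is the driver-side synchronization the ticket wants the DAG scheduler to perform automatically.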
[jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304577#comment-14304577 ]

Corey J. Nolet commented on SPARK-5260:
---------------------------------------

I'm thinking all the schema-specific functions should be pulled out into an object called JsonSchemaFunctions. The allKeysWithValueTypes() and createSchema() functions should be exposed via the public API and commented well based on their use. For the project I have that's using these functions, I'm actually running allKeysWithValueTypes() over my entire RDD as it's being saved to a sequence file, and I'm using an Accumulator[Set[(String, DataType)]] that aggregates all the schema elements for the RDD into a final Set, from which I can store off the schema and later call createSchema() to get the final StructType that can be used with the SQL table. I also had to write an isConflicted(Set[(String, DataType)]) function to determine whether a JSON object or JSON array was also encountered as a primitive type in one of the records in the RDD, or vice versa.

> Expose JsonRDD.allKeysWithValueTypes() in a utility class
> ---------------------------------------------------------
>
>                 Key: SPARK-5260
>                 URL: https://issues.apache.org/jira/browse/SPARK-5260
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Corey J. Nolet
>            Assignee: Corey J. Nolet
>
> I have found this method extremely useful when implementing my own strategy for inferring a schema from parsed json. For now, I've actually copied the method right out of the JsonRDD class into my own project, but I think it would be immensely useful to keep the code in Spark and expose it publicly somewhere else, like an object called JsonSchema.
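The conflict check described in the comment can be sketched as follows. The `DataType` hierarchy is stubbed here (the real code would use Spark SQL's types), and `isConflicted` is the commenter's own helper, reconstructed from the description: a key conflicts if some records saw it as a complex type (JSON object or array) and others saw it as a primitive.

```scala
// Stubbed stand-ins for Spark SQL data types; illustrative only.
sealed trait DataType
case object StringType extends DataType
case object IntegerType extends DataType
case object StructType extends DataType // stands in for a JSON object
case object ArrayType extends DataType  // stands in for a JSON array

// A key conflicts when it was observed both as a complex type and as a
// primitive across the records of the RDD.
def isConflicted(schema: Set[(String, DataType)]): Boolean =
  schema.groupBy(_._1).values.exists { entries =>
    val types = entries.map(_._2)
    val complex = types.exists(t => t == StructType || t == ArrayType)
    val primitive = types.exists(t => t != StructType && t != ArrayType)
    complex && primitive
  }
```

Because the accumulated `Set[(String, DataType)]` keeps one entry per distinct (key, type) pair, the check reduces to grouping by key and looking for a mixed complex/primitive group.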
[jira] [Commented] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303548#comment-14303548 ]

Corey J. Nolet commented on SPARK-5140:
---------------------------------------

Is anyone against this behavior for any reason?

> Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-5140
>                 URL: https://issues.apache.org/jira/browse/SPARK-5140
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Corey J. Nolet
>              Labels: features
>             Fix For: 1.3.0, 1.2.1
[jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288621#comment-14288621 ]

Corey J. Nolet commented on SPARK-5260:
---------------------------------------

May I be added to the proper list so that I can assign this ticket to myself?

> Expose JsonRDD.allKeysWithValueTypes() in a utility class
> ---------------------------------------------------------
>
>                 Key: SPARK-5260
>                 URL: https://issues.apache.org/jira/browse/SPARK-5260
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Corey J. Nolet
[jira] [Created] (SPARK-5305) Using a field in a WHERE clause that is not in the schema does not throw an exception.
Corey J. Nolet created SPARK-5305:
-------------------------------------

             Summary: Using a field in a WHERE clause that is not in the schema does not throw an exception.
                 Key: SPARK-5305
                 URL: https://issues.apache.org/jira/browse/SPARK-5305
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Corey J. Nolet

Given a schema:

key1 = String
key2 = Integer

The following SQL statement doesn't seem to throw an exception:

{code}
SELECT * FROM myTable WHERE doesntExist = 'val1'
{code}
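The behavior the ticket expects can be reduced to a simple analysis-time check (a hypothetical helper, not Spark code): resolve every column the WHERE predicate references against the table's schema and fail fast on unknown names.

```scala
// Hypothetical pre-check: every column referenced by the WHERE clause must
// exist in the table's schema, otherwise analysis should fail with an error.
def validateWhereColumns(schema: Set[String], referenced: Set[String]): Unit = {
  val unknown = referenced.diff(schema)
  require(unknown.isEmpty,
    s"Unknown column(s) in WHERE clause: ${unknown.mkString(", ")}")
}

val schema = Set("key1", "key2")
validateWhereColumns(schema, Set("key1")) // resolves fine
// validateWhereColumns(schema, Set("doesntExist")) would throw,
// which is the exception the ticket says is currently missing.
```

In the ticket's example, `doesntExist` would be caught here instead of the query silently returning no rows.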
[jira] [Updated] (SPARK-5306) Support for a NotEqualsFilter in the filter PrunedFilteredScan pushdown
[ https://issues.apache.org/jira/browse/SPARK-5306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Corey J. Nolet updated SPARK-5306:
----------------------------------
    Component/s: SQL

> Support for a NotEqualsFilter in the filter PrunedFilteredScan pushdown
> -----------------------------------------------------------------------
>
>                 Key: SPARK-5306
>                 URL: https://issues.apache.org/jira/browse/SPARK-5306
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Corey J. Nolet
>
> This would be a pretty significant addition to the Filters that get pushed down.
[jira] [Created] (SPARK-5306) Support for a NotEqualsFilter in the filter PrunedFilteredScan pushdown
Corey J. Nolet created SPARK-5306:
-------------------------------------

             Summary: Support for a NotEqualsFilter in the filter PrunedFilteredScan pushdown
                 Key: SPARK-5306
                 URL: https://issues.apache.org/jira/browse/SPARK-5306
             Project: Spark
          Issue Type: Improvement
            Reporter: Corey J. Nolet

This would be a pretty significant addition to the Filters that get pushed down.
[jira] [Commented] (SPARK-5296) Predicate Pushdown (BaseRelation) to have an interface that will accept OR filters
[ https://issues.apache.org/jira/browse/SPARK-5296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281621#comment-14281621 ]

Corey J. Nolet commented on SPARK-5296:
---------------------------------------

The more I think about this, it would be nice if a tree were pushed down for the filters instead of an Array. This is a significant change to the API, so it would still probably be easiest to create a new class (PrunedFilteredTreeScan?). It's probably easiest to have AndFilter and OrFilter parent nodes that can be arbitrarily nested, with the leaf nodes being the filters that are already used (hopefully with the addition of the NotEqualsFilter from SPARK-5306).

> Predicate Pushdown (BaseRelation) to have an interface that will accept OR filters
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-5296
>                 URL: https://issues.apache.org/jira/browse/SPARK-5296
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Corey J. Nolet
>
> Currently, the BaseRelation API allows a FilteredRelation to handle an Array[Filter], which represents filter expressions that are applied as an AND operator. We should support OR operations in a BaseRelation as well. I'm not sure what this would look like in terms of API changes, but it almost seems like a FilteredUnionedScan BaseRelation (the name stinks, but you get the idea) would be useful.
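The nested filter tree proposed in the comment can be sketched as a small algebraic data type. The names (`FilterNode`, `Leaf`, and the string-keyed operators) are illustrative stand-ins for the existing source-API filters, not actual Spark classes; a toy evaluator shows the intended AND/OR nesting semantics:

```scala
// A filter tree: AndFilter and OrFilter nodes nest arbitrarily, with the
// concrete comparison filters at the leaves.
sealed trait FilterNode
final case class Leaf(column: String, op: String, value: Any) extends FilterNode
final case class AndFilter(children: Seq[FilterNode]) extends FilterNode
final case class OrFilter(children: Seq[FilterNode]) extends FilterNode

// Toy evaluator over an in-memory row, to make the semantics concrete.
def eval(f: FilterNode, row: Map[String, Any]): Boolean = f match {
  case Leaf(c, "=", v)  => row.get(c).contains(v)
  case Leaf(c, "!=", v) => !row.get(c).contains(v) // the NotEqualsFilter idea from SPARK-5306
  case AndFilter(cs)    => cs.forall(eval(_, row))
  case OrFilter(cs)     => cs.exists(eval(_, row))
  case _                => false
}
```

A data source receiving such a tree could translate it recursively into its native query language, which the current flat `Array[Filter]` (implicit AND) cannot express for OR-of-AND predicates.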
[jira] [Comment Edited] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280802#comment-14280802 ]

Corey J. Nolet edited comment on SPARK-5260 at 1/16/15 8:48 PM:
----------------------------------------------------------------

bq. you can make the change and create a pull request.

I'd love to submit a pull request for this. Do you have a proposed name for the utility object?

bq. We do not add fix version(s) until it has been merged into our code base.

Noted. We're quite different in Accumulo; we require fix versions for each ticket.

> Expose JsonRDD.allKeysWithValueTypes() in a utility class
> ---------------------------------------------------------
>
>                 Key: SPARK-5260
>                 URL: https://issues.apache.org/jira/browse/SPARK-5260
[jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280802#comment-14280802 ]

Corey J. Nolet commented on SPARK-5260:
---------------------------------------

bq. you can make the change and create a pull request.

I'd love to submit a pull request for this. Do you have a proposed name for the utility object?

bq. We do not add fix version(s) until it has been merged into our code base.

Noted. We're quite different in Accumulo; we require fix versions for each ticket.

> Expose JsonRDD.allKeysWithValueTypes() in a utility class
> ---------------------------------------------------------
>
>                 Key: SPARK-5260
>                 URL: https://issues.apache.org/jira/browse/SPARK-5260
[jira] [Comment Edited] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280802#comment-14280802 ] Corey J. Nolet edited comment on SPARK-5260 at 1/16/15 8:52 PM: bq. you can make the change and create a pull request. I'd love to submit a pull request for this. Do you have a proposed name for the utility object? bq. We do not add fix version(s) until it has been merged into our code base. Noted. We're quite different in Accumulo- we require fix versions for each ticket. was (Author: sonixbp): bq. you can make the change and create a pull request. I've love to submit a pull request for this. Do you have a proposed name for the utility object? bq. We do not add fix version(s) until it has been merged into our code base. Noted. We're quite different in Accumulo- we require fix versions for each ticket. Expose JsonRDD.allKeysWithValueTypes() in a utility class -- Key: SPARK-5260 URL: https://issues.apache.org/jira/browse/SPARK-5260 Project: Spark Issue Type: Improvement Components: SQL Reporter: Corey J. Nolet I have found this method extremely useful when implementing my own strategy for inferring a schema from parsed json. For now, I've actually copied the method right out of the JsonRDD class into my own project but I think it would be immensely useful to keep the code in Spark and expose it publicly somewhere else- like an object called JsonSchema.
[jira] [Created] (SPARK-5296) Predicate Pushdown (BaseRelation) to have an interface that will accept OR filters
Corey J. Nolet created SPARK-5296: - Summary: Predicate Pushdown (BaseRelation) to have an interface that will accept OR filters Key: SPARK-5296 URL: https://issues.apache.org/jira/browse/SPARK-5296 Project: Spark Issue Type: Improvement Components: SQL Reporter: Corey J. Nolet Currently, the BaseRelation API allows a FilteredRelation to handle an Array[Filter] which represents filter expressions that are applied as an AND operator. We should support OR operations in a BaseRelation as well. I'm not sure what this would look like in terms of API changes, but it almost seems like a FilteredUnionedScan BaseRelation (the name stinks but you get the idea) would be useful.
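As a rough illustration of the idea in the ticket (the trait name FilteredUnionedScan is the placeholder floated above; the Filter case classes and signature below are assumptions for the sketch, not the BaseRelation API of the time):

```scala
// Hypothetical sketch only: a filter *tree* with an Or node, instead of an
// Array[Filter] whose elements are implicitly ANDed together.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class And(left: Filter, right: Filter) extends Filter
case class Or(left: Filter, right: Filter) extends Filter

// A relation that can push down the full tree, OR nodes included.
trait FilteredUnionedScan {
  // Return type elided; a real implementation would produce an RDD[Row].
  def buildScan(requiredColumns: Array[String], filter: Filter): Any
}
```

For what it's worth, later Spark releases did add `Or` (and `And`) filter case classes to `org.apache.spark.sql.sources`, which addresses part of what this ticket asks for.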
[jira] [Updated] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corey J. Nolet updated SPARK-5260: -- Description: I have found this method extremely useful when implementing my own strategy for inferring a schema from parsed json. For now, I've actually copied the method right out of the JsonRDD class into my own project but I think it would be immensely useful to keep the code in Spark and expose it publicly somewhere else- like an object called JsonSchema. (was: I have found this method extremely useful when implementing my own method for inferring a schema from parsed json. For now, I've actually copied the method right out of the JsonRDD class into my own project but I think it would be immensely useful to keep the code in Spark and expose it publicly somewhere else- like an object called JsonSchema.) Expose JsonRDD.allKeysWithValueTypes() in a utility class -- Key: SPARK-5260 URL: https://issues.apache.org/jira/browse/SPARK-5260 Project: Spark Issue Type: Improvement Components: SQL Reporter: Corey J. Nolet Fix For: 1.3.0 I have found this method extremely useful when implementing my own strategy for inferring a schema from parsed json. For now, I've actually copied the method right out of the JsonRDD class into my own project but I think it would be immensely useful to keep the code in Spark and expose it publicly somewhere else- like an object called JsonSchema.
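A rough sketch of what the proposed utility might look like. The object name JsonSchema is the one suggested in the ticket; the signature shown is an assumption for illustration, since the actual method is private to JsonRDD:

```scala
// Hypothetical: a public home for the schema-inference helper currently
// private to JsonRDD. The parameter and return types here are guesses.
object JsonSchema {
  // Given one parsed JSON record (key -> value map), return every key path
  // paired with the inferred type of its value. A real implementation would
  // delegate to (or relocate) JsonRDD.allKeysWithValueTypes.
  def allKeysWithValueTypes(record: Map[String, Any]): Set[(String, Any)] =
    sys.error("sketch only")
}
```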
[jira] [Created] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
Corey J. Nolet created SPARK-5140: - Summary: Two RDDs which are scheduled concurrently should be able to wait on parent in all cases Key: SPARK-5140 URL: https://issues.apache.org/jira/browse/SPARK-5140 Project: Spark Issue Type: New Feature Reporter: Corey J. Nolet Fix For: 1.3.0, 1.2.1 Not sure if this would change too much of the internals to be included in the 1.2.1 but it would be very helpful if it could be. This ticket is from a discussion between myself and [~ilikerps]. Here's the result of some testing that [~ilikerps] did: {code} I did some testing as well, and it turns out the wait for other guy to finish caching logic is on a per-task basis, and it only works on tasks that happen to be executing on the same machine. Once a partition is cached, we will schedule tasks that touch that partition on that executor. The problem here, though, is that the cache is in progress, and so the tasks are still scheduled randomly (or with whatever locality the data source has), so tasks which end up on different machines will not see that the cache is already in progress. Here was my test, by the way: import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent._ import scala.concurrent.duration._ val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache() val futures = (0 until 4).map { _ => Future { rdd.count } } Await.result(Future.sequence(futures), 120.second) Note that I run the future 4 times in parallel. I found that the first run has all tasks take 10 seconds. The second has about 50% of its tasks take 10 seconds, and the rest just wait for the first stage to finish. The last two runs have no tasks that take 10 seconds; all wait for the first two stages to finish. {code} What we want is the ability to fire off a job and have the DAG figure out that two RDDs depend on the same parent so that when the children are scheduled concurrently, the first one to start will activate the parent and both will wait on the parent. 
When the parent is done, they will both be able to finish their work concurrently. We are trying to use this pattern by having the parent cache results.
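The workaround described in the first comment on this ticket (walking the RDD DAG, caching any node with multiple children, then running the leaf actions in futures) might look roughly like the sketch below. The DAG walk is assumed to have been done by the caller; the function name and shape are illustrative, not Spark API:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import org.apache.spark.rdd.RDD

// Sketch: `children` maps each RDD in the DAG to its child RDDs (built by
// walking the lineage); `leaves` are the RDDs on which actions will run.
def runConcurrently(children: Map[RDD[_], Seq[RDD[_]]], leaves: Seq[RDD[_]]): Unit = {
  // Cache every node that fans out to more than one child, so concurrent
  // children reuse the parent's work instead of recomputing it.
  for ((parent, kids) <- children if kids.size > 1) parent.cache()
  // Fire the leaf actions concurrently and wait for all of them.
  val futures = leaves.map(leaf => Future { leaf.count() })
  Await.result(Future.sequence(futures), 120.seconds)
}
```

Note the caveat from the comment still applies: without scheduler support, the concurrent children may race to populate the cache rather than truly waiting on the parent.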
[jira] [Updated] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corey J. Nolet updated SPARK-5140: -- Description: Not sure if this would change too much of the internals to be included in the 1.2.1 but it would be very helpful if it could be. This ticket is from a discussion between myself and [~ilikerps]. Here's the result of some testing that [~ilikerps] did: {code} I did some testing as well, and it turns out the wait for other guy to finish caching logic is on a per-task basis, and it only works on tasks that happen to be executing on the same machine. Once a partition is cached, we will schedule tasks that touch that partition on that executor. The problem here, though, is that the cache is in progress, and so the tasks are still scheduled randomly (or with whatever locality the data source has), so tasks which end up on different machines will not see that the cache is already in progress. Here was my test, by the way: import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent._ import scala.concurrent.duration._ val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache() val futures = (0 until 4).map { _ => Future { rdd.count } } Await.result(Future.sequence(futures), 120.second) Note that I run the future 4 times in parallel. I found that the first run has all tasks take 10 seconds. The second has about 50% of its tasks take 10 seconds, and the rest just wait for the first stage to finish. The last two runs have no tasks that take 10 seconds; all wait for the first two stages to finish. {code} What we want is the ability to fire off a job and have the DAG figure out that two RDDs depend on the same parent so that when the children are scheduled concurrently, the first one to start will activate the parent and both will wait on the parent. When the parent is done, they will both be able to finish their work concurrently. 
We are trying to use this pattern by having the parent cache results. was: Not sure if this would change too much of the internals to be included in the 1.2.1 but it would be very helpful if it could be. This ticket is from a discussion between myself and [~ilikerps]. Here's the result of some testing that [~ilikerps] did: {pre} I did some testing as well, and it turns out the wait for other guy to finish caching logic is on a per-task basis, and it only works on tasks that happen to be executing on the same machine. Once a partition is cached, we will schedule tasks that touch that partition on that executor. The problem here, though, is that the cache is in progress, and so the tasks are still scheduled randomly (or with whatever locality the data source has), so tasks which end up on different machines will not see that the cache is already in progress. Here was my test, by the way: import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent._ import scala.concurrent.duration._ val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache() val futures = (0 until 4).map { _ => Future { rdd.count } } Await.result(Future.sequence(futures), 120.second) Note that I run the future 4 times in parallel. I found that the first run has all tasks take 10 seconds. The second has about 50% of its tasks take 10 seconds, and the rest just wait for the first stage to finish. The last two runs have no tasks that take 10 seconds; all wait for the first two stages to finish. {pre} What we want is the ability to fire off a job and have the DAG figure out that two RDDs depend on the same parent so that when the children are scheduled concurrently, the first one to start will activate the parent and both will wait on the parent. When the parent is done, they will both be able to finish their work concurrently. We are trying to use this pattern by having the parent cache results. 
Two RDDs which are scheduled concurrently should be able to wait on parent in all cases --- Key: SPARK-5140 URL: https://issues.apache.org/jira/browse/SPARK-5140 Project: Spark Issue Type: New Feature Reporter: Corey J. Nolet Labels: features Fix For: 1.3.0, 1.2.1 Not sure if this would change too much of the internals to be included in the 1.2.1 but it would be very helpful if it could be. This ticket is from a discussion between myself and [~ilikerps]. Here's the result of some testing that [~ilikerps] did: {code} I did some testing as well, and it turns out the wait for other guy to finish caching logic is on a per-task basis, and it only works on tasks that happen to be executing on the same machine. Once a partition is cached, we will schedule tasks that touch that partition on that
[jira] [Updated] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corey J. Nolet updated SPARK-5140: -- Description: Not sure if this would change too much of the internals to be included in the 1.2.1 but it would be very helpful if it could be. This ticket is from a discussion between myself and [~ilikerps]. Here's the result of some testing that [~ilikerps] did: {pre} I did some testing as well, and it turns out the wait for other guy to finish caching logic is on a per-task basis, and it only works on tasks that happen to be executing on the same machine. Once a partition is cached, we will schedule tasks that touch that partition on that executor. The problem here, though, is that the cache is in progress, and so the tasks are still scheduled randomly (or with whatever locality the data source has), so tasks which end up on different machines will not see that the cache is already in progress. Here was my test, by the way: import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent._ import scala.concurrent.duration._ val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache() val futures = (0 until 4).map { _ => Future { rdd.count } } Await.result(Future.sequence(futures), 120.second) Note that I run the future 4 times in parallel. I found that the first run has all tasks take 10 seconds. The second has about 50% of its tasks take 10 seconds, and the rest just wait for the first stage to finish. The last two runs have no tasks that take 10 seconds; all wait for the first two stages to finish. {pre} What we want is the ability to fire off a job and have the DAG figure out that two RDDs depend on the same parent so that when the children are scheduled concurrently, the first one to start will activate the parent and both will wait on the parent. When the parent is done, they will both be able to finish their work concurrently. 
We are trying to use this pattern by having the parent cache results. was: Not sure if this would change too much of the internals to be included in the 1.2.1 but it would be very helpful if it could be. This ticket is from a discussion between myself and [~ilikerps]. Here's the result of some testing that [~ilikerps] did: pre I did some testing as well, and it turns out the wait for other guy to finish caching logic is on a per-task basis, and it only works on tasks that happen to be executing on the same machine. Once a partition is cached, we will schedule tasks that touch that partition on that executor. The problem here, though, is that the cache is in progress, and so the tasks are still scheduled randomly (or with whatever locality the data source has), so tasks which end up on different machines will not see that the cache is already in progress. Here was my test, by the way: import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent._ import scala.concurrent.duration._ val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache() val futures = (0 until 4).map { _ => Future { rdd.count } } Await.result(Future.sequence(futures), 120.second) Note that I run the future 4 times in parallel. I found that the first run has all tasks take 10 seconds. The second has about 50% of its tasks take 10 seconds, and the rest just wait for the first stage to finish. The last two runs have no tasks that take 10 seconds; all wait for the first two stages to finish. /pre What we want is the ability to fire off a job and have the DAG figure out that two RDDs depend on the same parent so that when the children are scheduled concurrently, the first one to start will activate the parent and both will wait on the parent. When the parent is done, they will both be able to finish their work concurrently. We are trying to use this pattern by having the parent cache results. 
Two RDDs which are scheduled concurrently should be able to wait on parent in all cases --- Key: SPARK-5140 URL: https://issues.apache.org/jira/browse/SPARK-5140 Project: Spark Issue Type: New Feature Reporter: Corey J. Nolet Labels: features Fix For: 1.3.0, 1.2.1 Not sure if this would change too much of the internals to be included in the 1.2.1 but it would be very helpful if it could be. This ticket is from a discussion between myself and [~ilikerps]. Here's the result of some testing that [~ilikerps] did: {pre} I did some testing as well, and it turns out the wait for other guy to finish caching logic is on a per-task basis, and it only works on tasks that happen to be executing on the same machine. Once a partition is cached, we will schedule tasks that touch that partition on that executor.
[jira] [Updated] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corey J. Nolet updated SPARK-5140: -- Description: Not sure if this would change too much of the internals to be included in the 1.2.1 but it would be very helpful if it could be. This ticket is from a discussion between myself and [~ilikerps]. Here's the result of some testing that [~ilikerps] did: bq. I did some testing as well, and it turns out the wait for other guy to finish caching logic is on a per-task basis, and it only works on tasks that happen to be executing on the same machine. bq. Once a partition is cached, we will schedule tasks that touch that partition on that executor. The problem here, though, is that the cache is in progress, and so the tasks are still scheduled randomly (or with whatever locality the data source has), so tasks which end up on different machines will not see that the cache is already in progress. {code} Here was my test, by the way: import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent._ import scala.concurrent.duration._ val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache() val futures = (0 until 4).map { _ => Future { rdd.count } } Await.result(Future.sequence(futures), 120.second) {code} bq. Note that I run the future 4 times in parallel. I found that the first run has all tasks take 10 seconds. The second has about 50% of its tasks take 10 seconds, and the rest just wait for the first stage to finish. The last two runs have no tasks that take 10 seconds; all wait for the first two stages to finish. {code} What we want is the ability to fire off a job and have the DAG figure out that two RDDs depend on the same parent so that when the children are scheduled concurrently, the first one to start will activate the parent and both will wait on the parent. When the parent is done, they will both be able to finish their work concurrently. 
We are trying to use this pattern by having the parent cache results. was: Not sure if this would change too much of the internals to be included in the 1.2.1 but it would be very helpful if it could be. This ticket is from a discussion between myself and [~ilikerps]. Here's the result of some testing that [~ilikerps] did: {code} I did some testing as well, and it turns out the wait for other guy to finish caching logic is on a per-task basis, and it only works on tasks that happen to be executing on the same machine. Once a partition is cached, we will schedule tasks that touch that partition on that executor. The problem here, though, is that the cache is in progress, and so the tasks are still scheduled randomly (or with whatever locality the data source has), so tasks which end up on different machines will not see that the cache is already in progress. Here was my test, by the way: import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent._ import scala.concurrent.duration._ val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache() val futures = (0 until 4).map { _ => Future { rdd.count } } Await.result(Future.sequence(futures), 120.second) Note that I run the future 4 times in parallel. I found that the first run has all tasks take 10 seconds. The second has about 50% of its tasks take 10 seconds, and the rest just wait for the first stage to finish. The last two runs have no tasks that take 10 seconds; all wait for the first two stages to finish. {code} What we want is the ability to fire off a job and have the DAG figure out that two RDDs depend on the same parent so that when the children are scheduled concurrently, the first one to start will activate the parent and both will wait on the parent. When the parent is done, they will both be able to finish their work concurrently. We are trying to use this pattern by having the parent cache results. 
Two RDDs which are scheduled concurrently should be able to wait on parent in all cases --- Key: SPARK-5140 URL: https://issues.apache.org/jira/browse/SPARK-5140 Project: Spark Issue Type: New Feature Reporter: Corey J. Nolet Labels: features Fix For: 1.3.0, 1.2.1 Not sure if this would change too much of the internals to be included in the 1.2.1 but it would be very helpful if it could be. This ticket is from a discussion between myself and [~ilikerps]. Here's the result of some testing that [~ilikerps] did: bq. I did some testing as well, and it turns out the wait for other guy to finish caching logic is on a per-task basis, and it only works on tasks that happen to be executing on the same machine. bq. Once a partition is cached, we will schedule tasks that touch that
[jira] [Updated] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corey J. Nolet updated SPARK-5140: -- Description: Not sure if this would change too much of the internals to be included in the 1.2.1 but it would be very helpful if it could be. This ticket is from a discussion between myself and [~ilikerps]. Here's the result of some testing that [~ilikerps] did: bq. I did some testing as well, and it turns out the wait for other guy to finish caching logic is on a per-task basis, and it only works on tasks that happen to be executing on the same machine. bq. Once a partition is cached, we will schedule tasks that touch that partition on that executor. The problem here, though, is that the cache is in progress, and so the tasks are still scheduled randomly (or with whatever locality the data source has), so tasks which end up on different machines will not see that the cache is already in progress. {code} Here was my test, by the way: import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent._ import scala.concurrent.duration._ val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache() val futures = (0 until 4).map { _ => Future { rdd.count } } Await.result(Future.sequence(futures), 120.second) {code} bq. Note that I run the future 4 times in parallel. I found that the first run has all tasks take 10 seconds. The second has about 50% of its tasks take 10 seconds, and the rest just wait for the first stage to finish. The last two runs have no tasks that take 10 seconds; all wait for the first two stages to finish. What we want is the ability to fire off a job and have the DAG figure out that two RDDs depend on the same parent so that when the children are scheduled concurrently, the first one to start will activate the parent and both will wait on the parent. When the parent is done, they will both be able to finish their work concurrently. 
We are trying to use this pattern by having the parent cache results. was: Not sure if this would change too much of the internals to be included in the 1.2.1 but it would be very helpful if it could be. This ticket is from a discussion between myself and [~ilikerps]. Here's the result of some testing that [~ilikerps] did: bq. I did some testing as well, and it turns out the wait for other guy to finish caching logic is on a per-task basis, and it only works on tasks that happen to be executing on the same machine. bq. Once a partition is cached, we will schedule tasks that touch that partition on that executor. The problem here, though, is that the cache is in progress, and so the tasks are still scheduled randomly (or with whatever locality the data source has), so tasks which end up on different machines will not see that the cache is already in progress. {code} Here was my test, by the way: import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent._ import scala.concurrent.duration._ val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache() val futures = (0 until 4).map { _ => Future { rdd.count } } Await.result(Future.sequence(futures), 120.second) {code} bq. Note that I run the future 4 times in parallel. I found that the first run has all tasks take 10 seconds. The second has about 50% of its tasks take 10 seconds, and the rest just wait for the first stage to finish. The last two runs have no tasks that take 10 seconds; all wait for the first two stages to finish. {code} What we want is the ability to fire off a job and have the DAG figure out that two RDDs depend on the same parent so that when the children are scheduled concurrently, the first one to start will activate the parent and both will wait on the parent. When the parent is done, they will both be able to finish their work concurrently. We are trying to use this pattern by having the parent cache results. 
Two RDDs which are scheduled concurrently should be able to wait on parent in all cases --- Key: SPARK-5140 URL: https://issues.apache.org/jira/browse/SPARK-5140 Project: Spark Issue Type: New Feature Reporter: Corey J. Nolet Labels: features Fix For: 1.3.0, 1.2.1 Not sure if this would change too much of the internals to be included in the 1.2.1 but it would be very helpful if it could be. This ticket is from a discussion between myself and [~ilikerps]. Here's the result of some testing that [~ilikerps] did: bq. I did some testing as well, and it turns out the wait for other guy to finish caching logic is on a per-task basis, and it only works on tasks that happen to be executing on the same machine. bq. Once a partition is cached, we will schedule tasks that
[jira] [Commented] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object
[ https://issues.apache.org/jira/browse/SPARK-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208084#comment-14208084 ] Corey J. Nolet commented on SPARK-4320: --- Since this is a simple change, I wanted to work on this myself to get more familiar with the code base. Could someone w/ the proper privileges give me access to be able to assign this ticket to myself? JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object Key: SPARK-4320 URL: https://issues.apache.org/jira/browse/SPARK-4320 Project: Spark Issue Type: Improvement Components: Input/Output, Spark Core Reporter: Corey J. Nolet Fix For: 1.1.1, 1.2.0 I am outputting data to Accumulo using a custom OutputFormat. I have tried using saveAsNewHadoopFile() and that works- though passing an empty path is a bit weird. Being that it isn't really a file I'm storing, but rather a generic Pair dataset, I'd be inclined to use the saveAsHadoopDataset() method, though I'm not at all interested in using the legacy mapred API. Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think there should be two ways of calling into this method. Instead of forcing the user to always set up the Job object explicitly, I'm in the camp of having the following method signature: saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : Class[? extends OutputFormat], conf : Configuration). This way, if I'm writing spark jobs that are going from Hadoop back into Hadoop, I can construct my Configuration once. Perhaps an overloaded method signature could be: saveAsNewHadoopDataset(job : Job)
[jira] [Created] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object
Corey J. Nolet created SPARK-4320: - Summary: JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object Key: SPARK-4320 URL: https://issues.apache.org/jira/browse/SPARK-4320 Project: Spark Issue Type: Improvement Components: Input/Output, Spark Core Reporter: Corey J. Nolet Fix For: 1.1.1, 1.2.0 I am outputting data to Accumulo using a custom OutputFormat. I have tried using saveAsNewHadoopFile() and that works- though passing an empty path is a bit weird. Being that it isn't really a file I'm storing, but rather a dataset, I'd be inclined to use the saveAsHadoopDataset() method, though I'm not at all interested in using the legacy mapred API. Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think there should be two ways of calling into this method. Instead of needing to set up the Job object explicitly, I'm in the camp of having the following method signature: saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : Class[? extends OutputFormat], conf : Configuration). This way, if I'm writing spark jobs that are going from Hadoop back into Hadoop, I can construct my Configuration once. Perhaps an overloaded method signature could be: saveAsNewHadoopDataset(job : Job)
[jira] [Updated] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object
[ https://issues.apache.org/jira/browse/SPARK-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corey J. Nolet updated SPARK-4320: -- Description: I am outputting data to Accumulo using a custom OutputFormat. I have tried using saveAsNewHadoopFile() and that works- though passing an empty path is a bit weird. Being that it isn't really a file I'm storing, but rather a generic Pair dataset, I'd be inclined to use the saveAsHadoopDataset() method, though I'm not at all interested in using the legacy mapred API. Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think there should be two ways of calling into this method. Instead of forcing the user to always set up the Job object explicitly, I'm in the camp of having the following method signature: saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : Class[? extends OutputFormat], conf : Configuration). This way, if I'm writing spark jobs that are going from Hadoop back into Hadoop, I can construct my Configuration once. Perhaps an overloaded method signature could be: saveAsNewHadoopDataset(job : Job) was: I am outputting data to Accumulo using a custom outputformat. I have tried using saveAsNewHadoopFile() and that works- though passing an empty path is a bit weird. Being that it isn't really a file I'm store, but rather a dataset, I'd be inclined to use the saveAsHadoopDataset() method, though I'm not at all interested in using the legacy mapred API. Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think there should be two ways of calling into this method. Instead of needing to set up the Job object explicitly, I'm in the camp of having the following method signature: saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : Class[? extends OutputFormat], conf : Configuration). This way, if I'm writing spark jobs that are going from Hadoop back into Hadoop, I can construct my Configuration once. 
Perhaps an overloaded method signature could be: saveAsNewHadoopDataset(job : Job) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object Key: SPARK-4320 URL: https://issues.apache.org/jira/browse/SPARK-4320 Project: Spark Issue Type: Improvement Components: Input/Output, Spark Core Reporter: Corey J. Nolet Fix For: 1.1.1, 1.2.0 I am outputting data to Accumulo using a custom OutputFormat. I have tried using saveAsNewHadoopFile() and that works- though passing an empty path is a bit weird. Being that it isn't really a file I'm storing, but rather a generic Pair dataset, I'd be inclined to use the saveAsHadoopDataset() method, though I'm not at all interested in using the legacy mapred API. Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think there should be two ways of calling into this method. Instead of forcing the user to always set up the Job object explicitly, I'm in the camp of having the following method signature: saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : Class[? extends OutputFormat], conf : Configuration). This way, if I'm writing spark jobs that are going from Hadoop back into Hadoop, I can construct my Configuration once. Perhaps an overloaded method signature could be: saveAsNewHadoopDataset(job : Job)
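Restating the ticket's two proposed overloads as compilable Scala (the Java-style `Class[? extends OutputFormat]` above becomes a Scala upper bound; the enclosing trait name is invented purely to give the signatures a home, and this is not an actual Spark API):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{Job, OutputFormat}

// Sketch of the proposal: one overload that builds the Job from its pieces,
// and one that accepts a fully configured Job directly.
trait NewHadoopDatasetSupport[K, V] {
  def saveAsNewHadoopDataset(keyClass: Class[K],
                             valueClass: Class[V],
                             ofClass: Class[_ <: OutputFormat[K, V]],
                             conf: Configuration): Unit

  def saveAsNewHadoopDataset(job: Job): Unit
}
```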
[jira] [Commented] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.
[ https://issues.apache.org/jira/browse/SPARK-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203183#comment-14203183 ] Corey J. Nolet commented on SPARK-4289:
---
I suppose we could look at it as a Hadoop issue, though newing up a Job works fine when the Scala shell isn't doing the toString(). I'd have to dive in deeper to find out why the state seems to differ between the constructor and the toString() — and, more importantly, why it cares... I think :silent will work for the short term.

Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.
--
Key: SPARK-4289
URL: https://issues.apache.org/jira/browse/SPARK-4289
Project: Spark
Issue Type: Bug
Reporter: Corey J. Nolet

This one is easy to reproduce:

{code}
val job = new Job(sc.hadoopConfiguration)
{code}

I'm not sure what the solution would be off hand, as it's happening when the shell calls toString() on the instance of Job. The problem is that, because of the failure, the instance is never actually assigned to the job val.
{code}
java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
	at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
	at org.apache.hadoop.mapreduce.Job.toString(Job.java:452)
	at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
	at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
	at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
	at .<init>(<console>:10)
	at .<clinit>(<console>)
	at $print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
	at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
	at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
	at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
	at org.apache.spark.repl.Main$.main(Main.scala:31)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}
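The stack trace above shows why the reproduction fails: the shell's read-eval-print loop calls replStringOf on the result of each line, which invokes Job.toString(), and Job.toString() calls ensureState and throws while the job is still in DEFINE state. The `:silent` workaround mentioned in the comment can be shown as a short shell transcript. This is a sketch: the failing line and the `:silent` command come from the ticket itself; the transcript framing is illustrative.

```scala
// spark-shell transcript (sketch). The REPL prints the result of each line,
// which triggers Job.toString() -> ensureState(RUNNING) and throws, so the
// val is never bound:
//
//   scala> val job = new Job(sc.hadoopConfiguration)
//   java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
//
// Workaround from the comment: suppress result printing before creating the Job.
//
//   scala> :silent
//   scala> val job = new Job(sc.hadoopConfiguration)  // no toString() call, binding succeeds
```

Since `:silent` only toggles result printing, the Job itself is still in DEFINE state afterwards, which is the correct state for a job that has been configured but not yet submitted.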
[jira] [Created] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.
Corey J. Nolet created SPARK-4289:
---------------------------------
Summary: Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.
Key: SPARK-4289
URL: https://issues.apache.org/jira/browse/SPARK-4289
Project: Spark
Issue Type: Bug
Reporter: Corey J. Nolet

This one is easy to reproduce:

{code}
val job = new Job(sc.hadoopConfiguration)
{code}

I'm not sure what the solution would be off hand, as it's happening when the shell calls toString() on the instance of Job. The problem is that, because of the failure, the instance is never actually assigned to the job val.

{code}
java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
	at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
	at org.apache.hadoop.mapreduce.Job.toString(Job.java:452)
	at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
	at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
	at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
	at .<init>(<console>:10)
	at .<clinit>(<console>)
	at $print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
	at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
	at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
	at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
	at org.apache.spark.repl.Main$.main(Main.scala:31)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}