[jira] [Commented] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases

2015-02-27 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341005#comment-14341005
 ] 

Corey J. Nolet commented on SPARK-5140:
---

Wanted to mention that I did manage to get concurrency working by walking down 
the RDD DAG and firing off futures for the concurrent work. It required some 
sleeping so that threads could wait while parents might still be doing their 
work.

I also have it checking for multiple children at each node and, if multiple 
children exist, caching that node automatically.
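
For illustration, a minimal sketch of that kind of DAG walk (the helper name 
and the sink wiring are assumptions for the sketch, not Spark or framework 
API):

{code}
import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Count how many children each upstream RDD has by walking the dependency
// graph from the leaf (sink) RDDs, then cache every RDD that feeds more than
// one child before the leaf actions are fired off in futures.
def cacheSharedParents(leaves: Seq[RDD[_]]): Unit = {
  val childCount = mutable.Map.empty[Int, Int].withDefaultValue(0)
  val visited = mutable.Set.empty[Int]

  def count(rdd: RDD[_]): Unit = rdd.dependencies.foreach { dep =>
    childCount(dep.rdd.id) += 1
    if (visited.add(dep.rdd.id)) count(dep.rdd)
  }
  leaves.foreach(count)

  visited.clear()
  def mark(rdd: RDD[_]): Unit = rdd.dependencies.foreach { dep =>
    if (childCount(dep.rdd.id) > 1) dep.rdd.cache()
    if (visited.add(dep.rdd.id)) mark(dep.rdd)
  }
  leaves.foreach(mark)
}

// usage: cacheSharedParents(sinks); then submit each sink's action from its
// own Future so shared parents are cached rather than recomputed per child.
{code}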


 Two RDDs which are scheduled concurrently should be able to wait on parent in 
 all cases
 ---

 Key: SPARK-5140
 URL: https://issues.apache.org/jira/browse/SPARK-5140
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Corey J. Nolet
  Labels: features

 Not sure if this would change too much of the internals to be included in the 
 1.2.1 but it would be very helpful if it could be.
 This ticket is from a discussion between myself and [~ilikerps]. Here's the 
 result of some testing that [~ilikerps] did:
 bq. I did some testing as well, and it turns out the "wait for other guy to 
 finish caching" logic is on a per-task basis, and it only works on tasks that 
 happen to be executing on the same machine. 
 bq. Once a partition is cached, we will schedule tasks that touch that 
 partition on that executor. The problem here, though, is that the cache is in 
 progress, and so the tasks are still scheduled randomly (or with whatever 
 locality the data source has), so tasks which end up on different machines 
 will not see that the cache is already in progress.
 {code}
 Here was my test, by the way:
 import scala.concurrent.ExecutionContext.Implicits.global
 import scala.concurrent._
 import scala.concurrent.duration._
 val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i 
 }).cache()
 val futures = (0 until 4).map { _ => Future { rdd.count } }
 Await.result(Future.sequence(futures), 120.second)
 {code}
 bq. Note that I run the future 4 times in parallel. I found that the first 
 run has all tasks take 10 seconds. The second has about 50% of its tasks take 
 10 seconds, and the rest just wait for the first stage to finish. The last 
 two runs have no tasks that take 10 seconds; all wait for the first two 
 stages to finish.
 What we want is the ability to fire off a job and have the DAG figure out 
 that two RDDs depend on the same parent so that when the children are 
 scheduled concurrently, the first one to start will activate the parent and 
 both will wait on the parent. When the parent is done, they will both be able 
 to finish their work concurrently. We are trying to use this pattern by 
 having the parent cache results.
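
Until the scheduler can synchronize on an in-progress parent, one driver-side 
workaround is to force the shared parent to cache before the children are 
submitted. A minimal sketch based on the test above (spark-shell sc assumed; 
the sleep and timeout values mirror that test):

{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Materialize the shared parent fully first, so the concurrent child jobs
// only read from the cache instead of recomputing the parent per job.
val parent = sc.parallelize(0 until 8).map { i => Thread.sleep(10000); i }.cache()
parent.count()  // blocks until every partition has been computed and cached

val children = (0 until 4).map { _ => Future { parent.count() } }
Await.result(Future.sequence(children), 120.seconds)
{code}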






[jira] [Commented] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object

2015-02-25 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337540#comment-14337540
 ] 

Corey J. Nolet commented on SPARK-4320:
---

Sorry- this ticket should have been closed a while ago. I'll go ahead and close 
it now. 

 JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object 
 

 Key: SPARK-4320
 URL: https://issues.apache.org/jira/browse/SPARK-4320
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, Spark Core
Reporter: Corey J. Nolet

 I am outputting data to Accumulo using a custom OutputFormat. I have tried 
 using saveAsNewHadoopFile() and that works- though passing an empty path is a 
 bit weird. Being that it isn't really a file I'm storing, but rather a  
 generic Pair dataset, I'd be inclined to use the saveAsHadoopDataset() 
 method, though I'm not at all interested in using the legacy mapred API.
 Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think 
 there should be two ways of calling into this method. Instead of forcing the 
 user to always set up the Job object explicitly, I'm in the camp of having 
 the following method signature:
 saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : 
 Class[? extends OutputFormat], conf : Configuration). This way, if I'm 
 writing spark jobs that are going from Hadoop back into Hadoop, I can 
 construct my Configuration once.
 Perhaps an overloaded method signature could be:
 saveAsNewHadoopDataset(job : Job)
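
For reference, a sketch of how this can already be expressed with the 
Configuration-based saveAsNewAPIHadoopDataset on pair RDDs, assuming a Hadoop 
2 Job; NullOutputFormat stands in for a store-backed format such as the 
Accumulo one, so swap in the real OutputFormat and its configuration keys:

{code}
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat

// Configure the new-API Job; format-specific settings go on job.getConfiguration.
val job = Job.getInstance()
job.setOutputFormatClass(classOf[NullOutputFormat[Text, Text]])
job.setOutputKeyClass(classOf[Text])
job.setOutputValueClass(classOf[Text])

val pairs = sc.parallelize(Seq(("a", "1"), ("b", "2")))
  .map { case (k, v) => (new Text(k), new Text(v)) }

// No output path is needed when the OutputFormat writes to an external store.
pairs.saveAsNewAPIHadoopDataset(job.getConfiguration)
{code}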






[jira] [Closed] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object

2015-02-25 Thread Corey J. Nolet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corey J. Nolet closed SPARK-4320.
-
  Resolution: Won't Fix
Target Version/s: 1.2.1, 1.1.2  (was: 1.1.2, 1.2.1)

 JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object 
 

 Key: SPARK-4320
 URL: https://issues.apache.org/jira/browse/SPARK-4320
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, Spark Core
Reporter: Corey J. Nolet

 I am outputting data to Accumulo using a custom OutputFormat. I have tried 
 using saveAsNewHadoopFile() and that works- though passing an empty path is a 
 bit weird. Being that it isn't really a file I'm storing, but rather a  
 generic Pair dataset, I'd be inclined to use the saveAsHadoopDataset() 
 method, though I'm not at all interested in using the legacy mapred API.
 Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think 
 there should be two ways of calling into this method. Instead of forcing the 
 user to always set up the Job object explicitly, I'm in the camp of having 
 the following method signature:
 saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : 
 Class[? extends OutputFormat], conf : Configuration). This way, if I'm 
 writing spark jobs that are going from Hadoop back into Hadoop, I can 
 construct my Configuration once.
 Perhaps an overloaded method signature could be:
 saveAsNewHadoopDataset(job : Job)






[jira] [Comment Edited] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases

2015-02-24 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335518#comment-14335518
 ] 

Corey J. Nolet edited comment on SPARK-5140 at 2/24/15 9:50 PM:


I have a framework (similar to Cascading) where users wire together their RDD 
transformations in reusable components, and the framework connects them based 
on the dependencies they define. There is a notion of sinks (data outputs), 
which are also encapsulated in reusable componentry. A sink depends on 
sources; the sinks are what actually get executed. I wanted to execute them in 
parallel and have Spark be smart enough to figure out what needs to block 
(whenever there's a cached RDD) to make that possible. We're comparing the 
same data transformations against MapReduce code written in JAQL, and the 
single-threaded execution of the sinks is causing absolutely wretched run 
times.

I'm not saying an internal requirement from my framework should necessarily be 
levied against Spark's features; however, this change (which seems like it 
would only affect driver-side scheduling) would allow the SparkContext to be 
truly thread-safe. I've tried running jobs that over-utilize the resources in 
parallel until one of them fully caches, and it really just slows things down 
and uses too many resources. I can't see why anyone submitting RDDs in 
different threads that depend on cached RDDs wouldn't want those threads to 
block for the parents, but maybe I'm not thinking abstractly enough.

For now, I'm going to propagate down my data source tree breadth-first and 
no-op those sources that return CachedRDDs so that their children can be 
scheduled in different threads.


was (Author: sonixbp):
I have a framework (similar to cascading) where users wire together their RDD 
transformations in reusable components and the framework wires them up together 
based on dependencies that they define. There is a notion of sinks- or data 
outputs and they are also encapsulated in reusable componentry. A sink depends 
on sources- The sinks are actually what get executed. I wanted to be able to 
execute them in parallel and just have spark be smart enough to figure out what 
needs to block (whenever there's a cached rdd) in order to make that possible. 
We're comparing the same data transformations to MapReduce that was written in 
JAQL and the single-threaded execution of the sinks is causing absolutely 
wretched run times.

I'm not saying an internal requirement from my framework should necessaily be 
levied against the spark features- however, this would change (seeming like it 
would only affect the driver side scheduling) would allow the spark context to 
be truly thread-safe. I've tried running jobs that over-utilize the resources 
in parallel until one of them fully caches and it really just slows things down 
and uses too many resources- I can't see why anyone submitting rdds in 
different threads that depend on cached RDDs wouldn't want those threads to 
block for the parents but maybe I'm not thinking abstract enough. 

For now, I'm going to propagate down my data source tree breadth first and 
no-op those sources that return CachedRDDs so that their children can be 
scheduled in different threads. 

 Two RDDs which are scheduled concurrently should be able to wait on parent in 
 all cases
 ---

 Key: SPARK-5140
 URL: https://issues.apache.org/jira/browse/SPARK-5140
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Corey J. Nolet
  Labels: features

 Not sure if this would change too much of the internals to be included in the 
 1.2.1 but it would be very helpful if it could be.
 This ticket is from a discussion between myself and [~ilikerps]. Here's the 
 result of some testing that [~ilikerps] did:
 bq. I did some testing as well, and it turns out the "wait for other guy to 
 finish caching" logic is on a per-task basis, and it only works on tasks that 
 happen to be executing on the same machine. 
 bq. Once a partition is cached, we will schedule tasks that touch that 
 partition on that executor. The problem here, though, is that the cache is in 
 progress, and so the tasks are still scheduled randomly (or with whatever 
 locality the data source has), so tasks which end up on different machines 
 will not see that the cache is already in progress.
 {code}
 Here was my test, by the way:
 import scala.concurrent.ExecutionContext.Implicits.global
 import scala.concurrent._
 import scala.concurrent.duration._
 val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i 
 }).cache()
 val futures = (0 until 4).map { _ => Future { rdd.count } }
 Await.result(Future.sequence(futures), 

[jira] [Commented] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases

2015-02-24 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335518#comment-14335518
 ] 

Corey J. Nolet commented on SPARK-5140:
---

I have a framework (similar to Cascading) where users wire together their RDD 
transformations in reusable components, and the framework connects them based 
on the dependencies they define. There is a notion of sinks (data outputs), 
which are also encapsulated in reusable componentry. A sink depends on 
sources; the sinks are what actually get executed. I wanted to execute them in 
parallel and have Spark be smart enough to figure out what needs to block 
(whenever there's a cached RDD) to make that possible. We're comparing the 
same data transformations against MapReduce code written in JAQL, and the 
single-threaded execution of the sinks is causing absolutely wretched run 
times.

I'm not saying an internal requirement from my framework should necessarily be 
levied against Spark's features; however, this change (which seems like it 
would only affect driver-side scheduling) would allow the SparkContext to be 
truly thread-safe. I've tried running jobs that over-utilize the resources in 
parallel until one of them fully caches, and it really just slows things down 
and uses too many resources. I can't see why anyone submitting RDDs in 
different threads that depend on cached RDDs wouldn't want those threads to 
block for the parents, but maybe I'm not thinking abstractly enough.

For now, I'm going to propagate down my data source tree breadth-first and 
no-op those sources that return CachedRDDs so that their children can be 
scheduled in different threads.
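
A rough sketch of that breadth-first pass (the Source trait models the 
framework described above and is an assumption for the sketch, not Spark API):

{code}
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Each source exposes an RDD and its downstream sources.
trait Source { def rdd: RDD[_]; def children: Seq[Source] }

// Breadth-first: materialize a source only if its RDD has not already been
// marked for caching; already-cached sources are no-ops, so their children
// can then be scheduled safely from separate threads.
def materializeTree(root: Source): Unit = {
  val queue = mutable.Queue(root)
  while (queue.nonEmpty) {
    val s = queue.dequeue()
    if (s.rdd.getStorageLevel == StorageLevel.NONE) {
      s.rdd.cache()
      s.rdd.count()  // force this source before descending
    }
    queue.enqueue(s.children: _*)
  }
}
{code}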

 Two RDDs which are scheduled concurrently should be able to wait on parent in 
 all cases
 ---

 Key: SPARK-5140
 URL: https://issues.apache.org/jira/browse/SPARK-5140
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Corey J. Nolet
  Labels: features

 Not sure if this would change too much of the internals to be included in the 
 1.2.1 but it would be very helpful if it could be.
 This ticket is from a discussion between myself and [~ilikerps]. Here's the 
 result of some testing that [~ilikerps] did:
 bq. I did some testing as well, and it turns out the "wait for other guy to 
 finish caching" logic is on a per-task basis, and it only works on tasks that 
 happen to be executing on the same machine. 
 bq. Once a partition is cached, we will schedule tasks that touch that 
 partition on that executor. The problem here, though, is that the cache is in 
 progress, and so the tasks are still scheduled randomly (or with whatever 
 locality the data source has), so tasks which end up on different machines 
 will not see that the cache is already in progress.
 {code}
 Here was my test, by the way:
 import scala.concurrent.ExecutionContext.Implicits.global
 import scala.concurrent._
 import scala.concurrent.duration._
 val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i 
 }).cache()
 val futures = (0 until 4).map { _ => Future { rdd.count } }
 Await.result(Future.sequence(futures), 120.second)
 {code}
 bq. Note that I run the future 4 times in parallel. I found that the first 
 run has all tasks take 10 seconds. The second has about 50% of its tasks take 
 10 seconds, and the rest just wait for the first stage to finish. The last 
 two runs have no tasks that take 10 seconds; all wait for the first two 
 stages to finish.
 What we want is the ability to fire off a job and have the DAG figure out 
 that two RDDs depend on the same parent so that when the children are 
 scheduled concurrently, the first one to start will activate the parent and 
 both will wait on the parent. When the parent is done, they will both be able 
 to finish their work concurrently. We are trying to use this pattern by 
 having the parent cache results.






[jira] [Commented] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases

2015-02-03 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14304510#comment-14304510
 ] 

Corey J. Nolet commented on SPARK-5140:
---

I think the problem is that when actions are performed on RDDs in multiple 
threads, the SparkContext on the driver that's scheduling the DAG should be 
able to see that the two RDDs depend on the same parents and synchronize them 
so that only one will run at a time, whether the parent is being cached or not 
(you'd assume the parent would be getting cached, but I think this change 
would still help cases where it hasn't been).

The fact that I did:

val rdd1 = input data -> transform data -> groupBy -> etc... -> cache
val rdd2 = future { rdd1.transform.groupBy.saveAsSequenceFile() }
val rdd3 = future { rdd1.transform.groupBy.saveAsSequenceFile() }

has unexpected results: rdd1 is assigned an id and run completely separately 
for rdd2 and rdd3. I would have expected, whether cached or not, that when run 
in separate threads, rdd1 would be assigned an id, then rdd2 would cause it to 
begin running through its stages, and rdd3 would pause because it's waiting on 
rdd1's id to complete its stages. What I see instead is that, after rdd2 and 
rdd3 both run concurrently calculating rdd1, the storage for rdd1 shows 200% 
cached. It causes issues when I have 50 or so RDDs calling 
saveAsSequenceFile() that all have different shared dependencies on parent 
RDDs (which may not always be known at creation time without introspecting 
them in my own tree). 

Now I've basically got to do the scheduling myself: I've got to determine what 
depends on what and run things concurrently on my own. It seems like the DAG 
scheduler should know this already and be able to make use of it.
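
A runnable approximation of that pattern, for reference (the concrete 
transformations and output paths are stand-ins for the real pipeline):

{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import org.apache.spark.SparkContext._

// Both child jobs start before rdd1 has finished caching, so with the current
// scheduler rdd1's stages are run once per child instead of once in total.
val rdd1 = sc.parallelize(0 until 8)
  .map { i => Thread.sleep(10000); (i % 2, i) }
  .groupByKey()
  .cache()

val child1 = Future { rdd1.mapValues(_.sum).saveAsSequenceFile("/tmp/out1") }
val child2 = Future { rdd1.mapValues(_.size).saveAsSequenceFile("/tmp/out2") }
Await.result(Future.sequence(Seq(child1, child2)), 10.minutes)
{code}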

 Two RDDs which are scheduled concurrently should be able to wait on parent in 
 all cases
 ---

 Key: SPARK-5140
 URL: https://issues.apache.org/jira/browse/SPARK-5140
 Project: Spark
  Issue Type: New Feature
Reporter: Corey J. Nolet
  Labels: features

 Not sure if this would change too much of the internals to be included in the 
 1.2.1 but it would be very helpful if it could be.
 This ticket is from a discussion between myself and [~ilikerps]. Here's the 
 result of some testing that [~ilikerps] did:
 bq. I did some testing as well, and it turns out the "wait for other guy to 
 finish caching" logic is on a per-task basis, and it only works on tasks that 
 happen to be executing on the same machine. 
 bq. Once a partition is cached, we will schedule tasks that touch that 
 partition on that executor. The problem here, though, is that the cache is in 
 progress, and so the tasks are still scheduled randomly (or with whatever 
 locality the data source has), so tasks which end up on different machines 
 will not see that the cache is already in progress.
 {code}
 Here was my test, by the way:
 import scala.concurrent.ExecutionContext.Implicits.global
 import scala.concurrent._
 import scala.concurrent.duration._
 val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i 
 }).cache()
 val futures = (0 until 4).map { _ => Future { rdd.count } }
 Await.result(Future.sequence(futures), 120.second)
 {code}
 bq. Note that I run the future 4 times in parallel. I found that the first 
 run has all tasks take 10 seconds. The second has about 50% of its tasks take 
 10 seconds, and the rest just wait for the first stage to finish. The last 
 two runs have no tasks that take 10 seconds; all wait for the first two 
 stages to finish.
 What we want is the ability to fire off a job and have the DAG figure out 
 that two RDDs depend on the same parent so that when the children are 
 scheduled concurrently, the first one to start will activate the parent and 
 both will wait on the parent. When the parent is done, they will both be able 
 to finish their work concurrently. We are trying to use this pattern by 
 having the parent cache results.






[jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class

2015-02-03 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14304577#comment-14304577
 ] 

Corey J. Nolet commented on SPARK-5260:
---

I'm thinking all the schema-specific functions should be pulled out into an 
object called JsonSchemaFunctions. The allKeysWithValueTypes() and 
createSchema() functions should be exposed via the public API and documented 
well based on their use.

In the project that's using these functions, I actually apply 
allKeysWithValueTypes() over my entire RDD as it's being saved to a sequence 
file, using an Accumulator[Set[(String, DataType)]] that aggregates all the 
schema elements for the RDD into a final Set. I can then store off the schema 
and later call createSchema() to get the final StructType that can be used 
with the SQL table. I also had to write an isConflicted(Set[(String, 
DataType)]) function to determine whether a JSON object or JSON array was also 
encountered as a primitive type in one of the records in the RDD, or vice 
versa.
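
A sketch of that aggregation, assuming the 1.2 package layout for DataType (it 
moves to org.apache.spark.sql.types in 1.3); allKeysWithValueTypes and 
createSchema stand for the JsonRDD helpers this ticket asks to expose:

{code}
import org.apache.spark.AccumulatorParam
import org.apache.spark.sql.catalyst.types.DataType

// Accumulator that unions per-record sets of (key path, inferred type) pairs.
object SchemaSetParam extends AccumulatorParam[Set[(String, DataType)]] {
  def zero(initial: Set[(String, DataType)]): Set[(String, DataType)] = Set.empty
  def addInPlace(s1: Set[(String, DataType)],
                 s2: Set[(String, DataType)]): Set[(String, DataType)] = s1 ++ s2
}

val schemaAcc = sc.accumulator(Set.empty[(String, DataType)])(SchemaSetParam)

// While the RDD is being written out, also collect each record's schema
// elements, then build the final StructType once on the driver:
// parsedJson.foreach { record => schemaAcc += allKeysWithValueTypes(record) }
// val schema = createSchema(schemaAcc.value)
{code}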

 Expose JsonRDD.allKeysWithValueTypes() in a utility class 
 --

 Key: SPARK-5260
 URL: https://issues.apache.org/jira/browse/SPARK-5260
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Corey J. Nolet
Assignee: Corey J. Nolet

 I have found this method extremely useful when implementing my own strategy 
 for inferring a schema from parsed json. For now, I've actually copied the 
 method right out of the JsonRDD class into my own project but I think it 
 would be immensely useful to keep the code in Spark and expose it publicly 
 somewhere else- like an object called JsonSchema.
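
One possible shape for the proposed object (the signature here is an 
assumption based on how the JsonRDD helper is used, and the body would simply 
be the existing private code moved over):

{code}
import org.apache.spark.sql.catalyst.types.DataType  // org.apache.spark.sql.types as of 1.3

object JsonSchema {
  /** All (key path, inferred type) pairs present in one parsed JSON record. */
  def allKeysWithValueTypes(record: Map[String, Any]): Set[(String, DataType)] =
    ???  // body moved verbatim from the private JsonRDD.allKeysWithValueTypes
}
{code}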






[jira] [Commented] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases

2015-02-03 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14303548#comment-14303548
 ] 

Corey J. Nolet commented on SPARK-5140:
---

Is anyone against this behavior for any reason?

 Two RDDs which are scheduled concurrently should be able to wait on parent in 
 all cases
 ---

 Key: SPARK-5140
 URL: https://issues.apache.org/jira/browse/SPARK-5140
 Project: Spark
  Issue Type: New Feature
Reporter: Corey J. Nolet
  Labels: features
 Fix For: 1.3.0, 1.2.1


 Not sure if this would change too much of the internals to be included in the 
 1.2.1 but it would be very helpful if it could be.
 This ticket is from a discussion between myself and [~ilikerps]. Here's the 
 result of some testing that [~ilikerps] did:
 bq. I did some testing as well, and it turns out the "wait for other guy to 
 finish caching" logic is on a per-task basis, and it only works on tasks that 
 happen to be executing on the same machine. 
 bq. Once a partition is cached, we will schedule tasks that touch that 
 partition on that executor. The problem here, though, is that the cache is in 
 progress, and so the tasks are still scheduled randomly (or with whatever 
 locality the data source has), so tasks which end up on different machines 
 will not see that the cache is already in progress.
 {code}
 Here was my test, by the way:
 import scala.concurrent.ExecutionContext.Implicits.global
 import scala.concurrent._
 import scala.concurrent.duration._
 val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i 
 }).cache()
 val futures = (0 until 4).map { _ => Future { rdd.count } }
 Await.result(Future.sequence(futures), 120.second)
 {code}
 bq. Note that I run the future 4 times in parallel. I found that the first 
 run has all tasks take 10 seconds. The second has about 50% of its tasks take 
 10 seconds, and the rest just wait for the first stage to finish. The last 
 two runs have no tasks that take 10 seconds; all wait for the first two 
 stages to finish.
 What we want is the ability to fire off a job and have the DAG figure out 
 that two RDDs depend on the same parent so that when the children are 
 scheduled concurrently, the first one to start will activate the parent and 
 both will wait on the parent. When the parent is done, they will both be able 
 to finish their work concurrently. We are trying to use this pattern by 
 having the parent cache results.






[jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class

2015-01-22 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14288621#comment-14288621
 ] 

Corey J. Nolet commented on SPARK-5260:
---

May I be added to the proper list so that I can assign this ticket to myself?

 Expose JsonRDD.allKeysWithValueTypes() in a utility class 
 --

 Key: SPARK-5260
 URL: https://issues.apache.org/jira/browse/SPARK-5260
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Corey J. Nolet

 I have found this method extremely useful when implementing my own strategy 
 for inferring a schema from parsed json. For now, I've actually copied the 
 method right out of the JsonRDD class into my own project but I think it 
 would be immensely useful to keep the code in Spark and expose it publicly 
 somewhere else- like an object called JsonSchema.






[jira] [Created] (SPARK-5305) Using a field in a WHERE clause that is not in the schema does not throw an exception.

2015-01-17 Thread Corey J. Nolet (JIRA)
Corey J. Nolet created SPARK-5305:
-

 Summary: Using a field in a WHERE clause that is not in the schema 
does not throw an exception.
 Key: SPARK-5305
 URL: https://issues.apache.org/jira/browse/SPARK-5305
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Corey J. Nolet


Given a schema:

key1 = String
key2 = Integer

The following SQL statement doesn't seem to throw an exception:

SELECT * FROM myTable WHERE doesntExist = 'val1'
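
A small reproduction sketch against the 1.2 SQL API (table and column names 
are illustrative):

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._

case class Record(key1: String, key2: Int)
sc.parallelize(Seq(Record("a", 1), Record("b", 2))).registerTempTable("myTable")

// Expected: an analysis error for the unknown column; as reported above, the
// query instead runs without throwing.
sqlContext.sql("SELECT * FROM myTable WHERE doesntExist = 'val1'").collect()
{code}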







[jira] [Updated] (SPARK-5306) Support for a NotEqualsFilter in the filter PrunedFilteredScan pushdown

2015-01-17 Thread Corey J. Nolet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corey J. Nolet updated SPARK-5306:
--
Component/s: SQL

 Support for a NotEqualsFilter in the filter PrunedFilteredScan pushdown
 ---

 Key: SPARK-5306
 URL: https://issues.apache.org/jira/browse/SPARK-5306
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Corey J. Nolet

 This would be a pretty significant addition to the Filters that get pushed 
 down.
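
To make the proposal concrete, an illustrative stand-alone mirror of the 
sources.Filter shape (these case classes are not the actual Spark API) showing 
where a not-equals case would slot in next to the existing leaf filters:

{code}
sealed trait Filter
case class EqualTo(attribute: String, value: Any)    extends Filter
case class NotEqualTo(attribute: String, value: Any) extends Filter // proposed addition

// A PrunedFilteredScan-style translation into a data-source predicate string:
def translate(f: Filter): Option[String] = f match {
  case EqualTo(attr, v)    => Some(s"$attr == $v")
  case NotEqualTo(attr, v) => Some(s"$attr != $v")
  case _                   => None // anything unhandled falls back to Spark's own evaluation
}
{code}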






[jira] [Created] (SPARK-5306) Support for a NotEqualsFilter in the filter PrunedFilteredScan pushdown

2015-01-17 Thread Corey J. Nolet (JIRA)
Corey J. Nolet created SPARK-5306:
-

 Summary: Support for a NotEqualsFilter in the filter 
PrunedFilteredScan pushdown
 Key: SPARK-5306
 URL: https://issues.apache.org/jira/browse/SPARK-5306
 Project: Spark
  Issue Type: Improvement
Reporter: Corey J. Nolet


This would be a pretty significant addition to the Filters that get pushed down.






[jira] [Commented] (SPARK-5296) Predicate Pushdown (BaseRelation) to have an interface that will accept OR filters

2015-01-17 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281621#comment-14281621
 ] 

Corey J. Nolet commented on SPARK-5296:
---

The more I think about this, the more it seems it would be nice if a tree were 
pushed down for the filters instead of an Array. This is a significant change 
to the API, so it would probably be easiest to create a new class 
(PrunedFilteredTreeScan?).

The simplest shape is probably AndFilter and OrFilter parent nodes that can be 
arbitrarily nested, with the leaf nodes being the filters that are already 
used (hopefully with the addition of the NotEqualsFilter from SPARK-5306).
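
An illustrative sketch of the proposed tree (the names follow the comment 
above and are not existing Spark classes):

{code}
sealed trait FilterNode
case class AndFilter(children: Seq[FilterNode])            extends FilterNode
case class OrFilter(children: Seq[FilterNode])             extends FilterNode
case class Leaf(attribute: String, op: String, value: Any) extends FilterNode

// e.g. (a = 1 AND (b = 2 OR c != 3)) pushed down as a single tree; a
// PrunedFilteredTreeScan-style relation would receive the root node and
// translate it recursively for the underlying store.
val tree = AndFilter(Seq(
  Leaf("a", "=", 1),
  OrFilter(Seq(Leaf("b", "=", 2), Leaf("c", "!=", 3)))
))
{code}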

 Predicate Pushdown (BaseRelation) to have an interface that will accept OR 
 filters
 --

 Key: SPARK-5296
 URL: https://issues.apache.org/jira/browse/SPARK-5296
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Corey J. Nolet

 Currently, the BaseRelation API allows a FilteredRelation to handle an 
 Array[Filter] which represents filter expressions that are applied as an AND 
 operator.
 We should support OR operations in a BaseRelation as well. I'm not sure what 
 this would look like in terms of API changes, but it almost seems like a 
 FilteredUnionedScan BaseRelation (the name stinks but you get the idea) would 
 be useful.






[jira] [Comment Edited] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class

2015-01-16 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280802#comment-14280802
 ] 

Corey J. Nolet edited comment on SPARK-5260 at 1/16/15 8:48 PM:


bq.  you can make the change and create a pull request.

I've love to submit a pull request for this. Do you have a proposed name for 
the utility object?

bq. We do not add fix version(s) until it has been merged into our code base.

Noted. We're quite different in Accumulo- we require fix versions for each 
ticket.


was (Author: sonixbp):
bq.  you can make the change and create a pull request.

I've love to submit a pull request for this. Do you have a proposed name for 
the utility object?

bq. We do not add fix version(s) until it has been merged into our code base.

Noted, we're quite different in Accumulo- we require fix versions for each 
ticket.

 Expose JsonRDD.allKeysWithValueTypes() in a utility class 
 --

 Key: SPARK-5260
 URL: https://issues.apache.org/jira/browse/SPARK-5260
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Corey J. Nolet

 I have found this method extremely useful when implementing my own strategy 
 for inferring a schema from parsed json. For now, I've actually copied the 
 method right out of the JsonRDD class into my own project but I think it 
 would be immensely useful to keep the code in Spark and expose it publicly 
 somewhere else- like an object called JsonSchema.






[jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class

2015-01-16 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280802#comment-14280802
 ] 

Corey J. Nolet commented on SPARK-5260:
---

bq.  you can make the change and create a pull request.

I'd love to submit a pull request for this. Do you have a proposed name for 
the utility object?

bq. We do not add fix version(s) until it has been merged into our code base.

Noted, we're quite different in Accumulo- we require fix versions for each 
ticket.

 Expose JsonRDD.allKeysWithValueTypes() in a utility class 
 --

 Key: SPARK-5260
 URL: https://issues.apache.org/jira/browse/SPARK-5260
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Corey J. Nolet

 I have found this method extremely useful when implementing my own strategy 
 for inferring a schema from parsed json. For now, I've actually copied the 
 method right out of the JsonRDD class into my own project but I think it 
 would be immensely useful to keep the code in Spark and expose it publicly 
 somewhere else- like an object called JsonSchema.






[jira] [Comment Edited] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class

2015-01-16 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280802#comment-14280802
 ] 

Corey J. Nolet edited comment on SPARK-5260 at 1/16/15 8:52 PM:


bq.  you can make the change and create a pull request.

I'd love to submit a pull request for this. Do you have a proposed name for the 
utility object?

bq. We do not add fix version(s) until it has been merged into our code base.

Noted. We're quite different in Accumulo- we require fix versions for each 
ticket.


was (Author: sonixbp):
bq.  you can make the change and create a pull request.

I've love to submit a pull request for this. Do you have a proposed name for 
the utility object?

bq. We do not add fix version(s) until it has been merged into our code base.

Noted. We're quite different in Accumulo- we require fix versions for each 
ticket.

 Expose JsonRDD.allKeysWithValueTypes() in a utility class 
 --

 Key: SPARK-5260
 URL: https://issues.apache.org/jira/browse/SPARK-5260
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Corey J. Nolet

 I have found this method extremely useful when implementing my own strategy 
 for inferring a schema from parsed json. For now, I've actually copied the 
 method right out of the JsonRDD class into my own project but I think it 
 would be immensely useful to keep the code in Spark and expose it publicly 
 somewhere else- like an object called JsonSchema.






[jira] [Created] (SPARK-5296) Predicate Pushdown (BaseRelation) to have an interface that will accept OR filters

2015-01-16 Thread Corey J. Nolet (JIRA)
Corey J. Nolet created SPARK-5296:
-

 Summary: Predicate Pushdown (BaseRelation) to have an interface 
that will accept OR filters
 Key: SPARK-5296
 URL: https://issues.apache.org/jira/browse/SPARK-5296
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Corey J. Nolet


Currently, the BaseRelation API allows a FilteredRelation to handle an 
Array[Filter] which represents filter expressions that are applied as an AND 
operator.

We should support OR operations in a BaseRelation as well. I'm not sure what 
this would look like in terms of API changes, but it almost seems like a 
FilteredUnionedScan BaseRelation (the name stinks but you get the idea) would 
be useful.






[jira] [Updated] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class

2015-01-14 Thread Corey J. Nolet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corey J. Nolet updated SPARK-5260:
--
Description: I have found this method extremely useful when implementing my 
own strategy for inferring a schema from parsed json. For now, I've actually 
copied the method right out of the JsonRDD class into my own project but I 
think it would be immensely useful to keep the code in Spark and expose it 
publicly somewhere else- like an object called JsonSchema.  (was: I have found 
this method extremely useful when implementing my own method for inferring a 
schema from parsed json. For now, I've actually copied the method right out of 
the JsonRDD class into my own project but I think it would be immensely useful 
to keep the code in Spark and expose it publicly somewhere else- like an object 
called JsonSchema.)

 Expose JsonRDD.allKeysWithValueTypes() in a utility class 
 --

 Key: SPARK-5260
 URL: https://issues.apache.org/jira/browse/SPARK-5260
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Corey J. Nolet
 Fix For: 1.3.0


 I have found this method extremely useful when implementing my own strategy 
 for inferring a schema from parsed json. For now, I've actually copied the 
 method right out of the JsonRDD class into my own project but I think it 
 would be immensely useful to keep the code in Spark and expose it publicly 
 somewhere else- like an object called JsonSchema.






[jira] [Created] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases

2015-01-07 Thread Corey J. Nolet (JIRA)
Corey J. Nolet created SPARK-5140:
-

 Summary: Two RDDs which are scheduled concurrently should be able 
to wait on parent in all cases
 Key: SPARK-5140
 URL: https://issues.apache.org/jira/browse/SPARK-5140
 Project: Spark
  Issue Type: New Feature
Reporter: Corey J. Nolet
 Fix For: 1.3.0, 1.2.1


Not sure if this would change too much of the internals to be included in the 
1.2.1 but it would be very helpful if it could be.

This ticket is from a discussion between myself and [~ilikerps]. Here's the 
result of some testing that [~ilikerps] did:

<pre>
I did some testing as well, and it turns out the "wait for other guy to finish 
caching" logic is on a per-task basis, and it only works on tasks that happen 
to be executing on the same machine. 

Once a partition is cached, we will schedule tasks that touch that partition on 
that executor. The problem here, though, is that the cache is in progress, and 
so the tasks are still scheduled randomly (or with whatever locality the data 
source has), so tasks which end up on different machines will not see that the 
cache is already in progress.

Here was my test, by the way:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.concurrent.duration._

val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache()
val futures = (0 until 4).map { _ => Future { rdd.count } }
Await.result(Future.sequence(futures), 120.second)

Note that I run the future 4 times in parallel. I found that the first run has 
all tasks take 10 seconds. The second has about 50% of its tasks take 10 
seconds, and the rest just wait for the first stage to finish. The last two 
runs have no tasks that take 10 seconds; all wait for the first two stages to 
finish.
</pre>


What we want is the ability to fire off a job and have the DAG figure out that 
two RDDs depend on the same parent so that when the children are scheduled 
concurrently, the first one to start will activate the parent and both will 
wait on the parent. When the parent is done, they will both be able to finish 
their work concurrently. We are trying to use this pattern by having the parent 
cache results.






[jira] [Updated] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases

2015-01-07 Thread Corey J. Nolet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corey J. Nolet updated SPARK-5140:
--
Description: 
Not sure if this would change too much of the internals to be included in the 
1.2.1 but it would be very helpful if it could be.

This ticket is from a discussion between myself and [~ilikerps]. Here's the 
result of some testing that [~ilikerps] did:

{code}

I did some testing as well, and it turns out the "wait for other guy to finish 
caching" logic is on a per-task basis, and it only works on tasks that happen 
to be executing on the same machine. 

Once a partition is cached, we will schedule tasks that touch that partition on 
that executor. The problem here, though, is that the cache is in progress, and 
so the tasks are still scheduled randomly (or with whatever locality the data 
source has), so tasks which end up on different machines will not see that the 
cache is already in progress.

Here was my test, by the way:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.concurrent.duration._

val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache()
val futures = (0 until 4).map { _ => Future { rdd.count } }
Await.result(Future.sequence(futures), 120.second)

Note that I run the future 4 times in parallel. I found that the first run has 
all tasks take 10 seconds. The second has about 50% of its tasks take 10 
seconds, and the rest just wait for the first stage to finish. The last two 
runs have no tasks that take 10 seconds; all wait for the first two stages to 
finish.
{code}


What we want is the ability to fire off a job and have the DAG figure out that 
two RDDs depend on the same parent so that when the children are scheduled 
concurrently, the first one to start will activate the parent and both will 
wait on the parent. When the parent is done, they will both be able to finish 
their work concurrently. We are trying to use this pattern by having the parent 
cache results.

  was:
Not sure if this would change too much of the internals to be included in the 
1.2.1 but it would be very helpful if it could be.

This ticket is from a discussion between myself and [~ilikerps]. Here's the 
result of some testing that [~ilikerps] did:

{pre}

I did some testing as well, and it turns out the "wait for other guy to finish 
caching" logic is on a per-task basis, and it only works on tasks that happen 
to be executing on the same machine. 

Once a partition is cached, we will schedule tasks that touch that partition on 
that executor. The problem here, though, is that the cache is in progress, and 
so the tasks are still scheduled randomly (or with whatever locality the data 
source has), so tasks which end up on different machines will not see that the 
cache is already in progress.

Here was my test, by the way:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.concurrent.duration._

val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache()
val futures = (0 until 4).map { _ => Future { rdd.count } }
Await.result(Future.sequence(futures), 120.second)

Note that I run the future 4 times in parallel. I found that the first run has 
all tasks take 10 seconds. The second has about 50% of its tasks take 10 
seconds, and the rest just wait for the first stage to finish. The last two 
runs have no tasks that take 10 seconds; all wait for the first two stages to 
finish.
{pre}


What we want is the ability to fire off a job and have the DAG figure out that 
two RDDs depend on the same parent so that when the children are scheduled 
concurrently, the first one to start will activate the parent and both will 
wait on the parent. When the parent is done, they will both be able to finish 
their work concurrently. We are trying to use this pattern by having the parent 
cache results.


 Two RDDs which are scheduled concurrently should be able to wait on parent in 
 all cases
 ---

 Key: SPARK-5140
 URL: https://issues.apache.org/jira/browse/SPARK-5140
 Project: Spark
  Issue Type: New Feature
Reporter: Corey J. Nolet
  Labels: features
 Fix For: 1.3.0, 1.2.1


 Not sure if this would change too much of the internals to be included in the 
 1.2.1 but it would be very helpful if it could be.
 This ticket is from a discussion between myself and [~ilikerps]. Here's the 
 result of some testing that [~ilikerps] did:
 {code}
 I did some testing as well, and it turns out the "wait for other guy to 
 finish caching" logic is on a per-task basis, and it only works on tasks that 
 happen to be executing on the same machine. 
 Once a partition is cached, we will schedule tasks that touch that partition 
 on that 

[jira] [Updated] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases

2015-01-07 Thread Corey J. Nolet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corey J. Nolet updated SPARK-5140:
--
Description: 
Not sure if this would change too much of the internals to be included in the 
1.2.1 but it would be very helpful if it could be.

This ticket is from a discussion between myself and [~ilikerps]. Here's the 
result of some testing that [~ilikerps] did:

{pre}

I did some testing as well, and it turns out the "wait for other guy to finish 
caching" logic is on a per-task basis, and it only works on tasks that happen 
to be executing on the same machine. 

Once a partition is cached, we will schedule tasks that touch that partition on 
that executor. The problem here, though, is that the cache is in progress, and 
so the tasks are still scheduled randomly (or with whatever locality the data 
source has), so tasks which end up on different machines will not see that the 
cache is already in progress.

Here was my test, by the way:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.concurrent.duration._

val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache()
val futures = (0 until 4).map { _ => Future { rdd.count } }
Await.result(Future.sequence(futures), 120.second)

Note that I run the future 4 times in parallel. I found that the first run has 
all tasks take 10 seconds. The second has about 50% of its tasks take 10 
seconds, and the rest just wait for the first stage to finish. The last two 
runs have no tasks that take 10 seconds; all wait for the first two stages to 
finish.
{pre}


What we want is the ability to fire off a job and have the DAG figure out that 
two RDDs depend on the same parent so that when the children are scheduled 
concurrently, the first one to start will activate the parent and both will 
wait on the parent. When the parent is done, they will both be able to finish 
their work concurrently. We are trying to use this pattern by having the parent 
cache results.

  was:
Not sure if this would change too much of the internals to be included in the 
1.2.1 but it would be very helpful if it could be.

This ticket is from a discussion between myself and [~ilikerps]. Here's the 
result of some testing that [~ilikerps] did:

<pre>
I did some testing as well, and it turns out the "wait for other guy to finish 
caching" logic is on a per-task basis, and it only works on tasks that happen 
to be executing on the same machine. 

Once a partition is cached, we will schedule tasks that touch that partition on 
that executor. The problem here, though, is that the cache is in progress, and 
so the tasks are still scheduled randomly (or with whatever locality the data 
source has), so tasks which end up on different machines will not see that the 
cache is already in progress.

Here was my test, by the way:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.concurrent.duration._

val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache()
val futures = (0 until 4).map { _ => Future { rdd.count } }
Await.result(Future.sequence(futures), 120.second)

Note that I run the future 4 times in parallel. I found that the first run has 
all tasks take 10 seconds. The second has about 50% of its tasks take 10 
seconds, and the rest just wait for the first stage to finish. The last two 
runs have no tasks that take 10 seconds; all wait for the first two stages to 
finish.
</pre>


What we want is the ability to fire off a job and have the DAG figure out that 
two RDDs depend on the same parent so that when the children are scheduled 
concurrently, the first one to start will activate the parent and both will 
wait on the parent. When the parent is done, they will both be able to finish 
their work concurrently. We are trying to use this pattern by having the parent 
cache results.


 Two RDDs which are scheduled concurrently should be able to wait on parent in 
 all cases
 ---

 Key: SPARK-5140
 URL: https://issues.apache.org/jira/browse/SPARK-5140
 Project: Spark
  Issue Type: New Feature
Reporter: Corey J. Nolet
  Labels: features
 Fix For: 1.3.0, 1.2.1


 Not sure if this would change too much of the internals to be included in the 
 1.2.1 but it would be very helpful if it could be.
 This ticket is from a discussion between myself and [~ilikerps]. Here's the 
 result of some testing that [~ilikerps] did:
 {pre}
 I did some testing as well, and it turns out the "wait for other guy to 
 finish caching" logic is on a per-task basis, and it only works on tasks that 
 happen to be executing on the same machine. 
 Once a partition is cached, we will schedule tasks that touch that partition 
 on that executor. 

[jira] [Updated] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases

2015-01-07 Thread Corey J. Nolet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corey J. Nolet updated SPARK-5140:
--
Description: 
Not sure if this would change too much of the internals to be included in the 
1.2.1 but it would be very helpful if it could be.

This ticket is from a discussion between myself and [~ilikerps]. Here's the 
result of some testing that [~ilikerps] did:


bq. I did some testing as well, and it turns out the "wait for other guy to 
finish caching" logic is on a per-task basis, and it only works on tasks that 
happen to be executing on the same machine. 

bq. Once a partition is cached, we will schedule tasks that touch that 
partition on that executor. The problem here, though, is that the cache is in 
progress, and so the tasks are still scheduled randomly (or with whatever 
locality the data source has), so tasks which end up on different machines will 
not see that the cache is already in progress.

{code}
Here was my test, by the way:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.concurrent.duration._

val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache()
val futures = (0 until 4).map { _ => Future { rdd.count } }
Await.result(Future.sequence(futures), 120.second)
{code}

bq. Note that I run the future 4 times in parallel. I found that the first run 
has all tasks take 10 seconds. The second has about 50% of its tasks take 10 
seconds, and the rest just wait for the first stage to finish. The last two 
runs have no tasks that take 10 seconds; all wait for the first two stages to 
finish.
{code}


What we want is the ability to fire off a job and have the DAG figure out that 
two RDDs depend on the same parent so that when the children are scheduled 
concurrently, the first one to start will activate the parent and both will 
wait on the parent. When the parent is done, they will both be able to finish 
their work concurrently. We are trying to use this pattern by having the parent 
cache results.

  was:
Not sure if this would change too much of the internals to be included in the 
1.2.1 but it would be very helpful if it could be.

This ticket is from a discussion between myself and [~ilikerps]. Here's the 
result of some testing that [~ilikerps] did:

{code}

I did some testing as well, and it turns out the "wait for other guy to finish 
caching" logic is on a per-task basis, and it only works on tasks that happen 
to be executing on the same machine. 

Once a partition is cached, we will schedule tasks that touch that partition on 
that executor. The problem here, though, is that the cache is in progress, and 
so the tasks are still scheduled randomly (or with whatever locality the data 
source has), so tasks which end up on different machines will not see that the 
cache is already in progress.

Here was my test, by the way:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.concurrent.duration._

val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache()
val futures = (0 until 4).map { _ => Future { rdd.count } }
Await.result(Future.sequence(futures), 120.second)

Note that I run the future 4 times in parallel. I found that the first run has 
all tasks take 10 seconds. The second has about 50% of its tasks take 10 
seconds, and the rest just wait for the first stage to finish. The last two 
runs have no tasks that take 10 seconds; all wait for the first two stages to 
finish.
{code}


What we want is the ability to fire off a job and have the DAG figure out that 
two RDDs depend on the same parent so that when the children are scheduled 
concurrently, the first one to start will activate the parent and both will 
wait on the parent. When the parent is done, they will both be able to finish 
their work concurrently. We are trying to use this pattern by having the parent 
cache results.


 Two RDDs which are scheduled concurrently should be able to wait on parent in 
 all cases
 ---

 Key: SPARK-5140
 URL: https://issues.apache.org/jira/browse/SPARK-5140
 Project: Spark
  Issue Type: New Feature
Reporter: Corey J. Nolet
  Labels: features
 Fix For: 1.3.0, 1.2.1


 Not sure if this would change too much of the internals to be included in the 
 1.2.1 but it would be very helpful if it could be.
 This ticket is from a discussion between myself and [~ilikerps]. Here's the 
 result of some testing that [~ilikerps] did:
 bq. I did some testing as well, and it turns out the "wait for other guy to 
 finish caching" logic is on a per-task basis, and it only works on tasks that 
 happen to be executing on the same machine. 
 bq. Once a partition is cached, we will schedule tasks that touch that 
 

[jira] [Updated] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases

2015-01-07 Thread Corey J. Nolet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corey J. Nolet updated SPARK-5140:
--
Description: 
Not sure if this would change too much of the internals to be included in the 
1.2.1 but it would be very helpful if it could be.

This ticket is from a discussion between myself and [~ilikerps]. Here's the 
result of some testing that [~ilikerps] did:


bq. I did some testing as well, and it turns out the "wait for other guy to 
finish caching" logic is on a per-task basis, and it only works on tasks that 
happen to be executing on the same machine. 

bq. Once a partition is cached, we will schedule tasks that touch that 
partition on that executor. The problem here, though, is that the cache is in 
progress, and so the tasks are still scheduled randomly (or with whatever 
locality the data source has), so tasks which end up on different machines will 
not see that the cache is already in progress.

{code}
Here was my test, by the way:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.concurrent.duration._

val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache()
val futures = (0 until 4).map { _ => Future { rdd.count } }
Await.result(Future.sequence(futures), 120.second)
{code}

bq. Note that I run the future 4 times in parallel. I found that the first run 
has all tasks take 10 seconds. The second has about 50% of its tasks take 10 
seconds, and the rest just wait for the first stage to finish. The last two 
runs have no tasks that take 10 seconds; all wait for the first two stages to 
finish.


What we want is the ability to fire off a job and have the DAG figure out that 
two RDDs depend on the same parent so that when the children are scheduled 
concurrently, the first one to start will activate the parent and both will 
wait on the parent. When the parent is done, they will both be able to finish 
their work concurrently. We are trying to use this pattern by having the parent 
cache results.

  was:
Not sure if this would change too much of the internals to be included in the 
1.2.1 but it would be very helpful if it could be.

This ticket is from a discussion between myself and [~ilikerps]. Here's the 
result of some testing that [~ilikerps] did:


bq. I did some testing as well, and it turns out the wait for other guy to 
finish caching logic is on a per-task basis, and it only works on tasks that 
happen to be executing on the same machine. 

bq. Once a partition is cached, we will schedule tasks that touch that 
partition on that executor. The problem here, though, is that the cache is in 
progress, and so the tasks are still scheduled randomly (or with whatever 
locality the data source has), so tasks which end up on different machines will 
not see that the cache is already in progress.

{code}
Here was my test, by the way:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.concurrent.duration._

val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache()
val futures = (0 until 4).map { _ => Future { rdd.count } }
Await.result(Future.sequence(futures), 120.second)
{code}

bq. Note that I run the future 4 times in parallel. I found that the first run 
has all tasks take 10 seconds. The second has about 50% of its tasks take 10 
seconds, and the rest just wait for the first stage to finish. The last two 
runs have no tasks that take 10 seconds; all wait for the first two stages to 
finish.
{code}


What we want is the ability to fire off a job and have the DAG figure out that 
two RDDs depend on the same parent so that when the children are scheduled 
concurrently, the first one to start will activate the parent and both will 
wait on the parent. When the parent is done, they will both be able to finish 
their work concurrently. We are trying to use this pattern by having the parent 
cache results.


 Two RDDs which are scheduled concurrently should be able to wait on parent in 
 all cases
 ---

 Key: SPARK-5140
 URL: https://issues.apache.org/jira/browse/SPARK-5140
 Project: Spark
  Issue Type: New Feature
Reporter: Corey J. Nolet
  Labels: features
 Fix For: 1.3.0, 1.2.1


 Not sure if this would change too much of the internals to be included in the 
 1.2.1 but it would be very helpful if it could be.
 This ticket is from a discussion between myself and [~ilikerps]. Here's the 
 result of some testing that [~ilikerps] did:
 bq. I did some testing as well, and it turns out the wait for other guy to 
 finish caching logic is on a per-task basis, and it only works on tasks that 
 happen to be executing on the same machine. 
 bq. Once a partition is cached, we will schedule tasks that 

[jira] [Commented] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object

2014-11-12 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208084#comment-14208084
 ] 

Corey J. Nolet commented on SPARK-4320:
---

Since this is a simple change, I wanted to work on it myself to get more 
familiar with the code base. Could someone with the proper privileges grant me 
access so I can assign this ticket to myself?

 JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object 
 

 Key: SPARK-4320
 URL: https://issues.apache.org/jira/browse/SPARK-4320
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, Spark Core
Reporter: Corey J. Nolet
 Fix For: 1.1.1, 1.2.0


 I am outputting data to Accumulo using a custom OutputFormat. I have tried 
 using saveAsNewHadoopFile() and that works- though passing an empty path is a 
 bit weird. Being that it isn't really a file I'm storing, but rather a  
 generic Pair dataset, I'd be inclined to use the saveAsHadoopDataset() 
 method, though I'm not at all interested in using the legacy mapred API.
 Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think 
 there should be two ways of calling into this method. Instead of forcing the 
 user to always set up the Job object explicitly, I'm in the camp of having 
 the following method signature:
 saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : 
 Class[? extends OutputFormat], conf : Configuration). This way, if I'm 
 writing spark jobs that are going from Hadoop back into Hadoop, I can 
 construct my Configuration once.
 Perhaps an overloaded method signature could be:
 saveAsNewHadoopDataset(job : Job)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object

2014-11-10 Thread Corey J. Nolet (JIRA)
Corey J. Nolet created SPARK-4320:
-

 Summary: JavaPairRDD should supply a saveAsNewHadoopDataset which 
takes a Job object 
 Key: SPARK-4320
 URL: https://issues.apache.org/jira/browse/SPARK-4320
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, Spark Core
Reporter: Corey J. Nolet
 Fix For: 1.1.1, 1.2.0


I am outputting data to Accumulo using a custom outputformat. I have tried 
using saveAsNewHadoopFile() and that works- though passing an empty path is a 
bit weird. Being that it isn't really a file I'm storing, but rather a dataset, 
I'd be inclined to use the saveAsHadoopDataset() method, though I'm not at all 
interested in using the legacy mapred API.

Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think 
there should be two ways of calling into this method. Instead of needing to set 
up the Job object explicitly, I'm in the camp of having the following method 
signature:

saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : 
Class[? extends OutputFormat], conf : Configuration). This way, if I'm writing 
spark jobs that are going from Hadoop back into Hadoop, I can construct my 
Configuration once.

Perhaps an overloaded method signature could be:

saveAsNewHadoopDataset(job : Job)
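
For illustration, a rough sketch of the two overloads being proposed (these 
signatures are hypothetical and do not exist in Spark; they only spell out the 
two calling styles described above):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{Job, OutputFormat}

// Hypothetical trait standing in for JavaPairRDD/PairRDDFunctions, purely to
// show the shapes of the two proposed methods.
trait ProposedSaveAsNewHadoopDataset[K, V] {

  // Style 1: pass the key/value classes, the OutputFormat, and a Configuration
  // the caller may already have from an existing Hadoop pipeline.
  def saveAsNewHadoopDataset(
      keyClass: Class[K],
      valueClass: Class[V],
      outputFormatClass: Class[_ <: OutputFormat[K, V]],
      conf: Configuration): Unit

  // Style 2: hand over a fully configured Job object directly.
  def saveAsNewHadoopDataset(job: Job): Unit
}
{code}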




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object

2014-11-10 Thread Corey J. Nolet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corey J. Nolet updated SPARK-4320:
--
Description: 
I am outputting data to Accumulo using a custom OutputFormat. I have tried 
using saveAsNewHadoopFile() and that works- though passing an empty path is a 
bit weird. Being that it isn't really a file I'm storing, but rather a  generic 
Pair dataset, I'd be inclined to use the saveAsHadoopDataset() method, though 
I'm not at all interested in using the legacy mapred API.

Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think 
there should be two ways of calling into this method. Instead of forcing the 
user to always set up the Job object explicitly, I'm in the camp of having the 
following method signature:

saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : 
Class[? extends OutputFormat], conf : Configuration). This way, if I'm writing 
spark jobs that are going from Hadoop back into Hadoop, I can construct my 
Configuration once.

Perhaps an overloaded method signature could be:

saveAsNewHadoopDataset(job : Job)


  was:
I am outputting data to Accumulo using a custom outputformat. I have tried 
using saveAsNewHadoopFile() and that works- though passing an empty path is a 
bit weird. Being that it isn't really a file I'm storing, but rather a dataset, 
I'd be inclined to use the saveAsHadoopDataset() method, though I'm not at all 
interested in using the legacy mapred API.

Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think 
there should be two ways of calling into this method. Instead of needing to set 
up the Job object explicitly, I'm in the camp of having the following method 
signature:

saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : 
Class[? extends OutputFormat], conf : Configuration). This way, if I'm writing 
spark jobs that are going from Hadoop back into Hadoop, I can construct my 
Configuration once.

Perhaps an overloaded method signature could be:

saveAsNewHadoopDataset(job : Job)



 JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object 
 

 Key: SPARK-4320
 URL: https://issues.apache.org/jira/browse/SPARK-4320
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, Spark Core
Reporter: Corey J. Nolet
 Fix For: 1.1.1, 1.2.0


 I am outputting data to Accumulo using a custom OutputFormat. I have tried 
 using saveAsNewHadoopFile() and that works- though passing an empty path is a 
 bit weird. Being that it isn't really a file I'm storing, but rather a  
 generic Pair dataset, I'd be inclined to use the saveAsHadoopDataset() 
 method, though I'm not at all interested in using the legacy mapred API.
 Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think 
 there should be two ways of calling into this method. Instead of forcing the 
 user to always set up the Job object explicitly, I'm in the camp of having 
 the following method signature:
 saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : 
 Class[? extends OutputFormat], conf : Configuration). This way, if I'm 
 writing spark jobs that are going from Hadoop back into Hadoop, I can 
 construct my Configuration once.
 Perhaps an overloaded method signature could be:
 saveAsNewHadoopDataset(job : Job)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.

2014-11-07 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203183#comment-14203183
 ] 

Corey J. Nolet commented on SPARK-4289:
---

I suppose we could look at it as a Hadoop issue, though constructing a Job works 
fine as long as the Scala shell doesn't call toString() on it. I'd have to dive 
in deeper to find out why the state seems to differ between construction and the 
toString() call, and, more importantly, why toString() cares...

I think :silent will work for the short term.
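
For illustration, a minimal shell sketch of that short-term workaround (it 
relies on :silent being a toggle for the REPL's automatic result printing, so 
the shell never calls toString() on the freshly constructed Job):
{code}
scala> :silent
scala> val job = new Job(sc.hadoopConfiguration)   // no result printing, so no toString()
scala> :silent                                     // toggle printing back on
{code}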

 Creating an instance of Hadoop Job fails in the Spark shell when toString() 
 is called on the instance.
 --

 Key: SPARK-4289
 URL: https://issues.apache.org/jira/browse/SPARK-4289
 Project: Spark
  Issue Type: Bug
Reporter: Corey J. Nolet

 This one is easy to reproduce.
 <pre>val job = new Job(sc.hadoopConfiguration)</pre>
 I'm not sure what the solution would be off hand as it's happening when the 
 shell is calling toString() on the instance of Job. The problem is, because 
 of the failure, the instance is never actually assigned to the job val.
 java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
   at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
   at org.apache.hadoop.mapreduce.Job.toString(Job.java:452)
   at 
 scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
   at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
   at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
   at .<init>(<console>:10)
   at .<clinit>(<console>)
   at $print(<console>)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.

2014-11-06 Thread Corey J. Nolet (JIRA)
Corey J. Nolet created SPARK-4289:
-

 Summary: Creating an instance of Hadoop Job fails in the Spark 
shell when toString() is called on the instance.
 Key: SPARK-4289
 URL: https://issues.apache.org/jira/browse/SPARK-4289
 Project: Spark
  Issue Type: Bug
Reporter: Corey J. Nolet


This one is easy to reproduce.

<pre>val job = new Job(sc.hadoopConfiguration)</pre>

I'm not sure what the solution would be off hand as it's happening when the 
shell is calling toString() on the instance of Job. The problem is, because of 
the failure, the instance is never actually assigned to the job val.

java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
at org.apache.hadoop.mapreduce.Job.toString(Job.java:452)
at 
scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
at .<init>(<console>:10)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org