Re: optimize multiple filter operations

2014-11-29 Thread Imran Rashid
Rishi's approach will work, but it's worth mentioning that because all of
the data goes into only two groups, you will process the resulting data
with only two tasks, so you lose almost all parallelism.  Presumably
you're processing a lot of data (since you want to do only one pass), so I
doubt that would actually be helpful.

Unfortunately, I don't think there is currently a better approach than
doing two passes.  Given some more info about the downstream processing,
there may be alternatives, but in general I think you are stuck.

E.g., here's a slight variation on Rishi's proposal, which may or may not
work:

initial.groupBy { x => ((if (x == something) key1 else key2), util.Random.nextInt(500)) }

which splits the data by a compound key -- first a label for whether or
not the element matches, and then a random subdivision into another 500
groups.  This will result in nicely balanced tasks within each group, but
it also results in a shuffle of all the data, which can be pretty
expensive.  You might be better off just doing two passes over the raw data.
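
In case it's useful, here is the idea above as a self-contained (but
untested) sketch -- note that "key1"/"key2" are just stand-in string
labels, and the value of something is a placeholder for your real
predicate:

import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}

object CompoundKeySplit {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("compound-key-split").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val something = 42                          // placeholder predicate value
    val initial = sc.parallelize(1 to 1000000)

    // Compound key: (label, random bucket in [0, 500)), so each side is
    // spread across up to 500 groups instead of collapsing into one task.
    val grouped = initial.groupBy { x =>
      (if (x == something) "key1" else "key2", Random.nextInt(500))
    }

    // Recover the two sides by filtering on the label half of the key.
    val set1 = grouped.filter { case ((label, _), _) => label == "key1" }.flatMap(_._2)
    val set2 = grouped.filter { case ((label, _), _) => label == "key2" }.flatMap(_._2)

    println(s"set1: ${set1.count()}, set2: ${set2.count()}")
    sc.stop()
  }
}

The random component of the key only balances the shuffle; every record
still moves across the network, which is why two plain scans of the raw
data may end up cheaper.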

Imran

On Fri, Nov 28, 2014 at 7:08 PM, Rishi Yadav ri...@infoobjects.com wrote:

 you can try (Scala version; you can convert it to Python):

 val set = initial.groupBy(x => if (x == something) key1 else key2)

 This would do one pass over the original data.



optimize multiple filter operations

2014-11-28 Thread mrm
Hi, 

My question is:

I have multiple filter operations where I split my initial RDD into two
different groups. The two groups cover the whole initial set. In code, it's
something like:

set1 = initial.filter(lambda x: x == something)
set2 = initial.filter(lambda x: x != something)

By doing this, I am doing two passes over the data. Is there any way to
optimise this to do it in a single pass?

Note: I tried searching the mailing list to see whether this question had
already been asked, but could not find it.






Re: optimize multiple filter operations

2014-11-28 Thread Rishi Yadav
you can try (Scala version; you can convert it to Python):

val set = initial.groupBy(x => if (x == something) key1 else key2)

This would do one pass over the original data.
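
If you then need the two subsets back out separately, you would still
filter on the key afterwards -- a rough sketch, assuming key1/key2 are the
string labels "key1"/"key2" and something stands in for the predicate
value from the original question:

val set = initial.groupBy(x => if (x == something) "key1" else "key2")
val set1 = set.filter(_._1 == "key1").flatMap(_._2)  // matching elements
val set2 = set.filter(_._1 == "key2").flatMap(_._2)  // everything else

Both set1 and set2 are RDDs of the original element type again.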
