Hi Jonathan,
Does that guarantee a result? I do not see that it is really optimized.
Hi Carsten,
How does the following code work:
data.filter(qualifying_function).take(n).length == n
Also, as per my understanding, in both the approaches you mentioned the
qualifying function will be
I don't think countApprox is appropriate here unless approximation is OK.
But more generally, counting everything matching a filter requires applying
the filter to the whole data set, which seems like the thing to be avoided
here.
The take approach is better since it would stop after finding n matching elements.
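For concreteness, a minimal sketch of that take-based check (a paraphrase, not code from the thread; data, qualifying_function, and n are the names used above):

import org.apache.spark.rdd.RDD

// True if at least n elements satisfy the predicate. take(n) scans partitions
// incrementally, so the filter is not applied to the whole data set once n
// matches have been found.
def atLeastN[T](data: RDD[T], qualifying_function: T => Boolean, n: Int): Boolean =
  data.filter(qualifying_function).take(n).length == n

With n = 1 this reduces to a plain existence check (equivalent to !data.filter(qualifying_function).isEmpty() on Spark 1.3+).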
Yes, but in the take() approach we will be bringing the data to the driver,
and it is no longer distributed.
Also, take() takes only a count as its argument, which means that every time
we would be transferring redundant elements.
Regards,
Sandeep Giri,
+1 347 781 4573 (US)
+91-953-899-8962 (IN)
take only brings n elements to the driver, which is probably still a win if
n is small. I'm not sure what you mean by only taking a count argument --
what else would be an arg to take?
On Wed, Aug 5, 2015 at 4:49 PM, Sandeep Giri sand...@knowbigdata.com
wrote:
Yes, but in the take() approach we
Okay, I think I got it now. Yes, take() does not need to be called more than
once. I got the impression that we wanted to bring elements to the driver
node and then run our qualifying_function on the driver node.
Now, I am back to my question which I started with: Could there be an
approach where the
Following up on this thread to see if anyone has some thoughts or opinions on
the mentioned approach.
Guru Medasani
gdm...@gmail.com
On Aug 3, 2015, at 10:20 PM, Guru Medasani gdm...@gmail.com wrote:
Hi,
I was looking at the spark-submit and spark-shell --help on both (Spark
1.3.1
qualifying_function() will be executed on each partition in parallel;
stopping all parallel execution after the first instance that satisfies
qualifying_function() would effectively make the computation sequential.
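A middle ground (a sketch under assumed names rdd and qualifying_function, not something proposed in the thread) is to let every partition evaluate the predicate locally, short-circuiting within the partition, and rely on take(1) to stop scheduling further partitions once one of them reports a match:

import org.apache.spark.rdd.RDD

def existsAnywhere[T](rdd: RDD[T], qualifying_function: T => Boolean): Boolean =
  rdd.mapPartitions { iter =>
    // Iterator.exists stops at the first match inside the partition
    if (iter.exists(qualifying_function)) Iterator(true) else Iterator.empty
  }.take(1).nonEmpty

Tasks that are already running still finish their partitions, so this is not a hard stop, but it avoids scheduling partitions that are no longer needed.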
On Wed, Aug 5, 2015 at 9:05 AM, Sandeep Giri
Hello!
You could try something like this:
import scala.util.Random
import org.apache.spark.{Accumulator, SparkContext}
import org.apache.spark.rdd.RDD

def exists[T](rdd: RDD[T])(f: T => Boolean, n: Long): Boolean = {
  val context: SparkContext = rdd.sparkContext
  val grp: String = Random.alphanumeric.take(10).mkString
  context.setJobGroup(grp, "exists")
  val count: Accumulator[Long] = context.accumulator(0L) // 0L initial value assumed; the digest cuts off after this line
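  // The rest of the snippet is cut off in the digest. What follows is a hedged
  // sketch (not the original poster's code) of how the accumulator/job-group
  // idea could be finished: run the action in a separate thread, let each match
  // bump the accumulator, poll it on the driver, and cancel the job group once
  // n matches have been observed.
  import java.util.concurrent.atomic.AtomicBoolean
  val finished = new AtomicBoolean(false)
  val worker = new Thread {
    override def run(): Unit = {
      context.setJobGroup(grp, "exists") // the job group is thread-local, so set it in the thread that runs the job
      try rdd.foreach { t => if (f(t)) count += 1L }
      catch { case _: Exception => () } // a cancelled job surfaces as an exception here
      finally finished.set(true)
    }
  }
  worker.start()
  while (!finished.get && count.value < n) Thread.sleep(50) // poll the accumulator on the driver
  if (!finished.get) context.cancelJobGroup(grp) // enough matches seen: stop the remaining tasks
  worker.join()
  count.value >= n
}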
Hey All,
Was wondering if people would be willing to avoid merging build
changes until we have put the tests in better shape. The reason is
that build changes are the most likely to cause downstream issues with
the test matrix, and it's very difficult to reverse-engineer which
patches caused which failures.
+1. I've been holding off on reviewing / merging patches like the
run-tests-jenkins Python refactoring for exactly this reason.
On 8/5/15 11:24 AM, Patrick Wendell wrote:
Hey All,
Was wondering if people would be willing to avoid merging build
changes until we have put the tests in better