I don't think countApprox is appropriate here unless approximation is OK. But more generally, counting everything matching a filter requires applying the filter to the whole data set, which seems like the thing to be avoided here.
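For illustration, here is a rough sketch of the filter-plus-take check discussed below; the names atLeastN, data and qualifyingFunction are placeholders for this thread, not existing Spark API:

    import org.apache.spark.rdd.RDD

    // Sketch only: atLeastN, data and qualifyingFunction are illustrative names.
    // take(n) launches jobs over only as many partitions as it needs to collect
    // n matching elements, so the filter is not applied to later partitions once
    // enough matches have been found.
    def atLeastN[T](data: RDD[T])(qualifyingFunction: T => Boolean, n: Int): Boolean =
      data.filter(qualifyingFunction).take(n).length >= n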
The take approach is better since it would stop after finding n matching elements (it might do a little extra work given partitioning and buffering). It would not filter the whole data set. The only downside there is that it would copy n elements to the driver.

On Wed, Aug 5, 2015 at 10:34 AM, Sandeep Giri <sand...@knowbigdata.com> wrote:

> Hi Jonathan,
>
> Does that guarantee a result? I do not see how it is really optimized.
>
> Hi Carsten,
>
> How does the following code work:
>
>     data.filter(qualifying_function).take(n).count() >= n
>
> Also, as per my understanding, in both of the approaches you mentioned,
> the qualifying function will be executed on the whole dataset even if
> the value was already found in the first element of the RDD:
>
> - data.filter(qualifying_function).take(n).count() >= n
> - val contains1MatchingElement = !(data.filter(qualifying_function).isEmpty())
>
> Isn't it? Am I missing something?
>
> Regards,
> Sandeep Giri,
> +1 347 781 4573 (US)
> +91-953-899-8962 (IN)
>
> www.KnowBigData.com
> Phone: +1-253-397-1945 (Office)
>
> On Fri, Jul 31, 2015 at 3:37 PM, Jonathan Winandy
> <jonathan.wina...@gmail.com> wrote:
>
>> Hello!
>>
>> You could try something like this:
>>
>>     def exists[T](rdd: RDD[T])(f: T => Boolean, n: Int): Boolean = {
>>       rdd.filter(f).countApprox(timeout = 10000).getFinalValue().low > n
>>     }
>>
>> It would work for large datasets and large values of n.
>>
>> Have a nice day,
>>
>> Jonathan
>>
>> On 31 July 2015 at 11:29, Carsten Schnober
>> <schno...@ukp.informatik.tu-darmstadt.de> wrote:
>>
>>> Hi,
>>> the RDD class does not have an exists() method (in the Scala API), but
>>> the functionality you need seems easy to assemble from the existing
>>> methods:
>>>
>>>     val containsNMatchingElements =
>>>       data.filter(qualifying_function).take(n).count() >= n
>>>
>>> Note: I am not sure whether the intermediate take(n) really increases
>>> performance, but the idea is to arbitrarily reduce the number of
>>> elements in the RDD before counting, because we are not interested in
>>> the full count.
>>>
>>> If you need to check specifically whether there is at least one
>>> matching occurrence, it is probably preferable to use isEmpty() instead
>>> of count() and check whether the result is false:
>>>
>>>     val contains1MatchingElement =
>>>       !(data.filter(qualifying_function).isEmpty())
>>>
>>> Best,
>>> Carsten
>>>
>>> On 31.07.2015 at 11:11, Sandeep Giri wrote:
>>> > Dear Spark Dev Community,
>>> >
>>> > I am wondering if there is already a function to solve my problem.
>>> > If not, then should I work on this?
>>> >
>>> > Say you just want to check whether a word exists in a huge text
>>> > file. I could not find better ways than those mentioned here:
>>> > <http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-2#q6>
>>> >
>>> > So, I was proposing that we add a function called exists to RDD with
>>> > the following signature:
>>> >
>>> > # returns true if n elements exist which qualify our criteria.
>>> > # the qualifying function would receive the element and its index
>>> > # and return true or false.
>>> > def exists(qualifying_function, n):
>>> >     ....
>>> >
>>> > Regards,
>>> > Sandeep Giri,
>>> > +1 347 781 4573 (US)
>>> > +91-953-899-8962 (IN)
>>> >
>>> > www.KnowBigData.com
>>> > Phone: +1-253-397-1945 (Office)
>>>
>>> --
>>> Carsten Schnober
>>> Doctoral Researcher
>>> Ubiquitous Knowledge Processing (UKP) Lab
>>> FB 20 / Computer Science Department
>>> Technische Universität Darmstadt
>>> Hochschulstr. 10, D-64289 Darmstadt, Germany
>>> phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
>>> schno...@ukp.informatik.tu-darmstadt.de
>>> www.ukp.tu-darmstadt.de
>>>
>>> Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
>>> GRK 1994: Adaptive Preparation of Information from Heterogeneous
>>> Sources (AIPHES): www.aiphes.tu-darmstadt.de
>>> PhD program: Knowledge Discovery in Scientific Literature (KDSL)
>>> www.kdsl.tu-darmstadt.de
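PS: if something along the lines of the proposed exists(qualifying_function, n) were added, a rough sketch could look like the following. The names are illustrative only, and this is not existing RDD API; zipWithIndex is used here only because the proposal wants the predicate to see the element together with its index, and note that it triggers one extra job to compute partition offsets:

    import org.apache.spark.rdd.RDD

    // Illustrative sketch of the proposed exists(qualifying_function, n);
    // not part of the current RDD API.
    def exists[T](rdd: RDD[T])(qualifies: (T, Long) => Boolean, n: Int): Boolean =
      rdd.zipWithIndex()                                      // (element, index)
         .filter { case (elem, idx) => qualifies(elem, idx) }
         .take(n)                                             // stop after n qualifying elements
         .length >= n

Usage would then look something like exists(lines)((line, idx) => line.contains("spark"), 1), with lines standing in for whatever RDD is being checked.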