Re: Performance problem on collect
It did the job. Thanks. :)

On 19 Aug 2014, at 10:20, Sean Owen wrote:
> In that case, why not collectAsMap() and have the whole result as a
> simple Map in memory? Then lookups are trivial. RDDs aren't
> distributed maps.
Re: Performance problem on collect
In that case, why not collectAsMap() and have the whole result as a simple Map in memory? Then lookups are trivial. RDDs aren't distributed maps.

On Tue, Aug 19, 2014 at 9:17 AM, Emmanuel Castanier wrote:
> Thanks for your answer.
> In my case, that's a shame because we have only 60 entries in the
> final RDD; I was thinking it would be fast to get the needed one.
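A minimal sketch of the collectAsMap() approach suggested above, using a local SparkContext and made-up sample data standing in for the ~60-entry final RDD (the keys and values here are hypothetical, not from the original thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context purely for illustration; in the original setting the
// RDD already exists at the end of the computation.
val sc = new SparkContext(
  new SparkConf().setMaster("local[1]").setAppName("collectAsMap-sketch"))

// Hypothetical stand-in for the small final RDD of String -> Array[String].
val myRdd = sc.parallelize(Seq(
  "alpha" -> Array("1", "2"),
  "beta"  -> Array("3")
))

// Pull the whole (small) result to the driver once...
val asMap: scala.collection.Map[String, Array[String]] = myRdd.collectAsMap()

// ...after which every lookup is a plain in-memory Map access,
// with no Spark job per query.
val entry = asMap("alpha")

sc.stop()
```

This pays the distributed-scan cost exactly once; it is only appropriate when, as here, the final result is known to be small enough to fit on the driver.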
Re: Performance problem on collect
Thanks for your answer.
In my case, that's a shame because we have only 60 entries in the final RDD; I was thinking it would be fast to get the needed one.

On 19 Aug 2014, at 09:58, Sean Owen wrote:
> You can use the function lookup() to accomplish this too; it may be a
> bit faster.
>
> It will never be as efficient as a database lookup, since it is
> implemented by scanning through all of the data. There is no index or
> anything.
Re: Performance problem on collect
You can use the function lookup() to accomplish this too; it may be a bit faster.

It will never be as efficient as a database lookup, since it is implemented by scanning through all of the data. There is no index or anything.

On Tue, Aug 19, 2014 at 8:43 AM, Emmanuel Castanier wrote:
> At the end of the process, I have an RDD with Key(String)/Values(Array
> of String), on this I want to get only one entry like this :
>
> myRdd.filter(t => t._1.equals(param))
>
> If I make a collect to get the only tuple, it takes about 12 seconds
> to execute.
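A minimal sketch of the lookup() suggestion above, again with a local SparkContext and hypothetical sample data. lookup(key) returns a Seq of every value stored under that key; it avoids filtering and collecting the whole RDD by hand, but it still runs a job over the data rather than an indexed read:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[1]").setAppName("lookup-sketch"))

// Hypothetical sample data standing in for the final RDD.
val myRdd = sc.parallelize(Seq(
  "alpha" -> Array("1", "2"),
  "beta"  -> Array("3")
))

// lookup(key) returns Seq[V]: all values under that key.
// Compared with myRdd.filter(t => t._1.equals(param)).collect(),
// it expresses the same intent more directly, but each call is
// still a Spark job, not a constant-time map access.
val hits: Seq[Array[String]] = myRdd.lookup("beta")

sc.stop()
```

If the same small result will be queried repeatedly, collecting it once (e.g. with collectAsMap()) and querying the local copy is likely cheaper than a lookup() per query.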
Performance problem on collect
Hi all,

I'm a total newbie with Spark, so my question may be a dumb one.
I tried Spark to compute values; on that side everything works perfectly (and it's fast :) ).

At the end of the process, I have an RDD of Key (String) / Value (Array of String) pairs, from which I want to get only one entry, like this:

myRdd.filter(t => t._1.equals(param))

If I then call collect to get that single tuple, it takes about 12 seconds to execute. I imagine that's because Spark is meant to be used differently...

Best regards,

Emmanuel

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org