Thanks!

On Thu, Jul 16, 2015 at 1:59 PM Vetle Leinonen-Roeim <ve...@roeim.net> wrote:

> By the way - if you're going this route, see
> https://github.com/datastax/spark-cassandra-connector
>
> On Thu, Jul 16, 2015 at 2:40 PM Vetle Leinonen-Roeim <ve...@roeim.net>
> wrote:
>
>> You'll probably have to install it separately.
>>
>> On Thu, Jul 16, 2015 at 2:29 PM Jem Tucker <jem.tuc...@gmail.com> wrote:
>>
>>> Hi Vetle,
>>>
>>> IndexedRDD is persisted in the same way RDDs are, as far as I am aware.
>>> Do you know whether Cassandra can be built into my application, or does it
>>> have to be a standalone database that is installed separately?
>>>
>>> Thanks,
>>>
>>> Jem
>>>
>>> On Thu, Jul 16, 2015 at 12:59 PM Vetle Leinonen-Roeim <ve...@roeim.net>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Not sure how IndexedRDD is persisted, but perhaps you're better off
>>>> using a NoSQL database for lookups (perhaps Cassandra, with the
>>>> Cassandra connector)? That should give you good performance on lookups, but
>>>> persisting those billion records sounds like something that will take some
>>>> time in any case.
>>>>
>>>> Regards,
>>>> Vetle
>>>>
>>>>
>>>> On Thu, Jul 16, 2015 at 10:02 AM Jem Tucker <jem.tuc...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have been using IndexedRDD as a large lookup (1 billion records) to
>>>>> join with small tables (1 million rows). The performance of IndexedRDD is
>>>>> great until it has to be persisted on disk. Are there any alternatives to
>>>>> IndexedRDD, or any changes to how I use it, that would improve performance
>>>>> with big data volumes?
>>>>>
>>>>> Kindest Regards,
>>>>>
>>>>> Jem