Have you tried setting the partitioning?

Best Regards,
Sonal
Nube Technologies <http://www.nubetech.co>
<http://in.linkedin.com/in/sonalgoyal>

On Thu, Mar 27, 2014 at 10:04 AM, lannyripple <lanny.rip...@gmail.com> wrote:

> Hi all,
>
> I've got something which I think should be straightforward, but it's not,
> so I'm not getting it.
>
> I have an 8-node Spark 0.9.0 cluster also running HDFS. Workers have 16g
> of memory using 8 cores.
>
> In HDFS I have a CSV file of 110M lines of 9 columns (e.g.,
> [key,a,b,c,...]). I have another file of 25K lines containing some number
> of keys which might be in my CSV file. (Yes, I know I should use an RDBMS
> or Shark or something. I'll get to that, but this is a toy problem that
> I'm using to build some intuition with Spark.)
>
> Working on each file individually, Spark has no problem manipulating the
> files. If I try a join or a union+filter, though, I can't seem to compute
> the join of the two files. The code is along the lines of
>
>     val fileA = sc.textFile("hdfs://.../fileA_110M.csv").map{_.split(",")}.keyBy{_(0)}
>     val fileB = sc.textFile("hdfs://.../fileB_25k.csv").keyBy{x => x}
>
> and trying things like fileA.join(fileB) gives me a heap OOM. Trying
>
>     (fileA ++ fileB.map{case (k, v) => (k, Array(v))})
>       .groupBy{_._1}
>       .filter{case (k, xs) => xs.exists{case (_, arr) => arr.length == 1}}
>
> just causes Spark to freeze. (In all the cases I'm trying, I just use a
> final .count to force the results.)
>
> I suspect I'm missing something fundamental about bringing the keyed data
> together into the same partitions so it can be efficiently joined, but
> I've given up for now. If anyone can shed some light (beyond "No really,
> use Shark.") on what I'm not understanding, it would be most helpful.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Not-getting-it-tp3316.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
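To make the partitioning suggestion concrete, here is a minimal plain-Scala sketch (no Spark; toy data, hypothetical keys k1/k2/k3) of what co-partitioning buys you. It mimics Spark's HashPartitioner behaviour (bucket = key.hashCode modulo numPartitions); in Spark 0.9 the analogous step would be calling partitionBy with the same HashPartitioner on both keyed RDDs before the join, so that matching keys land in the same partition and the join works bucket-by-bucket instead of shuffling everything at once:

```scala
// Number of buckets; in Spark this would be the partition count.
val numPartitions = 4

// Same idea as Spark's HashPartitioner: hash the key, take it modulo
// the bucket count (abs() because hashCode can be negative).
def partitionOf(key: String): Int =
  math.abs(key.hashCode % numPartitions)

// Toy stand-ins for the two keyed datasets from the question:
// fileA maps key -> full CSV row, fileB maps key -> key.
val fileA = Seq("k1" -> Array("k1", "a", "b"), "k2" -> Array("k2", "c", "d"))
val fileB = Seq("k1" -> "k1", "k3" -> "k3")

// Bucket both datasets with the SAME partitioner.
val bucketsA = fileA.groupBy { case (k, _) => partitionOf(k) }
val bucketsB = fileB.groupBy { case (k, _) => partitionOf(k) }

// Because both sides use the same partitioner, a key's matches can only
// live in the bucket with the same index, so each bucket joins locally.
val joined = (0 until numPartitions).flatMap { p =>
  val a = bucketsA.getOrElse(p, Seq.empty)
  val bKeys = bucketsB.getOrElse(p, Seq.empty).map(_._1).toSet
  a.filter { case (k, _) => bKeys.contains(k) }
}

// Only keys present in both datasets survive.
println(joined.map(_._1))
```

Since fileB here is only 25K lines, another common trick is to collect its keys into a Set on the driver and filter fileA against it, avoiding the shuffle entirely; the bucket-local filter above is essentially that, done per partition.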