I've played around with it. The CSV file comes in as 130 partitions, which I assume reflects the standard 64MB split size for HDFS files. I have increased the number of partitions and the number of tasks for things like groupByKey. Usually I blow up with GC overhead limit exceeded or sometimes a heap OOM. I recently tried throwing coalesce with shuffle = true into the mix, thinking it would bring matching keys into the same partition. E.g.,
(fileA ++ fileB.map{case (k, v) => (k, Array(v))}).coalesce(fileA.partitions.length + fileB.partitions.length, shuffle = true).groupBy...

(which should effectively imitate map-reduce), but I see GC overhead limit exceeded when I do that. I've got a stock install with num cores and worker memory set as mentioned, but even something like

fileA.sortByKey().map{_ => 1}.reduce{_ + _}

blows up with GC overhead limit exceeded (as did .count instead of the by-hand count). fileA.count works. Spark seems able to load the file as an RDD but not to manipulate it.

On Fri, Mar 28, 2014 at 3:04 AM, Sonal Goyal [via Apache Spark User List] <ml-node+s1001560n3417...@n3.nabble.com> wrote:

> Have you tried setting the partitioning?
>
> Best Regards,
> Sonal
> Nube Technologies <http://www.nubetech.co>
> <http://in.linkedin.com/in/sonalgoyal>
>
> On Thu, Mar 27, 2014 at 10:04 AM, lannyripple <[hidden email]> wrote:
>
>> Hi all,
>>
>> I've got something which I think should be straightforward, but it's not, so
>> I'm not getting it.
>>
>> I have an 8-node Spark 0.9.0 cluster also running HDFS. Workers have 16g of
>> memory and use 8 cores.
>>
>> In HDFS I have a CSV file of 110M lines of 9 columns (e.g., [key,a,b,c...]).
>> I have another file of 25K lines containing some number of keys which might
>> be in my CSV file. (Yes, I know I should use an RDBMS or Shark or
>> something. I'll get to that, but this is a toy problem that I'm using to
>> build some intuition with Spark.)
>>
>> Working on each file individually, Spark has no problem manipulating the
>> files. If I try a join or union+filter, though, I can't seem to compute the
>> join of the two files. The code is along the lines of
>>
>> val fileA = sc.textFile("hdfs://.../fileA_110M.csv").map{_.split(",")}.keyBy{_(0)}
>> val fileB = sc.textFile("hdfs://.../fileB_25k.csv").keyBy{x => x}
>>
>> And trying things like fileA.join(fileB) gives me a heap OOM.
>> Trying
>>
>> (fileA ++ fileB.map{case (k, v) => (k, Array(v))}).groupBy{_._1}.filter{case (k, xs) => xs.exists{_._2.length == 1}}
>>
>> just causes Spark to freeze. (In all the cases I'm trying, I just use a
>> final .count to force the results.)
>>
>> I suspect I'm missing something fundamental about bringing the keyed data
>> together into the same partitions so it can be efficiently joined, but I've
>> given up for now. If anyone can shed some light (beyond "No, really. Use
>> Shark.") on what I'm not understanding, it would be most helpful.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Not-getting-it-tp3316.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Not-getting-it-tp3316p3437.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
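For what it's worth, the union + groupBy approach quoted above can be modeled on plain Scala collections (no Spark, made-up toy keys) to sanity-check the logic before scaling it up. The idea: fileA rows become (key, columns) pairs, fileB keys get wrapped in a one-element Array so both sides share a type, and a key survives the filter only if a length-1 row (i.e., a fileB entry) landed in its group:

```scala
// Local-collections model of the union + groupBy "join" (toy data, no Spark).
object UnionGroupJoin {
  def main(args: Array[String]): Unit = {
    // fileA: CSV rows keyed by their first column -> (key, Array(all columns))
    val fileA = Seq("k1,a,b", "k2,c,d", "k3,e,f").map(_.split(",")).map(r => (r(0), r))
    // fileB: bare keys wrapped in a one-element Array to match fileA's value type
    val fileB = Seq("k1", "k3").map(k => (k, Array(k)))

    // Union both sides, group by key, and keep only groups containing a
    // length-1 row -- i.e., keys that actually appeared in fileB.
    val joined = (fileA ++ fileB)
      .groupBy(_._1)
      .filter { case (_, rows) => rows.exists(_._2.length == 1) }

    assert(joined.keySet == Set("k1", "k3"))
    println(joined.keySet.toSeq.sorted.mkString(","))
  }
}
```

On an RDD the same shape applies, except groupBy triggers a full shuffle of fileA's 110M rows, which is presumably where the GC pressure is coming from.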