Have you tried setting the partitioning on the two RDDs before the join?
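
With 110M lines hashed into only the default number of partitions, the join
can easily blow the heap of a single worker. Here is a minimal sketch against
the Spark 0.9 Scala API, reusing your fileA/fileB definitions; the partition
count of 64 is only a guess you'd tune to your cluster (roughly 2-4x the
total cores):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._  // brings PairRDDFunctions into scope

    // Hash both RDDs with the same partitioner so matching keys end up
    // in the same partition and the join avoids an extra shuffle.
    val partitioner = new HashPartitioner(64)

    val fileA = sc.textFile("hdfs://.../fileA_110M.csv")
      .map(_.split(","))
      .keyBy(_(0))
      .partitionBy(partitioner)

    val fileB = sc.textFile("hdfs://.../fileB_25k.csv")
      .keyBy(x => x)
      .partitionBy(partitioner)

    // Co-partitioned inputs join partition by partition.
    fileA.join(fileB).count()

You can also raise the split count when reading, e.g. sc.textFile(path, 64),
so that no single task has to hold too large a slice of the 110M-line file.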

Best Regards,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>




On Thu, Mar 27, 2014 at 10:04 AM, lannyripple <lanny.rip...@gmail.com> wrote:

> Hi all,
>
> I've got something which I think should be straightforward, but it's not,
> so I'm not getting it.
>
> I have an 8 node Spark 0.9.0 cluster also running HDFS.  Workers have 16g
> of memory and 8 cores.
>
> In HDFS I have a CSV file of 110M lines of 9 columns (e.g.,
> [key,a,b,c...]).
> I have another file of 25K lines containing some number of keys which might
> be in my CSV file.  (Yes, I know I should use an RDBMS or Shark or
> something.  I'll get to that, but this is a toy problem that I'm using to
> get some intuition with Spark.)
>
> Working on each file individually, Spark has no problem manipulating the
> files.  If I try a join or union+filter, though, I can't seem to compute
> the join of the two files.  Code is along the lines of
>
> val fileA = sc.textFile("hdfs://.../fileA_110M.csv")
>   .map{_.split(",")}
>   .keyBy{_(0)}
> val fileB = sc.textFile("hdfs://.../fileB_25k.csv").keyBy{x => x}
>
> And trying things like fileA.join(fileB) gives me a heap OOM.  Trying
>
> (fileA ++ fileB.map{case (k, v) => (k, Array(v))})
>   .groupBy{_._1}
>   .filter{case (k, pairs) => pairs.exists{_._2.length == 1}}
>
> just causes Spark to freeze.  (In all the cases I'm trying, I just use a
> final .count to force the results.)
>
> I suspect I'm missing something fundamental about bringing the keyed data
> together into the same partitions so it can be efficiently joined, but I've
> given up for now.  If anyone can shed some light (beyond "No, really.  Use
> Shark.") on what I'm not understanding, it would be most helpful.
>
