Help in merging an RDD against itself using the V of a (K,V).

2014-07-23 Thread Roch Denis
Hello, Most of the tasks I've accomplished in Spark were fairly straightforward, but I can't figure out the following problem using the Spark API: basically, I have an IP with a bunch of user IDs associated with it. I want to create a list of all user IDs that are associated together, even if some are
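
A minimal sketch of the shape of the problem described above, assuming (IP, user id) pairs as input; the SparkContext setup, sample data, and names are illustrative, not taken from the original post:

from pyspark import SparkContext

sc = SparkContext("local[2]", "ip-user-grouping")

# Hypothetical input: (ip, user_id) pairs. Users that share an IP belong together,
# and groups that share a user should merge -- essentially connected components.
pairs = sc.parallelize([
    ("10.0.0.1", "u1"), ("10.0.0.1", "u2"),
    ("10.0.0.2", "u2"), ("10.0.0.2", "u3"),
    ("10.0.0.3", "u4"),
])

ip_groups = pairs.groupByKey().mapValues(frozenset)
# Desired end result: one group {u1, u2, u3} (linked through u2) and one group {u4}.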

Re: Help in merging an RDD against itself using the V of a (K,V).

2014-07-23 Thread Roch Denis
Ah yes, you're quite right; with partitions I could probably process a good chunk of the data, but I didn't think a reduce would work? Sorry, I'm still new to Spark and MapReduce in general, but I thought that the reduce result wasn't an RDD and had to fit into memory. If the result of a reduce can
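
For reference, a small sketch of the distinction being discussed (data and names are made up): rdd.reduce() is an action that pulls a single value back to the driver, while reduceByKey() is a transformation whose result is still a distributed RDD, so it does not have to fit in driver memory.

from pyspark import SparkContext

sc = SparkContext("local[2]", "reduce-vs-reducebykey")

kv = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Action: a single Python int comes back to the driver.
total = kv.map(lambda pair: pair[1]).reduce(lambda x, y: x + y)

# Transformation: the per-key sums stay distributed as an RDD.
per_key = kv.reduceByKey(lambda x, y: x + y)

print(total)              # 6
print(per_key.collect())  # e.g. [('a', 4), ('b', 2)]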

Re: Help in merging an RDD against itself using the V of a (K,V).

2014-07-23 Thread Roch Denis
For what it's worth, I got it to work with a Cartesian product; even if it's very inefficient, it worked out alright for me. The trick was to flatMap it (step 4) after the Cartesian product so that I could do a reduceByKey and find the commonalities. After I was done, I could check if any Value
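
A rough sketch of the Cartesian-product approach described above, reconstructed from the description (the sample data, names, and step numbering are assumptions, and deeply chained groups may need more than one pass):

from pyspark import SparkContext

sc = SparkContext("local[2]", "cartesian-merge")

pairs = sc.parallelize([
    ("10.0.0.1", "u1"), ("10.0.0.1", "u2"),
    ("10.0.0.2", "u2"), ("10.0.0.2", "u3"),
    ("10.0.0.3", "u4"),
])
ip_groups = pairs.groupByKey().mapValues(frozenset)

# Compare every group of users against every other group (O(n^2), hence inefficient).
crossed = ip_groups.cartesian(ip_groups)

# "Step 4": flatMap each pair of overlapping groups into (user_id, merged_group)
# records so that a reduceByKey on user_id can union the related groups.
def merge_if_related(record):
    (_, left), (_, right) = record
    if left & right:                      # the two groups share at least one user
        merged = left | right
        return [(u, merged) for u in merged]
    return []

merged = crossed.flatMap(merge_if_related).reduceByKey(lambda a, b: a | b)
print(merged.collect())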

Re: Last step of processing is using too much memory.

2014-07-18 Thread Roch Denis
Well, for what it's worth, I found the issue after spending the whole night running experiments ;). Basically, I needed to give a higher number of partitions for the groupByKey. I was simply using the default, which generated only 4 partitions, and so the whole thing blew up.
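
A hedged sketch of that fix (the numbers and data are illustrative): passing an explicit partition count to groupByKey spreads the shuffle over more, smaller tasks instead of the handful of default partitions.

from pyspark import SparkContext

sc = SparkContext("local[4]", "groupbykey-partitions")

pairs = sc.parallelize([(i % 1000, i) for i in range(100000)])

# Relying on the default here produced only 4 partitions in the original report.
grouped_default = pairs.groupByKey()

# Asking for more partitions keeps each task's grouped values smaller.
grouped = pairs.groupByKey(numPartitions=200)
print(grouped_default.getNumPartitions(), grouped.getNumPartitions())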

Python: saving/reloading RDD

2014-07-18 Thread Roch Denis
Hello, Just to make sure I correctly read the docs and the forums: it's my understanding that currently, in Python with Spark 1.0.1, there is no way to save my RDD to disk in a form I can just reload. The Hadoop RDDs are not yet present in Python. Is that correct? I just want to make sure that's the case
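
For context, the usual workaround in early PySpark was saveAsTextFile(), which writes each record's str() representation; a minimal sketch, with the path and data as assumptions:

from pyspark import SparkContext

sc = SparkContext("local[2]", "save-rdd")

rdd = sc.parallelize([("a", 1), ("b", 2)])

# Each record is written as its str() form, one per line, under the given directory.
rdd.saveAsTextFile("/tmp/my_rdd")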

Re: Python: saving/reloading RDD

2014-07-18 Thread Roch Denis
Yeah, but I would still have to do a map pass with an ast.literal_eval() for each line, correct?
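
A sketch of that extra reload-plus-parse pass (the path and record format are assumptions carried over from the save sketch above): ast.literal_eval rebuilds each tuple from its str() form.

import ast

from pyspark import SparkContext

sc = SparkContext("local[2]", "reload-rdd")

# Every line is the str() of an original record, so literal_eval turns it
# back into a Python tuple; this is the extra map pass asked about above.
reloaded = sc.textFile("/tmp/my_rdd").map(ast.literal_eval)
print(reloaded.collect())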

Last step of processing is using too much memory.

2014-07-17 Thread Roch Denis
Hello, I have an issue where my Spark code is using too much memory in the final step (a count for testing purposes; it will write the result to a DB when it works). I'm really not sure how I can break down the last step to use less RAM. So, basically, my data is log lines, and each log line

No parallelism in map transformation

2014-07-15 Thread Roch Denis
Hello, Obviously I'm new to Spark, and I assume I'm missing something really obvious, but all my map operations run on only one processor even though they have many partitions. I've tried to Google the issue, but everything seems good: I use local[8] and my file has more than one partition (
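
A hedged diagnostic sketch for this situation (the file path and numbers are assumptions): a map over an RDD with N partitions runs at most N concurrent tasks, so it helps to confirm what the context and the input RDD actually report.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[8]").setAppName("parallelism-check")
sc = SparkContext(conf=conf)

lines = sc.textFile("/tmp/input.txt", minPartitions=8)
print(sc.defaultParallelism, lines.getNumPartitions())

# If the input really has only one partition, forcing more spreads the map work.
lines = lines.repartition(8)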

Re: No parallelism in map transformation

2014-07-15 Thread Roch Denis
Well, for what it's worth, I found the answer in the Mesos Spark documentation: https://github.com/mesos/spark/wiki/Spark-Programming-Guide The quick start guide says to use --master local[4] with spark-submit, and that implies that it would use more than one processor. However that
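
A hedged sketch of how that plays out in practice (the names are assumptions): when a job is launched with spark-submit --master "local[4]" my_script.py, the script can leave the master unset in its SparkConf so the --master flag decides how many local cores are used.

from pyspark import SparkConf, SparkContext

# No setMaster() here: the master (and thus the core count) is taken from the
# --master flag passed to spark-submit, e.g. --master "local[4]".
conf = SparkConf().setAppName("submit-controlled-parallelism")
sc = SparkContext(conf=conf)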