Hello,
Most of the tasks I've accomplished in Spark have been fairly straightforward, but
I can't figure out the following problem using the Spark API:
Basically, I have an IP with a bunch of user IDs associated with it. I want to
create a list of all user IDs that are associated together, even if some are
only linked indirectly through a chain of shared IPs.
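To make the problem concrete, here is a toy sketch of the data (IPs and user
IDs made up; `sc` is the usual SparkContext):

pairs = sc.parallelize([
    ("1.2.3.4", "user_a"),
    ("1.2.3.4", "user_b"),   # user_a and user_b share an IP
    ("5.6.7.8", "user_b"),
    ("5.6.7.8", "user_c"),   # user_b links the two IPs together
])
# Desired output: one group {user_a, user_b, user_c}, because user_b
# connects the two IPs even though user_a and user_c never share one.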
Ah yes, you're quite right: with partitions I could probably process a good
chunk of the data, but I didn't think a reduce would work. Sorry, I'm still
new to Spark and MapReduce in general, but I thought the result of a reduce
wasn't an RDD and had to fit into memory. If the result of a reduce can
stay distributed as an RDD, then that changes things.
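If I understand the replies correctly, the distinction is roughly this (a toy
sketch):

rdd = sc.parallelize([("ip1", {"a"}), ("ip1", {"b"}), ("ip2", {"c"})])

# reduce() brings a single value back to the driver, so it must fit in memory:
total = rdd.map(lambda kv: len(kv[1])).reduce(lambda x, y: x + y)  # plain int

# reduceByKey() produces another RDD, which stays distributed:
merged = rdd.reduceByKey(lambda s1, s2: s1 | s2)  # still an RDD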
For what it's worth, I got it to work with a Cartesian product. Even if it's
very inefficient, it worked out alright for me. The trick was to flat map it
(step 4) after the Cartesian product so that I could do a reduce by key and
find the commonalities. After I was done, I could check whether any of the
merged value sets still overlapped.
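Roughly what I did, as a simplified sketch (variable names are made up and the
real job has more steps; `pairs` is the (ip, user_id) RDD from before):

# step 1: group the (ip, user_id) pairs into (ip, {user ids})
groups = pairs.groupByKey().mapValues(set)

# step 2: compare every group against every other group (the expensive part)
crossed = groups.cartesian(groups)

# steps 3-4: flat map each pair of groups so that overlapping groups end up
# under the same keys, then reduce by key to merge them
def link(pair):
    (_, g1), (_, g2) = pair
    if g1 & g2:                 # the two groups share at least one user
        merged = g1 | g2
        for uid in merged:
            yield (uid, merged)

result = crossed.flatMap(link).reduceByKey(lambda a, b: a | b)
# A single pass only merges groups that overlap directly, hence the check
# for remaining overlaps afterwards.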
Well, for what it's worth, I found the issue after spending the whole night
running experiments ;).
Basically, I needed to give a higher number of partitions to the groupByKey.
I was simply using the default, which generated only 4 partitions, and so the
whole thing blew up.
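Concretely, the fix was a one-line change (200 below is just an example; pick
a count suited to your data size):

groups = pairs.groupByKey(numPartitions=200)   # instead of pairs.groupByKey()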
Hello,
Just to make sure I've read the docs and the forums correctly: my
understanding is that currently, in Python with Spark 1.0.1, there is no way
to save my RDD to disk in a form I can just reload. The Hadoop RDD support is
not yet present in Python.
Is that correct? I just want to make sure that's the case.
Yeah, but I would still have to do a map pass with ast.literal_eval() for
each line, correct?
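For the archives, the workaround I sketched out (the path is just an example;
the RDD must hold plain Python literals such as tuples, dicts, strings, and
numbers, which is what ast.literal_eval can parse back):

import ast

# Save: write each element's repr() as one line of text.
rdd.map(repr).saveAsTextFile("hdfs:///tmp/my_rdd")

# Reload: one extra map pass with ast.literal_eval to rebuild the objects.
reloaded = sc.textFile("hdfs:///tmp/my_rdd").map(ast.literal_eval)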
Hello,
I have an issue where my Spark code uses too much memory in the final step
(a count for testing purposes; it will write the result to a DB once it
works). I'm really not sure how I can break down the last step to use
less RAM.
So, basically, my data is log lines, and each log line
Hello,
Obviously I'm new to Spark, and I assume I'm missing something really obvious,
but all my map operations run on only one processor even though they have
many partitions. I've tried to Google the issue, but everything seems
fine: I use local[8] and my file has more than one partition.
Well, for what it's worth, I found the answer in the Mesos Spark
documentation:
https://github.com/mesos/spark/wiki/Spark-Programming-Guide
The quick start guide says to use --master local[4] with spark-submit, which
implies that more than one processor would be used. However, that only
applies when you launch through spark-submit; when you create the
SparkContext yourself, you have to set the master explicitly.
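A minimal sketch of what ended up working for me (the app name is just an
example):

from pyspark import SparkConf, SparkContext

# "local[8]" asks for 8 local worker threads instead of the default.
conf = SparkConf().setMaster("local[8]").setAppName("my_app")
sc = SparkContext(conf=conf)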