Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Surendranauth Hiraman
Vivek, If the foldByKey solution doesn't work for you, my team uses RDD.persist(DISK_ONLY) to avoid OOM errors. It's slower, of course, and requires tuning other config parameters. It can also be a problem if you do not have enough disk space, meaning that you have to unpersist at the right

Using custom class as a key for groupByKey() or reduceByKey()

2014-06-15 Thread Gaurav Jain
I have a simple Java class as follows, that I want to use as a key while applying groupByKey or reduceByKey functions: private static class FlowId { public String dcxId; public String trxId; public String msgType;

Re: Using custom class as a key for groupByKey() or reduceByKey()

2014-06-15 Thread Sean Owen
In Java at large, you must always implement hashCode() when you implement equals(). This is not specific to Spark. This is to maintain the contract that two equals() instances have the same hash code, and that's not the case for your class now. This causes weird things to happen wherever the hash

Re: long GC pause during file.cache()

2014-06-15 Thread Hao Wang
Hi, Wei You may try to set JVM opts in *spark-env.sh* as follow to prevent or mitigate GC pause: export SPARK_JAVA_OPTS=-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC -Xmx2g -XX:MaxPermSize=256m There are more options you could add, please just Google :) Regards, Wang Hao(王灏) CloudTeam |

Re: long GC pause during file.cache()

2014-06-15 Thread Nan Zhu
SPARK_JAVA_OPTS is deprecated in 1.0, though it works fine if you don’t mind the WARNING in the logs you can set spark.executor.extraJavaOpts in your SparkConf obj Best, -- Nan Zhu On Sunday, June 15, 2014 at 12:13 PM, Hao Wang wrote: Hi, Wei You may try to set JVM opts in

Re: MLLib : Decision Tree not getting built for 5 or more levels(maxDepth=5) and the one built for 3 levels is performing poorly

2014-06-15 Thread Manish Amde
Hi Suraj, I don't see any logs from mllib. You might need to explicit set the logging to DEBUG for mllib. Adding this line for log4j.properties might fix the problem. log4j.logger.org.apache.spark.mllib.tree=DEBUG Also, please let me know if you can encounter similar problems with the Spark

Akka listens to hostname while user may spark-submit with master in IP url

2014-06-15 Thread Hao Wang
Hi, All In Spark the spark.driver.host is driver hostname in default, thus, akka actor system will listen to a URL like akka.tcp://hostname:port. However, when a user tries to use spark-submit to run application, the user may set --master spark://192.168.1.12:7077. Then, the *AppClient* in

Re: long GC pause during file.cache()

2014-06-15 Thread Surendranauth Hiraman
Is SPARK_DAEMON_JAVA_OPTS valid in 1.0.0? On Sun, Jun 15, 2014 at 4:59 PM, Nan Zhu zhunanmcg...@gmail.com wrote: SPARK_JAVA_OPTS is deprecated in 1.0, though it works fine if you don’t mind the WARNING in the logs you can set spark.executor.extraJavaOpts in your SparkConf obj Best, --

Re: long GC pause during file.cache()

2014-06-15 Thread Nan Zhu
Yes, I think in the spark-env.sh.template, it is listed in the comments (didn’t check….) Best, -- Nan Zhu On Sunday, June 15, 2014 at 5:21 PM, Surendranauth Hiraman wrote: Is SPARK_DAEMON_JAVA_OPTS valid in 1.0.0? On Sun, Jun 15, 2014 at 4:59 PM, Nan Zhu

pyspark serializer can't handle functions?

2014-06-15 Thread madeleine
It seems that the default serializer used by pyspark can't serialize a list of functions. I've seen some posts about trying to fix this by using dill to serialize rather than pickle. Does anyone know what the status of that project is, or whether there's another easy workaround? I've pasted a

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Ian O'Connell
Depending on your requirements when doing hourly metrics calculating distinct cardinality, a much more scalable method would be to use a hyper log log data structure. a scala impl people have used with spark would be

Re: long GC pause during file.cache()

2014-06-15 Thread Aaron Davidson
Note also that Java does not work well with very large JVMs due to this exact issue. There are two commonly used workarounds: 1) Spawn multiple (smaller) executors on the same machine. This can be done by creating multiple Workers (via SPARK_WORKER_INSTANCES in standalone mode[1]). 2) Use Tachyon

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Krishna Sankar
Ian, Yep, HLL is an appropriate mechanism. The countApproxDistinctByKey is a wrapper around the com.clearspring.analytics.stream.cardinality.HyperLogLogPlus. Cheers k/ On Sun, Jun 15, 2014 at 4:50 PM, Ian O'Connell i...@ianoconnell.com wrote: Depending on your requirements when doing hourly