Vivek,
If the foldByKey solution doesn't work for you, my team uses
RDD.persist(DISK_ONLY) to avoid OOM errors.
It's slower, of course, and requires tuning other config parameters. It can
also be a problem if you do not have enough disk space, meaning that you
have to unpersist at the right time.
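A minimal sketch (the input path is hypothetical; assumes this runs inside a
method with an existing JavaSparkContext sc):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.StorageLevels;

    JavaRDD<String> lines = sc.textFile("hdfs:///data/input");  // hypothetical path
    lines.persist(StorageLevels.DISK_ONLY);  // spill partitions to disk, not memory
    // ... run the jobs that reuse `lines` ...
    lines.unpersist();  // release the disk space once the RDD is no longer needed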
I have a simple Java class, as follows, that I want to use as a key when
applying the groupByKey or reduceByKey functions:
private static class FlowId {
    public String dcxId;
    public String trxId;
    public String msgType;
}
In Java at large, you must always implement hashCode() when you implement
equals(). This is not specific to Spark. It maintains the contract that two
equal instances have the same hash code, and that's not the case for your
class now. This causes weird things to happen wherever the hash code is
used.
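A minimal sketch of the fix (using java.util.Objects, available since Java 7;
making the key Serializable is also a good idea, since Spark ships keys
between nodes):

    import java.io.Serializable;
    import java.util.Objects;

    private static class FlowId implements Serializable {
        public String dcxId;
        public String trxId;
        public String msgType;

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof FlowId)) return false;
            FlowId other = (FlowId) o;
            return Objects.equals(dcxId, other.dcxId)
                && Objects.equals(trxId, other.trxId)
                && Objects.equals(msgType, other.msgType);
        }

        @Override
        public int hashCode() {
            // must be consistent with equals(): equal instances get equal hash codes
            return Objects.hash(dcxId, trxId, msgType);
        }
    }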
Hi, Wei
You may try to set JVM opts in *spark-env.sh* as follows to prevent or
mitigate GC pauses:
export SPARK_JAVA_OPTS="-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC
-Xmx2g -XX:MaxPermSize=256m"
There are more options you could add; please just Google for them. :)
Regards,
Wang Hao(王灏)
CloudTeam |
SPARK_JAVA_OPTS is deprecated in 1.0, though it works fine if you don’t mind
the WARNING in the logs.
You can set spark.executor.extraJavaOptions in your SparkConf object.
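A minimal sketch (the app name is hypothetical; note that heap size cannot go
in extraJavaOptions and should be set via spark.executor.memory instead):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf conf = new SparkConf()
        .setAppName("MyApp")  // hypothetical
        .set("spark.executor.extraJavaOptions",
             "-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC");
    JavaSparkContext sc = new JavaSparkContext(conf);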
Best,
--
Nan Zhu
Hi Suraj,
I don't see any logs from MLlib. You might need to explicitly set the logging
level to DEBUG for MLlib. Adding this line to log4j.properties might fix the
problem:
log4j.logger.org.apache.spark.mllib.tree=DEBUG
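For context, the whole file would look roughly like the stock
conf/log4j.properties.template with that one line added (a sketch, not
verified against your setup):

    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
    log4j.logger.org.apache.spark.mllib.tree=DEBUG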
Also, please let me know if you encounter similar problems with the
Spark
Hi, All
In Spark, spark.driver.host defaults to the driver's hostname, so the Akka
actor system will listen on a URL like akka.tcp://hostname:port. However,
when a user tries to use spark-submit to run an application, the user may set
--master spark://192.168.1.12:7077.
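For example (the class and jar names are hypothetical):

    ./bin/spark-submit --class com.example.MyApp \
      --master spark://192.168.1.12:7077 \
      myapp.jar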
Then, the *AppClient* in
Is SPARK_DAEMON_JAVA_OPTS valid in 1.0.0?
On Sun, Jun 15, 2014 at 4:59 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
SPARK_JAVA_OPTS is deprecated in 1.0, though it works fine if you
don’t mind the WARNING in the logs.
You can set spark.executor.extraJavaOptions in your SparkConf object.
Best,
--
Yes, I think it is listed in the comments in spark-env.sh.template (didn’t
check…).
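If it helps, a minimal spark-env.sh sketch (the GC flag is only an example;
SPARK_DAEMON_JAVA_OPTS applies to the standalone daemons such as the master
and worker, not to executors):

    # conf/spark-env.sh
    export SPARK_DAEMON_JAVA_OPTS="-XX:+UseConcMarkSweepGC"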
Best,
--
Nan Zhu
On Sunday, June 15, 2014 at 5:21 PM, Surendranauth Hiraman wrote:
Is SPARK_DAEMON_JAVA_OPTS valid in 1.0.0?
On Sun, Jun 15, 2014 at 4:59 PM, Nan Zhu
It seems that the default serializer used by pyspark can't serialize a list
of functions.
I've seen some posts about trying to fix this by using dill to serialize
rather than pickle.
Does anyone know what the status of that project is, or whether there's
another easy workaround?
I've pasted a
Depending on your requirements when calculating distinct cardinality for
hourly metrics, a much more scalable method would be to use a HyperLogLog
data structure.
A Scala implementation people have used with Spark would be
Note also that Java does not work well with very large JVM heaps due to this
exact issue. There are two commonly used workarounds:
1) Spawn multiple (smaller) executors on the same machine. This can be done
by creating multiple Workers (via SPARK_WORKER_INSTANCES in standalone
mode [1]); see the sketch after this list.
2) Use Tachyon
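For workaround 1, a spark-env.sh sketch (the sizes are hypothetical; each
worker then runs its own, smaller JVM):

    # conf/spark-env.sh on each machine (standalone mode)
    export SPARK_WORKER_INSTANCES=2
    export SPARK_WORKER_MEMORY=16g   # hypothetical per-worker memory
    export SPARK_WORKER_CORES=8      # hypothetical per-worker cores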
Ian,
Yep, HLL is an appropriate mechanism. countApproxDistinctByKey is a wrapper
around
com.clearspring.analytics.stream.cardinality.HyperLogLogPlus.
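A minimal Java sketch (tiny inline (hour, userId) data; assumes an existing
JavaSparkContext sc; 0.05 is the relative accuracy):

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    JavaPairRDD<String, String> events = sc.parallelizePairs(Arrays.asList(
        new Tuple2<String, String>("2014-06-15T00", "user1"),
        new Tuple2<String, String>("2014-06-15T00", "user1"),  // duplicate user
        new Tuple2<String, String>("2014-06-15T01", "user2")));
    // approximate distinct count per key, backed by HyperLogLogPlus
    JavaPairRDD<String, Long> distinctPerHour = events.countApproxDistinctByKey(0.05);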
Cheers
k/