I'm still a bit new to Spark and am struggling to figure out the best way to
dedupe my events.
I load my Avro files from HDFS and then I want to dedupe events that have
the same nonce.
For example, my code so far:
JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>)
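One common way to express the dedupe is to key each event by its nonce and keep a single record per key; in Spark that shape is usually `mapToPair(e -> new Tuple2<>(e.getNonce(), e))` followed by `reduceByKey((a, b) -> a)`. Below is a minimal plain-Java sketch of that keep-one-per-nonce logic (no Spark dependency; the `AnalyticsEvent` fields here are assumed stand-ins for the real Avro record, not its actual schema):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of "keep one event per nonce". In Spark the same idea
// would be mapToPair(e -> new Tuple2<>(e.getNonce(), e)) followed by
// reduceByKey((a, b) -> a) and then values().
public class DedupeSketch {

    // Minimal stand-in for the AnalyticsEvent Avro record (fields assumed).
    public static class AnalyticsEvent {
        public final String nonce;
        public final String payload;
        public AnalyticsEvent(String nonce, String payload) {
            this.nonce = nonce;
            this.payload = payload;
        }
    }

    // Keep the first event seen for each nonce, preserving encounter order.
    public static List<AnalyticsEvent> dedupeByNonce(List<AnalyticsEvent> events) {
        Map<String, AnalyticsEvent> byNonce = new LinkedHashMap<>();
        for (AnalyticsEvent e : events) {
            byNonce.putIfAbsent(e.nonce, e);
        }
        return new ArrayList<>(byNonce.values());
    }
}
```

The `reduceByKey((a, b) -> a)` merge is arbitrary-but-deterministic per partition ordering; if two events with the same nonce can differ, you may want a merge function that picks one by timestamp instead of "first seen".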
A bit more context on this issue. From the container logs on the executor
Given my cluster specs above, what would be appropriate parameters to pass
in for:
--num-executors --num-cores --executor-memory
I had tried it with --executor-memory 2500MB
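For reference, a hedged sketch of how those flags are usually passed to `spark-submit` (the executor counts and sizes below are placeholders, not recommendations; the right numbers depend on node memory, cores, and YARN overhead). Two details worth noting: the actual spark-submit flag is `--executor-cores`, not `--num-cores`, and memory sizes are normally written with a short unit suffix such as `2500m` or `2g` rather than `2500MB`:

```shell
# Hypothetical sizing for illustration only.
# Flag is --executor-cores (there is no --num-cores);
# memory strings take suffixes like m/g, e.g. 2500m or 2g.
spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 2500m \
  --class com.example.MyJob \
  my-job.jar
```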
2015-02-20 06:50:09,056 WARN
I'm a bit new to Spark, but I have a question about performance. I suspect a
lot of my issue is due to tuning and parameters. I have a Hive external table
on this data, and queries against it run in minutes
The Job:
+ 40 GB of Avro events on HDFS (100 million+ events)
+ Read in the files
I'm fairly new to Spark.
We have data in Avro files on HDFS.
We are trying to load all the Avro files (28 GB worth right now) and do an
aggregation.
When we have fewer than 200 tasks, everything runs and produces the proper
results. If there are more than 200 tasks (as stated in the logs by
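For the aggregation itself, the usual Spark shape is `mapToPair(...)` followed by `reduceByKey(...)`, and `reduceByKey` accepts an explicit `numPartitions` argument, which is one way to control how many tasks the shuffle stage produces. Below is a plain-Java sketch of the per-key merge logic that `reduceByKey` would apply pairwise (the sum-per-key aggregation is an assumed example, since the post does not say what is being aggregated):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a per-key aggregation. In Spark this would be
// mapToPair(...) followed by reduceByKey((a, b) -> a + b, numPartitions),
// where the explicit numPartitions argument controls the shuffle task count.
public class AggregationSketch {

    // Sum values per key -- the merge reduceByKey would apply pairwise.
    public static Map<String, Long> sumByKey(List<Map.Entry<String, Long>> records) {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, Long> r : records) {
            totals.merge(r.getKey(), r.getValue(), Long::sum);
        }
        return totals;
    }
}
```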