Deduping events using Spark

2015-06-04 Thread lbierman
I'm still a bit new to Spark and am struggling to figure out the best way to dedupe my events. I load my Avro files from HDFS and then want to dedupe events that have the same nonce. For example, my code so far: JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>)
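
For illustration only, here is a minimal plain-Java sketch of the "keep one event per nonce" idea the question describes. It does not use Spark, and the `Event` class, its fields, and the `dedupe` helper are hypothetical stand-ins, not the poster's `AnalyticsEvent` type or code. On a Spark `JavaRDD`, a comparable approach would presumably key each event by nonce with `mapToPair`, collapse duplicates with `reduceByKey`, and take `values()`; note that in that case which duplicate survives is not determined by input order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupeByNonce {
    // Hypothetical stand-in for the Avro AnalyticsEvent record.
    static final class Event {
        final String nonce;
        final String payload;

        Event(String nonce, String payload) {
            this.nonce = nonce;
            this.payload = payload;
        }
    }

    // Keeps the first event seen for each nonce and drops later duplicates.
    static List<Event> dedupe(List<Event> events) {
        // LinkedHashMap preserves the order in which nonces first appear.
        Map<String, Event> firstByNonce = new LinkedHashMap<>();
        for (Event e : events) {
            firstByNonce.putIfAbsent(e.nonce, e);
        }
        return new ArrayList<>(firstByNonce.values());
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("a", "first"),
                new Event("b", "second"),
                new Event("a", "duplicate of first"));
        for (Event e : dedupe(events)) {
            System.out.println(e.nonce + " -> " + e.payload);
        }
    }
}
```

The sketch assumes the nonce alone identifies a duplicate; if two events with the same nonce can legitimately differ, the merge rule would need to be chosen deliberately.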

Re: Spark Performance on Yarn

2015-02-20 Thread lbierman
A bit more context on this issue, from the container logs on the executor. Given my cluster specs above, what would be appropriate parameters to pass in for: --num-executors --num-cores --executor-memory? I had tried it with --executor-memory 2500MB. 2015-02-20 06:50:09,056 WARN

Spark Performance on Yarn

2015-02-19 Thread lbierman
I'm a bit new to Spark, but had a question on performance. I suspect a lot of my issue is due to tuning and parameters. I have a Hive external table on this data, and queries against it run in minutes. The job: + 40 GB of Avro events on HDFS (100 million+ Avro events) + Read in the files

Spark data incorrect when more than 200 tasks

2015-02-18 Thread lbierman
I'm fairly new to Spark. We have data in Avro files on HDFS. We are trying to load all the Avro files (28 GB worth right now) and do an aggregation. When there are fewer than 200 tasks, the job runs and produces the proper results. If there are more than 200 tasks (as stated in the logs by