Hi, a newbie here. I am trying to do ETL on Spark and have a few questions. I have a CSV file with a header.

1) How do I parse this file, given that it has a header?

2) I was trying to follow the tutorial here: http://ampcamp.berkeley.edu/3/exercises/data-exploration-using-spark.html
3) I am trying to do a frequency count:

    rows_key_value = rows.map(lambda x: (x[1], 1)).reduceByKey(lambda x, y: x + y, 1).collect()

After waiting a few minutes I see this error:

    java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2271)
        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:178)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:50)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:223)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
    14/02/24 13:06:45 INFO TaskSetManager: Starting task 13.0:0 as TID 331 on executor 2: node07 (PROCESS_LOCAL)
    14/02/24 13:06:45 INFO TaskSetManager: Serialized task 13.0:0 as 3809 bytes in 0 ms

How do I fix this? Thanks
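For question 1), one RDD-era approach I have seen is to drop the first line of partition 0 with mapPartitionsWithIndex, since the header lives in the first partition of the file. Below is a minimal local sketch of that per-partition function; the sample lines and column names are made up for illustration, and with a real RDD the call would be something like rdd.mapPartitionsWithIndex(skip_header):

```python
def skip_header(index, iterator):
    # Drop the first line only in partition 0, where the CSV header lives.
    if index == 0:
        next(iterator, None)
    return iterator

# Local stand-in for partition 0 of the file (first line is the header).
partition0 = iter(["name,city", "alice,NYC", "bob,LA"])
rows = [line.split(",") for line in skip_header(0, partition0)]
print(rows)  # [['alice', 'NYC'], ['bob', 'LA']]
```

A non-zero partition index leaves its lines untouched, so only the real header is removed.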
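For reference, the reduceByKey step in 3) computes, per key, the equivalent of this local Python sketch (the sample rows and the reduce_by_key helper here are illustrative, not Spark API):

```python
def reduce_by_key(pairs, fn):
    # Local equivalent of Spark's reduceByKey: fold values together per key.
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return list(acc.items())

rows = [["1", "a"], ["2", "b"], ["3", "a"]]
pairs = [(x[1], 1) for x in rows]            # rows.map(lambda x: (x[1], 1))
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(sorted(counts))  # [('a', 2), ('b', 1)]
```

One thing worth noting: the second argument 1 passed to reduceByKey sets numPartitions=1, which funnels the entire reduce through a single task, and collect() then pulls the full result to the driver; either could contribute to memory pressure, though that is only a guess from the trace above.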