Trouble using MultipleTextOutputFormat with Spark

2014-12-30 Thread Arpan Ghosh
Hi, I am trying to use MultipleTextOutputFormat to rename the output files of my Spark job to something different from the default part-N. I have implemented a custom MultipleTextOutputFormat class as follows: class DriveOutputRenameMultipleTextOutputFormat extends …
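
For reference, a minimal sketch of this pattern, assuming Hadoop's old mapred API (which saveAsHadoopFile expects); the class name and prefix below are placeholders, not the code from the email:

```scala
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class RenamingTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // `name` is the default leaf file name ("part-00000", ...); whatever
  // this method returns is used as the actual output file name.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    name.replace("part", "drive")
}

// Usage (path and classes are illustrative):
// rdd.saveAsHadoopFile("hdfs:///out", classOf[String], classOf[String],
//   classOf[RenamingTextOutputFormat])
```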

using MultipleOutputFormat to ensure one output file per key

2014-11-25 Thread Arpan Ghosh
Hi, How can I implement a custom MultipleOutputFormat and specify it as the output format of my Spark job, so that I can ensure that there is a unique output file per key (instead of a unique output file per reducer)? Thanks Arpan
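
The usual pattern (a hedged sketch, not the poster's eventual solution) is to override generateFileNameForKeyValue so the key itself becomes the file name, and to partition by key first so that only one task writes each file:

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class PerKeyTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Write only the value into the file; NullWritable suppresses the key.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // Name each output file after its key, so every key gets its own file.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString
}

// Repartition by key first so a given key is written by exactly one task
// (two tasks emitting the same file name would collide at commit time):
// rdd.partitionBy(new org.apache.spark.HashPartitioner(64))
//    .saveAsHadoopFile("hdfs:///out", classOf[String], classOf[String],
//      classOf[PerKeyTextOutputFormat])
```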

Read timeout while running a Job on data in S3

2014-08-25 Thread Arpan Ghosh
I am running a Spark job on ~124 GB of data in an S3 bucket. The job runs fine but occasionally throws the following exception during the first map stage, which involves reading and transforming the data from S3. Is there a config parameter I can set to increase this timeout limit? 14/08/23 …
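
The truncated stack trace makes the exact knob hard to pin down; assuming the data is read through s3n:// (backed by JetS3t on a Hadoop 1 build), the read timeouts live in a jets3t.properties file placed on the classpath of the driver and executors. A hedged sketch, with illustrative values:

```properties
# jets3t.properties -- raise JetS3t's HTTP timeouts and retries (milliseconds)
httpclient.socket-timeout-ms=300000
httpclient.connection-timeout-ms=300000
httpclient.retry-max=10
```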

[PySpark] order of values in GroupByKey()

2014-08-22 Thread Arpan Ghosh
Is there any way to control the ordering of values for each key during a groupByKey() operation? Is there some sort of implicit ordering in place already? Thanks Arpan
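
There is no ordering guarantee: groupByKey() yields values in whatever order they arrive from the shuffle. If an order is needed, impose it after grouping. In PySpark that would be rdd.groupByKey().mapValues(sorted); a minimal runnable sketch of the same idea in Scala:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x

object GroupSortSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("group-sort").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2)))
    // Shuffle arrival order is nondeterministic; sort each group explicitly.
    val sortedGroups = pairs.groupByKey().mapValues(_.toSeq.sorted)
    sortedGroups.collect().foreach(println) // e.g. (a,List(1, 2, 3)), (b,List(2))
    sc.stop()
  }
}
```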

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-14 Thread Arpan Ghosh
… or 1.1 branch with the feature of spilling in Python. Davies On Wed, Aug 13, 2014 at 4:08 PM, Arpan Ghosh ar...@automatic.com wrote: Here are the biggest keys: [(17634, 87874097), (8407, 38395833), (20092, 14403311), (9295, 4142636), (14359, 3129206 …
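
For context on the fix Davies mentions: Spark 1.1 taught PySpark's aggregation to spill to disk when a partition does not fit in memory. A hedged sketch of the relevant spark-defaults.conf settings (values shown are the 1.1 defaults as I recall them):

```
spark.shuffle.spill          true
spark.python.worker.memory   512m
```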

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-14 Thread Arpan Ghosh
The errors are occurring at the exact same point in the job as well… right at the end of the groupByKey() when 5 tasks are left. On Thu, Aug 14, 2014 at 12:59 PM, Arpan Ghosh ar...@automatic.com wrote: Hi Davies, I tried the second option and launched my EC2 cluster with master on all …

Getting hadoop distcp to work on ephemeral-HDFS in a spark-ec2 cluster

2014-08-14 Thread Arpan Ghosh
Hi, I have launched an AWS Spark cluster using the spark-ec2 script (--hadoop-major-version=1). The ephemeral-HDFS is set up correctly and I can see the namenode at master hostname:50070. When I try to copy files from S3 into ephemeral-HDFS with distcp using the following command: …
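
The command itself is cut off; on a spark-ec2 cluster the ephemeral-HDFS tooling lives under /root/ephemeral-hdfs and the namenode listens on port 9000, so the invocation is usually of this shape (bucket and paths are placeholders):

```
/root/ephemeral-hdfs/bin/hadoop distcp \
  s3n://<bucket>/<input-path> \
  hdfs://<master-hostname>:9000/<dest-path>
```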

groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-13 Thread Arpan Ghosh
Hi, Let me begin by describing my Spark setup on EC2 (launched using the provided spark-ec2.py script): - 100 c3.2xlarge workers (8 cores, 15 GB memory each) - 1 c3.2xlarge master (only running the master daemon) - Spark 1.0.2 - 8 GB mounted at /, 80 GB mounted at /mnt …

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-13 Thread Arpan Ghosh
… ], False).take(10) Davies On Wed, Aug 13, 2014 at 12:21 PM, Arpan Ghosh ar...@automatic.com wrote: Hi, Let me begin by describing my Spark setup on EC2 (launched using the provided spark-ec2.py script): 100 c3.2xlarge workers (8 cores, 15 GB memory each) 1 c3.2xlarge Master (only …

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-13 Thread Arpan Ghosh
… appreciate it if you could test it with your real case. Davies On Wed, Aug 13, 2014 at 1:57 PM, Arpan Ghosh ar...@automatic.com wrote: Thanks Davies. I am running Spark 1.0.2 (which seems to be the latest release). I'll try changing it to a reduceByKey() and check the size of the largest …
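
A sketch of the groupByKey()-to-reduceByKey() change under discussion, assuming the downstream computation is an aggregation (the real job's logic is not shown in the thread; a per-key count stands in for it):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x

object ReduceNotGroup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reduce-not-group").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq((17634, "a"), (17634, "b"), (8407, "c")))
    // groupByKey() must buffer every value of a key before a task can
    // finish, which is what breaks down on heavily skewed keys. When the
    // result is an aggregate, reduceByKey() combines map-side instead and
    // never materializes the full value list.
    val counts = pairs.mapValues(_ => 1L).reduceByKey(_ + _)
    counts.collect().foreach(println)
    sc.stop()
  }
}
```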

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-13 Thread Arpan Ghosh
… ().sortBy(lambda x:x[1], False).take(10) Davies On Wed, Aug 13, 2014 at 12:21 PM, Arpan Ghosh ar...@automatic.com wrote: Hi, Let me begin by describing my Spark setup on EC2 (launched using the provided spark-ec2.py script): 100 c3.2xlarge workers (8 cores, 15 GB memory each) 1 …