Trouble using MultipleTextOutputFormat with Spark

2014-12-30 Thread Arpan Ghosh
Hi, I am trying to use MultipleTextOutputFormat to rename the output files of my Spark job to something other than the default "part-N". I have implemented a custom MultipleTextOutputFormat class as follows: *class DriveOutputRenameMultipleTextOutputFormat extends MultipleTextOutputFor
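For reference, the core of such a subclass is an override of Hadoop's generateFileNameForKeyValue(key, value, name), which maps each record to an output file name. A minimal sketch of the mapping it computes, in plain Python (the naming scheme and function name here are hypothetical, not the poster's actual implementation):

```python
# Sketch of what a generateFileNameForKeyValue override does: ignore the
# reducer's default "part-NNNNN" name and derive the file name from the key.
def file_name_for_key(key, default_part_name):
    # default_part_name would be e.g. "part-00000"; we discard it and use
    # a key-based name instead (hypothetical ".txt" suffix for illustration).
    return "%s.txt" % key
```

With this mapping, every record for key "drive_42" lands in "drive_42.txt" regardless of which reducer processed it.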

using MultipleOutputFormat to ensure one output file per key

2014-11-25 Thread Arpan Ghosh
Hi, How can I implement a custom MultipleOutputFormat and specify it as the output format of my Spark job so that I can ensure there is a unique output file per key (instead of a unique output file per reducer)? Thanks Arpan

Read timeout while running a Job on data in S3

2014-08-25 Thread Arpan Ghosh
I am running a Spark job on ~124 GB of data in an S3 bucket. The job runs fine but occasionally throws the following exception during the first map stage, which involves reading and transforming the data from S3. Is there a config parameter I can set to increase this timeout limit? *14/08/23 04:45

Re: [PySpark] order of values in GroupByKey()

2014-08-22 Thread Arpan Ghosh
I was grouping time series data by a key. I want the values to be sorted by timestamp after the grouping. On Fri, Aug 22, 2014 at 7:26 PM, Matthew Farrellee wrote: > On 08/22/2014 04:32 PM, Arpan Ghosh wrote: > >> Is there any way to control the ordering of values for each
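groupByKey() makes no ordering guarantee for the values within a group, so the usual approach is to sort each group explicitly after grouping, e.g. rdd.groupByKey().mapValues(lambda vs: sorted(vs, key=...)). The per-group step, sketched in plain Python (the key name and records here are made up for illustration):

```python
# Each record is (timestamp, reading); after groupByKey() the values
# for a key arrive in no particular order.
grouped = {"vehicle_1": [(3, "c"), (1, "a"), (2, "b")]}

# Sort every group's values by timestamp (the first tuple element).
sorted_groups = {k: sorted(vs, key=lambda rec: rec[0])
                 for k, vs in grouped.items()}
# sorted_groups["vehicle_1"] == [(1, "a"), (2, "b"), (3, "c")]
```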

[PySpark] order of values in GroupByKey()

2014-08-22 Thread Arpan Ghosh
Is there any way to control the ordering of values for each key during a groupByKey() operation? Is there some sort of implicit ordering in place already? Thanks Arpan

Getting hadoop distcp to work on ephemeral-hdfs in spark-ec2 cluster

2014-08-14 Thread Arpan Ghosh
Hi, I have launched an AWS Spark cluster using the spark-ec2 script (--hadoop-major-version=1). The ephemeral-HDFS is set up correctly and I can see the name node at :50070. When I try to copy files from S3 into ephemeral-HDFS with distcp via the following command: ephemeral-hdfs/bin/hadoop dis

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-14 Thread Arpan Ghosh
The errors occur at the exact same point in the job as well... right at the end of the groupByKey(), when 5 tasks are left. On Thu, Aug 14, 2014 at 12:59 PM, Arpan Ghosh wrote: > Hi Davies, > > I tried the second option and launched my ec2 cluster with master on all > t

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-14 Thread Arpan Ghosh
> 2) try master or 1.1 branch with the feature of spilling in Python. > > Davies > > On Wed, Aug 13, 2014 at 4:08 PM, Arpan Ghosh wrote: > > Here are the biggest keys: > > > > [ (17634, 87874097), > > > > (8407, 38395833), > > > > (2
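The "spilling in Python" that Davies mentions is governed by Spark configuration. A sketch of the relevant spark-defaults.conf entries as of Spark 1.1 (values illustrative, not tuned for this workload):

```
spark.shuffle.spill          true
spark.python.worker.memory   512m
```

spark.python.worker.memory caps the memory a Python worker uses during aggregation before it spills to disk.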

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-13 Thread Arpan Ghosh
way: > > rdd.countByKey().sortBy(lambda x:x[1], False).take(10) > > Davies > > > On Wed, Aug 13, 2014 at 12:21 PM, Arpan Ghosh wrote: > > Hi, > > > > Let me begin by describing my Spark setup on EC2 (launched using the > > provided spark-ec2.py script):
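The countByKey().sortBy(...) suggestion is a way to find skewed keys: count the records per key, then take the keys with the most records, largest first. Its effect, sketched in plain Python on made-up data:

```python
from collections import Counter

# (key, value) pairs as they would exist in the RDD (sample data).
pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5), ("c", 6)]

# Count records per key, then sort descending by count and keep the top 10
# (mirrors rdd.countByKey().sortBy(lambda x: x[1], False).take(10)).
counts = Counter(k for k, _ in pairs)
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
# top == [("a", 3), ("b", 2), ("c", 1)]
```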

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-13 Thread Arpan Ghosh
> if you could test it with your real case. > > Davies > > On Wed, Aug 13, 2014 at 1:57 PM, Arpan Ghosh wrote: > > Thanks Davies. I am running Spark 1.0.2 (which seems to be the latest > > release) > > > > I'll try changing it to a reduceByKey() and check th
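The reason the reduceByKey() suggestion helps with skewed keys: groupByKey() materializes every value for a key in memory at once, while reduceByKey() folds values into a running aggregate (including map-side), so no worker ever holds a skewed key's full value list. The difference, sketched in plain Python on sample data:

```python
pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]

# groupByKey-style: collect every value per key into one list; memory-heavy
# when a single key carries tens of millions of values.
groups = {}
for k, v in pairs:
    groups.setdefault(k, []).append(v)

# reduceByKey-style: combine values into a single running aggregate as they
# arrive, keeping only one value per key (here, a sum).
sums = {}
for k, v in pairs:
    sums[k] = sums.get(k, 0) + v
# sums == {"a": 8, "b": 2}
```

The aggregate function must be associative, so it only works when the per-key result (a sum, max, etc.) is what's actually needed, not the full list.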

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-13 Thread Arpan Ghosh
> > rdd.countByKey().sortBy(lambda x:x[1], False).take(10) > > Davies > > > On Wed, Aug 13, 2014 at 12:21 PM, Arpan Ghosh wrote: > > Hi, > > > > Let me begin by describing my Spark setup on EC2 (launched using the > > provided spark-ec2.py script): >

groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-13 Thread Arpan Ghosh
Hi, Let me begin by describing my Spark setup on EC2 (launched using the provided spark-ec2.py script): - 100 c3.2xlarge workers (8 cores & 15 GB memory each) - 1 c3.2xlarge master (only running the master daemon) - Spark 1.0.2 - 8 GB mounted at */* & 80 GB mounted at */mnt* *spark-default