Hi,
I am trying to use MultipleTextOutputFormat to rename the output files
of my Spark job to something other than the default "part-N".
I have implemented a custom MultipleTextOutputFormat class as follows:
class DriveOutputRenameMultipleTextOutputFormat extends
MultipleTextOutputFormat
Hi,
How can I implement a custom MultipleOutputFormat and specify it as the
output of my Spark job so that I can ensure that there is a unique output
file per key (instead of a unique output file per reducer)?
Thanks
Arpan
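As a hedged illustration of the per-key output idea asked about above: in Spark itself this is usually done JVM-side by subclassing Hadoop's MultipleTextOutputFormat and overriding generateFileNameForKeyValue so the returned file name is derived from the key. The plain-Python stand-in below (no Spark, hypothetical helper name) demonstrates only the grouping-and-naming logic, not the actual OutputFormat machinery:

```python
# Plain-Python stand-in (no Spark) for what a custom MultipleOutputFormat
# achieves: one output file per key instead of one file per reducer.
import os
import tempfile

def write_one_file_per_key(records, out_dir):
    """Group (key, value) records and write each key's values to its own file."""
    grouped = {}
    for key, value in records:
        grouped.setdefault(key, []).append(value)
    paths = {}
    for key, values in grouped.items():
        # name the file after the key, as generateFileNameForKeyValue would
        path = os.path.join(out_dir, "%s.txt" % key)
        with open(path, "w") as f:
            f.write("\n".join(values))
        paths[key] = path
    return paths

records = [("a", "1"), ("b", "2"), ("a", "3")]
out_dir = tempfile.mkdtemp()
paths = write_one_file_per_key(records, out_dir)
# → two files: a.txt (lines "1", "3") and b.txt (line "2")
```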
I am running a Spark job on ~124 GB of data in an S3 bucket. The job runs
fine but occasionally throws the following exception during the first map
stage, which involves reading and transforming the data from S3. Is there a
config parameter I can set to increase this timeout limit?
14/08/23 04:45
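Without the full stack trace it is hard to say which timeout is firing. If it turns out to be Spark's internal connection ack timeout (a plausible culprit when long S3 reads coincide with GC pauses), Spark 1.x exposes a setting for it in spark-defaults.conf. This is only a guess at the relevant knob, and the 600-second value is an arbitrary placeholder:

```
# spark-defaults.conf (Spark 1.x) -- value in seconds, chosen arbitrarily here
spark.core.connection.ack.wait.timeout   600
```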
I was grouping time series data by a key. I want the values to be sorted by
timestamp after the grouping.
On Fri, Aug 22, 2014 at 7:26 PM, Matthew Farrellee wrote:
> On 08/22/2014 04:32 PM, Arpan Ghosh wrote:
>
>> Is there any way to control the ordering of values for each
Is there any way to control the ordering of values for each key during a
groupByKey() operation? Is there some sort of implicit ordering in place
already?
Thanks
Arpan
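On the ordering question: groupByKey() makes no guarantee about the order of values within a group, so the usual approach is to sort each group explicitly after grouping. In PySpark that step would look like rdd.groupByKey().mapValues(lambda vs: sorted(vs)); the plain-Python stand-in below (no Spark) shows the same sort-by-timestamp logic:

```python
# Plain-Python stand-in (no Spark) for sorting each key's values by
# timestamp after a groupByKey(); groupByKey() itself gives no ordering
# guarantee for the values.
def group_and_sort(records):
    """records: iterable of (key, (timestamp, value)) pairs."""
    grouped = {}
    for key, tv in records:
        grouped.setdefault(key, []).append(tv)
    # sort each group's values by timestamp (first element of the tuple)
    return {k: sorted(vs, key=lambda tv: tv[0]) for k, vs in grouped.items()}

records = [("car1", (3, "c")), ("car1", (1, "a")), ("car1", (2, "b"))]
print(group_and_sort(records)["car1"])  # → [(1, 'a'), (2, 'b'), (3, 'c')]
```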
Hi,
I have launched an AWS Spark cluster using the spark-ec2 script
(--hadoop-major-version=1). The ephemeral-HDFS is set up correctly and I can
see the name node at :50070. When I try to copy files from
S3 into ephemeral-HDFS with distcp using the following command:
ephemeral-hdfs/bin/hadoop distcp
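For reference, a distcp invocation of the shape implied above might look like the sketch below. The bucket name and paths are placeholders; with --hadoop-major-version=1, S3 is typically addressed via s3n:// URIs, with fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey set in core-site.xml:

```
# Hypothetical sketch -- bucket and target path are placeholders
ephemeral-hdfs/bin/hadoop distcp s3n://my-bucket/input hdfs:///input
```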
The errors occur at the exact same point in the job as
well, right at the end of the groupByKey() when 5 tasks are left.
On Thu, Aug 14, 2014 at 12:59 PM, Arpan Ghosh wrote:
> Hi Davies,
>
> I tried the second option and launched my ec2 cluster with master on all
> t
> 2) try master or the 1.1 branch with the feature of spilling in Python.
>
> Davies
>
> On Wed, Aug 13, 2014 at 4:08 PM, Arpan Ghosh wrote:
> > Here are the biggest keys:
> >
> > [ (17634, 87874097),
> >
> > (8407, 38395833),
> >
> > (2
> ... this way:
>
> rdd.countByKey().sortBy(lambda x:x[1], False).take(10)
>
> Davies
>
>
> On Wed, Aug 13, 2014 at 12:21 PM, Arpan Ghosh wrote:
> > Hi,
> >
> > Let me begin by describing my Spark setup on EC2 (launched using the
> > provided spark-ec2.py script):
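A note on the snippet quoted above: in PySpark, countByKey() returns a dict on the driver rather than an RDD, so the top-N sort would happen driver-side, e.g. sorted(rdd.countByKey().items(), key=lambda kv: kv[1], reverse=True)[:10]. The plain-Python stand-in below (no Spark) shows that sort, using the two biggest keys reported in this thread:

```python
# Plain-Python stand-in: sort a countByKey()-style dict by count, descending.
counts = {17634: 87874097, 8407: 38395833, 20: 5}
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top)  # → [(17634, 87874097), (8407, 38395833)]
```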
> ... if you could test it with your real case.
>
> Davies
>
> On Wed, Aug 13, 2014 at 1:57 PM, Arpan Ghosh wrote:
> > Thanks Davies. I am running Spark 1.0.2 (which seems to be the latest
> > release)
> >
> > I'll try changing it to a reduceByKey() and check th
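The reason switching to reduceByKey() can help here: it combines values incrementally instead of materializing every value for a key in memory the way groupByKey() does. For a per-key count, the PySpark form would be rdd.mapValues(lambda v: 1).reduceByKey(lambda a, b: a + b); the plain-Python stand-in below (no Spark, hypothetical helper name) shows the incremental-combine idea:

```python
# Plain-Python stand-in: count per key by combining incrementally,
# the way reduceByKey() does, rather than collecting all values first.
def count_by_key(pairs):
    counts = {}
    for key, _ in pairs:
        counts[key] = counts.get(key, 0) + 1  # combine as we go
    return counts

print(count_by_key([("a", "x"), ("a", "y"), ("b", "z")]))  # → {'a': 2, 'b': 1}
```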
Hi,
Let me begin by describing my Spark setup on EC2 (launched using the
provided spark-ec2.py script):
- 100 c3.2xlarge workers (8 cores & 15 GB memory each)
- 1 c3.2xlarge master (only running the master daemon)
- Spark 1.0.2
- 8 GB mounted at / & 80 GB mounted at /mnt
spark-default