Re: Long Shuffle Read Blocked Time

2017-04-20 Thread Pradeep Gollakota
Hi All,

It appears that the bottleneck in my job was the EBS volumes: very high I/O
wait times across the cluster. I was only using 1 volume; increasing to 4
made it faster.
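
As a hedged illustration (mount paths are hypothetical, and on YARN/EMR the
node manager's local directories usually take precedence over this setting),
spreading Spark's shuffle spill across several volumes looks roughly like:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical mount points, one per attached volume, so map output files and
// on-disk RDD storage are not funneled through a single disk.
val conf = new SparkConf()
  .setAppName("etl-job")
  .set("spark.local.dir", "/mnt1/spark,/mnt2/spark,/mnt3/spark,/mnt4/spark")
val sc = new SparkContext(conf)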

Thanks,
Pradeep

On Thu, Apr 20, 2017 at 3:12 PM, Pradeep Gollakota <pradeep...@gmail.com>
wrote:

> Hi All,
>
> I have a simple ETL job that reads some data, shuffles it and writes it
> back out. This is running on AWS EMR 5.4.0 using Spark 2.1.0.
>
> After Stage 0 completes and the job starts Stage 1, I see a huge slowdown
> in the job. The CPU usage is low on the cluster, as is the network I/O.
> From the Spark Stats, I see large values for the Shuffle Read Blocked Time.
> As an example, one of my tasks completed in 18 minutes, but spent 15
> minutes waiting for remote reads.
>
> I'm not sure why the shuffle is so slow. Are there things I can do to
> increase the performance of the shuffle?
>
> Thanks,
> Pradeep
>


Long Shuffle Read Blocked Time

2017-04-20 Thread Pradeep Gollakota
Hi All,

I have a simple ETL job that reads some data, shuffles it and writes it
back out. This is running on AWS EMR 5.4.0 using Spark 2.1.0.

After Stage 0 completes and the job starts Stage 1, I see a huge slowdown
in the job. The CPU usage is low on the cluster, as is the network I/O.
From the Spark Stats, I see large values for the Shuffle Read Blocked Time.
As an example, one of my tasks completed in 18 minutes, but spent 15
minutes waiting for remote reads.

I'm not sure why the shuffle is so slow. Are there things I can do to
increase the performance of the shuffle?

Thanks,
Pradeep
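
A hedged sketch of shuffle-side settings one might experiment with while
investigating a question like this (the property names exist in Spark 2.1.0,
but the values below are purely illustrative, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Larger in-flight fetch buffer per reduce task (default is 48m).
  .set("spark.reducer.maxSizeInFlight", "96m")
  // Allow more reused connections between each pair of hosts when fetching blocks.
  .set("spark.shuffle.io.numConnectionsPerPeer", "2")
  // Map-output compression; already on by default, shown for completeness.
  .set("spark.shuffle.compress", "true")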


Re: Spark Website

2016-07-13 Thread Pradeep Gollakota
Worked for me if I go to https://spark.apache.org/site/ but not
https://spark.apache.org

On Wed, Jul 13, 2016 at 11:48 AM, Maurin Lenglart 
wrote:

> Same here
>
>
>
> *From: *Benjamin Kim 
> *Date: *Wednesday, July 13, 2016 at 11:47 AM
> *To: *manish ranjan 
> *Cc: *user 
> *Subject: *Re: Spark Website
>
>
>
> It takes me to the directories instead of the webpage.
>
>
>
> On Jul 13, 2016, at 11:45 AM, manish ranjan  wrote:
>
>
>
> working for me. What do you mean 'as supposed to'?
>
>
> ~Manish
>
>
>
> On Wed, Jul 13, 2016 at 11:45 AM, Benjamin Kim  wrote:
>
> Has anyone noticed that the spark.apache.org is not working as supposed
> to?
>
>


Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Pradeep Gollakota
IIRC, TextInputFormat supports an input path that is a comma separated
list. I haven't tried this, but I think you should just be able to do
sc.textFile("file1,file2,...")
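
An untested sketch of that call, with hypothetical paths:

// If TextInputFormat honors the comma-separated form, this yields a single
// RDD over both files; a later reply in this thread reports it did not work
// as hoped, so treat this purely as an illustration of the suggestion.
val paths = Seq("hdfs:///data/file1.txt", "hdfs:///data/file2.txt")
val lines = sc.textFile(paths.mkString(","))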

On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang  wrote:

> I know these workarounds, but wouldn't it be more convenient and
> straightforward to use SparkContext#textFiles?
>
> On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra 
> wrote:
>
>> For more than a small number of files, you'd be better off using
>> SparkContext#union instead of RDD#union.  That will avoid building up a
>> lengthy lineage.
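
A hedged sketch of that approach, with hypothetical file names:

// Build the RDDs first, then union them in a single SparkContext#union call,
// which adds one UnionRDD node instead of a long chain of RDD#union calls.
val rdds = Seq("file1", "file2", "file3").map(path => sc.textFile(path))
val all  = sc.union(rdds)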
>>
>> On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky 
>> wrote:
>>
>>> Hey Jeff,
>>> Do you mean reading from multiple text files? In that case, as a
>>> workaround, you can use the RDD#union() (or ++) method to concatenate
>>> multiple rdds. For example:
>>>
>>> val lines1 = sc.textFile("file1")
>>> val lines2 = sc.textFile("file2")
>>>
>>> val rdd = lines1 union lines2
>>>
>>> regards,
>>> --Jakob
>>>
>>> On 11 November 2015 at 01:20, Jeff Zhang  wrote:
>>>
 Although users can use the HDFS glob syntax to specify multiple inputs,
 sometimes it is not convenient to do that. I'm not sure why there's no
 SparkContext#textFiles API; it should be easy to implement. I'd love to
 create a ticket and contribute it if there's no other consideration that
 I'm not aware of.
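
As an illustration, the glob route mentioned above looks roughly like this
(paths hypothetical):

val fromGlob = sc.textFile("hdfs:///logs/2015-11-*/part-*")   // wildcard over many files
val fromPair = sc.textFile("hdfs:///logs/{file1,file2}.txt")  // brace expansion for an explicit pair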

 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Pradeep Gollakota
Looks like what I was suggesting doesn't work. :/

On Wed, Nov 11, 2015 at 4:49 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> Yes, that's what I suggest. TextInputFormat supports multiple inputs, so on
> the Spark side we just need to provide an API for that.
>
> On Thu, Nov 12, 2015 at 8:45 AM, Pradeep Gollakota <pradeep...@gmail.com>
> wrote:
>
>> IIRC, TextInputFormat supports an input path that is a comma separated
>> list. I haven't tried this, but I think you should just be able to do
>> sc.textFile("file1,file2,...")
>>
>> On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> I know these workarounds, but wouldn't it be more convenient and
>>> straightforward to use SparkContext#textFiles?
>>>
>>> On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra <m...@clearstorydata.com>
>>> wrote:
>>>
>>>> For more than a small number of files, you'd be better off using
>>>> SparkContext#union instead of RDD#union.  That will avoid building up a
>>>> lengthy lineage.
>>>>
>>>> On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky <joder...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey Jeff,
>>>>> Do you mean reading from multiple text files? In that case, as a
>>>>> workaround, you can use the RDD#union() (or ++) method to concatenate
>>>>> multiple rdds. For example:
>>>>>
>>>>> val lines1 = sc.textFile("file1")
>>>>> val lines2 = sc.textFile("file2")
>>>>>
>>>>> val rdd = lines1 union lines2
>>>>>
>>>>> regards,
>>>>> --Jakob
>>>>>
>>>>> On 11 November 2015 at 01:20, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>
>>>>>> Although users can use the HDFS glob syntax to specify multiple inputs,
>>>>>> sometimes it is not convenient to do that. I'm not sure why there's no
>>>>>> SparkContext#textFiles API; it should be easy to implement. I'd love to
>>>>>> create a ticket and contribute it if there's no other consideration that
>>>>>> I'm not aware of.
>>>>>>
>>>>>> --
>>>>>> Best Regards
>>>>>>
>>>>>> Jeff Zhang
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>