Re: Spark reading from S3 getting very slow

2015-11-05 Thread Steve Loughran

There's a performance problem in S3n on Hadoop 2.6, where the jets3t library
scans through the tail of the file on a close(). S3a on Hadoop 2.7+ doesn't
have this problem.
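As a rough sketch of what switching to S3a could look like from the command line: the helper below rewrites an s3n:// path to s3a://, and a second function passes S3a credentials through Spark's spark.hadoop.* configuration prefix. The job class, jar name and environment variable names are placeholders, not taken from the thread, and the sketch assumes a Hadoop 2.7+ build with the hadoop-aws module on the classpath.

```shell
#!/usr/bin/env bash
# Sketch only: the class, jar and environment variable names are placeholders.

# Rewrite an s3n:// URI to the s3a:// scheme; other URIs pass through unchanged.
to_s3a() {
  printf '%s\n' "$1" | sed 's|^s3n://|s3a://|'
}

# Submit a job against the rewritten path (defined here, not invoked).
# Assumes Hadoop 2.7+ with hadoop-aws available, and credentials exported
# as AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.
submit_with_s3a() {
  local input
  input="$(to_s3a "$1")"
  spark-submit \
    --conf "spark.hadoop.fs.s3a.access.key=${AWS_ACCESS_KEY_ID}" \
    --conf "spark.hadoop.fs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY}" \
    --class com.example.MyJob my-job.jar "$input"
}
```

For example, `to_s3a s3n://my-bucket/logs/big.txt` prints `s3a://my-bucket/logs/big.txt`. Keys passed on the command line are visible to other users on the host, so a configuration file may be a better place for them in practice.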


Spark reading from S3 getting very slow

2015-11-04 Thread Younes Naguib
Hi all,

I'm reading large text files from S3, between 30GB and 40GB in size.
Every stage runs in 8-9s, except the last 32, which jump to 1-2 min for some reason!
Here is my sample code:
val myDF = sc.textFile(input_file).map { x =>
  val p = x.split("\t", -1)
  new ()
}.toDF()

myDF.registerTempTable("tbl")
sqlContext.sql("select count(1) from tbl").collect()

Any help/idea?

Thanks,
Younes Naguib


Re: spark, reading from s3

2015-02-12 Thread Kane Kim
The thing is that my time is perfectly valid...

Re: spark, reading from s3

2015-02-12 Thread Franc Carter
Check that your timezone is correct as well; an incorrect timezone can make
it look like your time is correct when it is actually skewed.

cheers


Re: spark, reading from s3

2015-02-12 Thread Kane Kim
Looks like my clock is in sync:

-bash-4.1$ date && curl -v s3.amazonaws.com
Thu Feb 12 21:40:18 UTC 2015
* About to connect() to s3.amazonaws.com port 80 (#0)
*   Trying 54.231.12.24... connected
* Connected to s3.amazonaws.com (54.231.12.24) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2
> Host: s3.amazonaws.com
> Accept: */*
>
< HTTP/1.1 307 Temporary Redirect
< x-amz-id-2: sl8Tg81ZnBj3tD7Q9f2KFBBZKC83TbAUieHJu9IA3PrBibvB3M7NpwAlfTi/Tdwg
< x-amz-request-id: 48C14DF82BE1A970
< Date: Thu, 12 Feb 2015 21:40:19 GMT
< Location: http://aws.amazon.com/s3/
< Content-Length: 0
< Server: AmazonS3
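
The comparison between the local `date` output and the `Date:` response header can also be turned into a number. The following is a minimal sketch, assuming GNU `date` (for the `-d` option); the function names are made up for illustration. For reference, S3 rejects requests with RequestTimeTooSkewed when the difference exceeds 15 minutes.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative; relies on GNU date's -d option.

# Print (local - server) in seconds, given an HTTP Date header value and,
# optionally, a local time string (defaults to the current time).
skew_seconds() {
  local server_epoch local_epoch
  server_epoch=$(date -u -d "$1" +%s) || return 1
  local_epoch=$(date -u -d "${2:-now}" +%s) || return 1
  echo $(( local_epoch - server_epoch ))
}

# Fetch the Date header value from S3 (needs network access; defined here,
# not invoked).
s3_date_header() {
  curl -sI http://s3.amazonaws.com | tr -d '\r' | sed -n 's/^[Dd]ate: //p'
}
```

With the values from the output above, `skew_seconds 'Thu, 12 Feb 2015 21:40:19 GMT' '2015-02-12 21:40:18 UTC'` prints `-1`, i.e. the local clock is one second behind the server. Against the live service, `skew_seconds "$(s3_date_header)"` gives the current offset.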


Re: spark, reading from s3

2015-02-10 Thread Akhil Das
It's actually a timezone issue. You can either use NTP to keep the system
clock accurate, or adjust your system time to match the AWS one. You can
check the AWS time with:

telnet s3.amazonaws.com 80
GET / HTTP/1.0



Thanks
Best Regards

spark, reading from s3

2015-02-10 Thread Kane Kim
I'm getting this warning when using S3 input:
15/02/11 00:58:37 WARN RestStorageService: Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 0 seconds. Retrying connection.

After that there are tons of 403/Forbidden errors and then the job fails.
It's sporadic: sometimes I get this error and sometimes not. What could be
the issue?
Could it be related to network connectivity?