Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-25 Thread Mario Ds Briggs

Hi Daniel,

can you give it a try in IBM's Analytics for Spark? The fix has been in
for a week now


thanks
Mario



From:   Daniel Lopes <dan...@onematch.com.br>
To: Mario Ds Briggs/India/IBM@IBMIN
Cc: Adam Roberts <arobe...@uk.ibm.com>, user
<user@spark.apache.org>, Steve Loughran
<ste...@hortonworks.com>, Sachin Aggarwal4/India/IBM@IBMIN
Date:   14/09/2016 01:19 am
Subject:    Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix



Hi Mario,

Thanks for your help, so I will keep using CSVs

Best,

Daniel Lopes
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br

On Mon, Sep 12, 2016 at 3:39 PM, Mario Ds Briggs <mario.bri...@in.ibm.com>
wrote:
  Daniel,

  I believe it is related to
  https://issues.apache.org/jira/browse/SPARK-13979 and happens only when a
  task fails in an executor (you probably hit the latter with Parquet and not
  CSV for some other reason).

  The PR there should be available shortly in IBM's Analytics for
  Spark.


  thanks
  Mario


  From: Adam Roberts/UK/IBM
  To: Mario Ds Briggs/India/IBM@IBMIN
  Date: 12/09/2016 09:37 pm
  Subject: Fw: Spark + Parquet + IBM Block Storage at Bluemix


  Mario, in case you've not seen this...
 
 
 
 
 
 Adam Roberts
 IBM Spark Team Lead
 Runtime Technologies - Hursley

  - Forwarded by Adam Roberts/UK/IBM on 12/09/2016 17:06 -

  From: Daniel Lopes <dan...@onematch.com.br>
  To: Steve Loughran <ste...@hortonworks.com>
  Cc: user <user@spark.apache.org>
  Date: 12/09/2016 13:05
  Subject: Re: Spark + Parquet + IBM Block Storage at Bluemix




  Thanks Steve,

  But this error occurs only with Parquet files; CSVs work.

  Best,

  Daniel Lopes
  Chief Data and Analytics Officer | OneMatch
  c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

  www.onematch.com.br

  On Sun, Sep 11, 2016 at 3:28 PM, Steve Loughran <ste...@hortonworks.com>
  wrote:
On 9 Sep 2016, at 17:56, Daniel Lopes <dan...@onematch.com.br> wrote:

Hi, can someone help?

I'm trying to use Parquet on IBM Block Storage with Spark, but when I
try to load I get this error:

using this config

credentials = {
  "name": "keystone",
  "auth_url": "https://identity.open.softlayer.com;,
  "project":
"object_storage_23f274c1_d11XXXe634",
  "projectId": "XXd9c4aa39b7c7eb",
  "region": "dallas",
  "userId": "X64087180b40X2b909",
  "username": "admin_9dd810f8901d48778XX",
  "password": "chX6_",
  "domainId": "c1ddad17cfcX41",
  "domainName": "10XX",
  "role": "admin"
}

def set_hadoop_config(credentials):
    """This function sets the Hadoop configuration with given credentials,
    so it is possible to access data using SparkContext"""

    prefix = "fs.swift.service." + credentials['name']
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + ".auth.url", credentials['auth_url']+'/v3/auth/tokens')
    hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
    hconf.set(prefix + ".tenant", credentials['projectId'])
    hconf.set(prefix + ".username", credentials['userId'])
    hconf.set(prefix + ".password", credentials['password'])
    hconf.setInt(prefix + ".http.port", 8080)
    hconf.set(prefix + ".region", credentials['region'])
    hconf.setBoolean(prefix + ".public", True)

set_hadoop_config(credentials)

Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-13 Thread Daniel Lopes
Hi Mario,

Thanks for your help, so I will keep using CSVs

Best,

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br


On Mon, Sep 12, 2016 at 3:39 PM, Mario Ds Briggs 
wrote:

> Daniel,
>
> I believe it is related to
> https://issues.apache.org/jira/browse/SPARK-13979 and happens only when a
> task fails in an executor (you probably hit the latter with Parquet and not
> CSV for some other reason).
>
> The PR there should be available shortly in IBM's Analytics for Spark.
>
>
> thanks
> Mario
>
>
> From: Adam Roberts/UK/IBM
> To: Mario Ds Briggs/India/IBM@IBMIN
> Date: 12/09/2016 09:37 pm
> Subject: Fw: Spark + Parquet + IBM Block Storage at Bluemix
> --
>
>
> Mario, in case you've not seen this...
>
> --
> *Adam Roberts*
> IBM Spark Team Lead
> Runtime Technologies - Hursley
> - Forwarded by Adam Roberts/UK/IBM on 12/09/2016 17:06 -
>
> From: Daniel Lopes 
> To: Steve Loughran 
> Cc: user 
> Date: 12/09/2016 13:05
> Subject: Re: Spark + Parquet + IBM Block Storage at Bluemix
> --
>
>
>
> Thanks Steve,
>
> But this error occurs only with Parquet files; CSVs work.
>
> Best,
>
> *Daniel Lopes*
> Chief Data and Analytics Officer | OneMatch
> c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
>
> www.onematch.com.br
>
> On Sun, Sep 11, 2016 at 3:28 PM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>
>    On 9 Sep 2016, at 17:56, Daniel Lopes <dan...@onematch.com.br> wrote:
>
>  Hi, can someone help?
>
>  I'm trying to use Parquet on IBM Block Storage with Spark, but when
>  I try to load I get this error:
>
>  using this config
>
>  credentials = {
>"name": "keystone",
>"auth_url": "https://identity.open.softlayer.com",
>"project": "object_storage_23f274c1_d11XXXe634",
>"projectId": "XXd9c4aa39b7c7eb",
>"region": "dallas",
>"userId": "X64087180b40X2b909",
>"username": "admin_9dd810f8901d48778XX",
>"password": "chX6_",
>"domainId": "c1ddad17cfcX41",
>"domainName": "10XX",
>"role": "admin"
>  }
>
>  def set_hadoop_config(credentials):
>  """This function sets the Hadoop configuration with given
>  credentials,
>  so it is possible to access data using SparkContext"""
>
>  prefix = "fs.swift.service." + credentials['name']
>  hconf = sc._jsc.hadoopConfiguration()
>  hconf.set(prefix + ".auth.url",
>  credentials['auth_url']+'/v3/auth/tokens')
>  hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
>  hconf.set(prefix + ".tenant", credentials['projectId'])
>  hconf.set(prefix + ".username", credentials['userId'])
>  hconf.set(prefix + ".password", credentials['password'])
>  hconf.setInt(prefix + ".http.port", 8080)
>  hconf.set(prefix + ".region", credentials['region'])
>  hconf.setBoolean(prefix + ".public", True)
>
>  set_hadoop_config(credentials)
>
>  -
>
>  Py4JJavaError Traceback (most recent call last)
>   in ()
>  > 1 train.groupby('Acordo').count().show()
>
>  Py4JJavaError: An error occurred while calling o406.showString.
>  : org.apache.spark.SparkException: Job aborted due to stage
>  failure: Task 60 in stage 30.0 failed 10 times, most recent failure:
>  Lost task 60.9 in stage 30.0 (TID 2556, yp-spark-dal09-env5-0039):
>  org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException:
>  Missing mandatory configuration option:
>  fs.swift.service.keystone.auth.url
>
>
>In my own code, I'd assume that the value of credentials['name']
>didn't match that of the URL, assuming you have something like
>swift://bucket.keystone. Failing that: the options were set too late.
>
>Instead of asking for the Hadoop config and editing that, set the
>options on your Spark context before it is launched, with the prefix
>"hadoop".
>
>at 
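Steve's suggestion above (set the options on the Spark configuration, before the context is launched, with a "hadoop" prefix) can be sketched as follows. This is a minimal sketch, not code from the thread: the helper name and example values are hypothetical, and only the fs.swift.service.* option names come from the original set_hadoop_config function.

```python
# Sketch: build the Swift options as "spark.hadoop."-prefixed SparkConf
# entries instead of mutating the Hadoop configuration after the
# SparkContext exists. Spark strips the "spark.hadoop." prefix and
# forwards the rest to the Hadoop configuration at context launch.

def swift_spark_conf_entries(credentials):
    """Return the thread's Swift options as 'spark.hadoop.*' SparkConf
    entries, keyed off credentials['name']."""
    prefix = "spark.hadoop.fs.swift.service." + credentials["name"]
    return {
        prefix + ".auth.url": credentials["auth_url"] + "/v3/auth/tokens",
        prefix + ".auth.endpoint.prefix": "endpoints",
        prefix + ".tenant": credentials["projectId"],
        prefix + ".username": credentials["userId"],
        prefix + ".password": credentials["password"],
        prefix + ".http.port": "8080",
        prefix + ".region": credentials["region"],
        prefix + ".public": "true",
    }

# These entries would be applied before the context is created, e.g.:
#   conf = SparkConf()
#   for k, v in swift_spark_conf_entries(credentials).items():
#       conf.set(k, v)
#   sc = SparkContext(conf=conf)
```

Setting the options this way guarantees they are present before any executor task first resolves the swift:// filesystem, which is the "set too late" failure mode Steve describes.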

Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-12 Thread Mario Ds Briggs


Daniel,

I believe it is related to
https://issues.apache.org/jira/browse/SPARK-13979 and happens only when a
task fails in an executor (you probably hit the latter with Parquet and not
CSV for some other reason).

The PR there should be available shortly in IBM's Analytics for Spark.


thanks
Mario



From:   Adam Roberts/UK/IBM
To: Mario Ds Briggs/India/IBM@IBMIN
Date:   12/09/2016 09:37 pm
Subject:Fw: Spark + Parquet + IBM Block Storage at Bluemix


Mario, in case you've not seen this...

Adam Roberts
IBM Spark Team Lead
Runtime Technologies - Hursley



- Forwarded by Adam Roberts/UK/IBM on 12/09/2016 17:06 -

From:   Daniel Lopes 
To: Steve Loughran 
Cc: user 
Date:   12/09/2016 13:05
Subject:Re: Spark + Parquet + IBM Block Storage at Bluemix



Thanks Steve,

But this error occurs only with Parquet files; CSVs work.

Best,

Daniel Lopes
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br

On Sun, Sep 11, 2016 at 3:28 PM, Steve Loughran 
wrote:

On 9 Sep 2016, at 17:56, Daniel Lopes wrote:

Hi, can someone help?

I'm trying to use Parquet on IBM Block Storage with Spark, but when I
try to load I get this error:

using this config

credentials = {
  "name": "keystone",
  "auth_url": "https://identity.open.softlayer.com;,
  "project": "object_storage_23f274c1_d11XXXe634",
  "projectId": "XXd9c4aa39b7c7eb",
  "region": "dallas",
  "userId": "X64087180b40X2b909",
  "username": "admin_9dd810f8901d48778XX",
  "password": "chX6_",
  "domainId": "c1ddad17cfcX41",
  "domainName": "10XX",
  "role": "admin"
}

def set_hadoop_config(credentials):
    """This function sets the Hadoop configuration with given
credentials,
    so it is possible to access data using SparkContext"""

    prefix = "fs.swift.service." + credentials['name']
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + ".auth.url", credentials
['auth_url']+'/v3/auth/tokens')
    hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
    hconf.set(prefix + ".tenant", credentials['projectId'])
    hconf.set(prefix + ".username", credentials['userId'])
    hconf.set(prefix + ".password", credentials['password'])
    hconf.setInt(prefix + ".http.port", 8080)
    hconf.set(prefix + ".region", credentials['region'])
    hconf.setBoolean(prefix + ".public", True)

set_hadoop_config(credentials)

-

Py4JJavaError Traceback (most recent call last)
 in ()
> 1 train.groupby('Acordo').count().show()

Py4JJavaError: An error occurred while calling o406.showString.
: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 60 in stage 30.0 failed 10 times, most recent
failure: Lost task 60.9 in stage 30.0 (TID 2556,
yp-spark-dal09-env5-0039):
org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException:
Missing mandatory configuration option:
fs.swift.service.keystone.auth.url
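Steve's other diagnosis earlier in the thread is that the service name in the swift:// URL must match credentials['name'] (something like swift://bucket.keystone); otherwise Hadoop looks up fs.swift.service.<name>.* options for a service that was never configured, which produces exactly this "missing mandatory configuration option" error. A small sketch, with hypothetical container and path names:

```python
def swift_url(container, service_name, path):
    """Build a Hadoop swift:// URL. The suffix after the dot must match
    the <name> used in the fs.swift.service.<name>.* options; a mismatch
    means those options are never found at read time."""
    return "swift://{0}.{1}/{2}".format(container, service_name, path)

# With credentials['name'] == "keystone":
url = swift_url("notebooks", "keystone", "train.parquet")
# sqlContext.read.parquet(url) would then resolve the
# fs.swift.service.keystone.* options set by set_hadoop_config.
```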