Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix

Mario Ds Briggs Sun, 25 Sep 2016 20:54:07 -0700

Hi Daniel,

Could you give it a try in IBM's Analytics for Spark? The fix has been in
for a week now.



thanks
Mario



From:   Daniel Lopes <dan...@onematch.com.br>
To:     Mario Ds Briggs/India/IBM@IBMIN
Cc:     Adam Roberts <arobe...@uk.ibm.com>, user
            <user@spark.apache.org>, Steve Loughran
            <ste...@hortonworks.com>, Sachin Aggarwal4/India/IBM@IBMIN
Date:   14/09/2016 01:19 am
Subject:        Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix



Hi Mario,

Thanks for your help, so I will keep using CSVs.

Best,

Daniel Lopes
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br

On Mon, Sep 12, 2016 at 3:39 PM, Mario Ds Briggs <mario.bri...@in.ibm.com>
wrote:
  Daniel,

  I believe it is related to
  https://issues.apache.org/jira/browse/SPARK-13979 and happens only when a
  task fails in an executor (probably for some other reason you hit the
  latter with Parquet and not with CSV).

  The PR there should be available shortly in IBM's Analytics for Spark.


  thanks
  Mario


  From: Adam Roberts/UK/IBM
  To: Mario Ds Briggs/India/IBM@IBMIN
  Date: 12/09/2016 09:37 pm
  Subject: Fw: Spark + Parquet + IBM Block Storage at Bluemix


  Mario, in case you've not seen this...

  Adam Roberts
  IBM Spark Team Lead
  Runtime Technologies - Hursley


  ----- Forwarded by Adam Roberts/UK/IBM on 12/09/2016 17:06 -----

  From: Daniel Lopes <dan...@onematch.com.br>
  To: Steve Loughran <ste...@hortonworks.com>
  Cc: user <user@spark.apache.org>
  Date: 12/09/2016 13:05
  Subject: Re: Spark + Parquet + IBM Block Storage at Bluemix




  Thanks Steve,

  But this error occurs only with Parquet files; CSVs work.

  Best,

  Daniel Lopes
  Chief Data and Analytics Officer | OneMatch
  c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

  www.onematch.com.br

  On Sun, Sep 11, 2016 at 3:28 PM, Steve Loughran <ste...@hortonworks.com>
  wrote:
                    On 9 Sep 2016, at 17:56, Daniel Lopes <dan...@onematch.com.br> wrote:

                    Hi, can someone help?

                    I'm trying to use Parquet in IBM Block Storage with Spark,
                    but when I try to load the data I get the error below.

                    I'm using this config:

                    credentials = {
                      "name": "keystone",
                      "auth_url": "https://identity.open.softlayer.com",
                      "project": "object_storage_23f274c1_d11XXXXXXXXXXXXXXXe634",
                      "projectId": "XXXXXXd9c4aa39b7c7eCCCCCCCCb",
                      "region": "dallas",
                      "userId": "XXXXX64087180b40XXXXX2b909",
                      "username": "admin_XXXX9dd810f8901d48778XXXXXX",
                      "password": "chXXXXXXXXXXXXX6_",
                      "domainId": "c1ddad17cfcXXXXXXXXX41",
                      "domainName": "10XXXXXX",
                      "role": "admin"
                    }

                    def set_hadoop_config(credentials):
                        """This function sets the Hadoop configuration with given credentials,
                        so it is possible to access data using SparkContext"""

                        prefix = "fs.swift.service." + credentials['name']
                        hconf = sc._jsc.hadoopConfiguration()
                        hconf.set(prefix + ".auth.url", credentials['auth_url'] + '/v3/auth/tokens')
                        hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
                        hconf.set(prefix + ".tenant", credentials['projectId'])
                        hconf.set(prefix + ".username", credentials['userId'])
                        hconf.set(prefix + ".password", credentials['password'])
                        hconf.setInt(prefix + ".http.port", 8080)
                        hconf.set(prefix + ".region", credentials['region'])
                        hconf.setBoolean(prefix + ".public", True)

                    set_hadoop_config(credentials)

                    -------------------------------------------------

                    Py4JJavaError                Traceback (most recent call last)
                    <ipython-input-55-5a14928215eb> in <module>()
                    ----> 1 train.groupby('Acordo').count().show()

                    Py4JJavaError: An error occurred while calling o406.showString.
                    : org.apache.spark.SparkException: Job aborted due to stage failure:
                    Task 60 in stage 30.0 failed 10 times, most recent failure:
                    Lost task 60.9 in stage 30.0 (TID 2556, yp-spark-dal09-env5-0039):
                    org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException:
                    Missing mandatory configuration option: fs.swift.service.keystone.auth.url


        In my own code, I'd assume that the value of credentials['name']
        didn't match that of the URL, assuming you have something like
        swift://bucket.keystone . Failing that: the options were set too
        late.
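
The first possibility above can be checked mechanically. The following is a
minimal sketch, assuming the swift://<container>.<service>/<path> URL layout
implied by the swift://bucket.keystone example; the helper names
swift_service_name and service_matches are hypothetical and not part of any
library.

```python
from urllib.parse import urlparse


def swift_service_name(url):
    """Return the <service> part of a swift://<container>.<service>/<path> URL."""
    parsed = urlparse(url)
    if parsed.scheme != "swift":
        raise ValueError("not a swift:// URL: %r" % url)
    # Assumption: the host is "<container>.<service>", split at the first dot.
    container, dot, service = parsed.netloc.partition(".")
    if not dot or not container or not service:
        raise ValueError("expected swift://<container>.<service>/..., got %r" % url)
    return service


def service_matches(url, credentials):
    """True when the URL's service name equals credentials['name']."""
    return swift_service_name(url) == credentials["name"]
```

If service_matches returns False for the path being loaded, the
fs.swift.service.<name>.* options were set under a different service name
than the one the URL refers to.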

        Instead of asking for the Hadoop config and editing that, set the
        options in your Spark configuration, before the context is launched,
        with the prefix "spark.hadoop."
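
A rough sketch of that suggestion, assuming the same credentials dictionary
as in the original message: Spark copies properties whose names start with
"spark.hadoop." into the Hadoop configuration it creates, so the Swift
options can be supplied through SparkConf before the context exists. The
helper name swift_spark_conf is hypothetical. In a hosted notebook where sc
is already created for you, this may not be directly applicable.

```python
def swift_spark_conf(credentials):
    """Build (key, value) pairs mirroring set_hadoop_config, with the
    "spark.hadoop." prefix so Spark applies them at context start-up."""
    prefix = "spark.hadoop.fs.swift.service." + credentials["name"]
    return [
        (prefix + ".auth.url", credentials["auth_url"] + "/v3/auth/tokens"),
        (prefix + ".auth.endpoint.prefix", "endpoints"),
        (prefix + ".tenant", credentials["projectId"]),
        (prefix + ".username", credentials["userId"]),
        (prefix + ".password", credentials["password"]),
        # SparkConf values are strings, so the port and flag are given as text.
        (prefix + ".http.port", "8080"),
        (prefix + ".region", credentials["region"]),
        (prefix + ".public", "true"),
    ]


# Usage (requires pyspark; run before any SparkContext is created):
#   from pyspark import SparkConf, SparkContext
#   conf = SparkConf().setAll(swift_spark_conf(credentials))
#   sc = SparkContext(conf=conf)
```

Because the settings travel with the Spark configuration rather than being
patched into an already running context, they are in place from the moment
the context starts.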

                    at org.apache.hadoop.fs.swift.http.RestClientBindings.copy(RestClientBindings.java:223)
                    at org.apache.hadoop.fs.swift.http.RestClientBindings.bind(RestClientBindings.java:147)


                    Daniel Lopes
                    Chief Data and Analytics Officer | OneMatch
                    c: +55 (18) 99764-2733 |
                    https://www.linkedin.com/in/dslopes

                    www.onematch.com.br
