Daniel,

I believe it is related to
https://issues.apache.org/jira/browse/SPARK-13979, and it happens only when
a task fails in an executor (presumably a task failed, for some unrelated
reason, with Parquet but not with CSV, which is why only Parquet hits it).

The PR in there should be available shortly in IBM's Analytics for Spark.


thanks
Mario



From:   Adam Roberts/UK/IBM
To:     Mario Ds Briggs/India/IBM@IBMIN
Date:   12/09/2016 09:37 pm
Subject:        Fw: Spark + Parquet + IBM Block Storage at Bluemix


Mario, in case you've not seen this...

Adam Roberts
IBM Spark Team Lead
Runtime Technologies - Hursley



----- Forwarded by Adam Roberts/UK/IBM on 12/09/2016 17:06 -----

From:   Daniel Lopes <dan...@onematch.com.br>
To:     Steve Loughran <ste...@hortonworks.com>
Cc:     user <user@spark.apache.org>
Date:   12/09/2016 13:05
Subject:        Re: Spark + Parquet + IBM Block Storage at Bluemix



Thanks Steve,

But this error occurs only with Parquet files; CSVs work.

Best,

Daniel Lopes
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br

On Sun, Sep 11, 2016 at 3:28 PM, Steve Loughran <ste...@hortonworks.com>
wrote:

        On 9 Sep 2016, at 17:56, Daniel Lopes <dan...@onematch.com.br>
        wrote:

        Hi, can someone help?

        I'm trying to use Parquet on IBM Block Storage with Spark, but
        when I try to load the data I get this error:

        using this config

        credentials = {
          "name": "keystone",
          "auth_url": "https://identity.open.softlayer.com";,
          "project": "object_storage_23f274c1_d11XXXXXXXXXXXXXXXe634",
          "projectId": "XXXXXXd9c4aa39b7c7eCCCCCCCCb",
          "region": "dallas",
          "userId": "XXXXX64087180b40XXXXX2b909",
          "username": "admin_XXXX9dd810f8901d48778XXXXXX",
          "password": "chXXXXXXXXXXXXX6_",
          "domainId": "c1ddad17cfcXXXXXXXXX41",
          "domainName": "10XXXXXX",
          "role": "admin"
        }

        def set_hadoop_config(credentials):
            """This function sets the Hadoop configuration with given
            credentials, so it is possible to access data using SparkContext"""

            prefix = "fs.swift.service." + credentials['name']
            hconf = sc._jsc.hadoopConfiguration()
            hconf.set(prefix + ".auth.url", credentials['auth_url'] + '/v3/auth/tokens')
            hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
            hconf.set(prefix + ".tenant", credentials['projectId'])
            hconf.set(prefix + ".username", credentials['userId'])
            hconf.set(prefix + ".password", credentials['password'])
            hconf.setInt(prefix + ".http.port", 8080)
            hconf.set(prefix + ".region", credentials['region'])
            hconf.setBoolean(prefix + ".public", True)

        set_hadoop_config(credentials)
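
        For reference, a Parquet read against this service would then look
        something like the following; the container and path are
        placeholders, and the "keystone" after the dot must match
        credentials['name']:

            train = sqlContext.read.parquet(
                "swift://mycontainer.keystone/path/to/train.parquet")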

        -------------------------------------------------

        Py4JJavaError  Traceback (most recent call last)
        <ipython-input-55-5a14928215eb> in <module>()
        ----> 1 train.groupby('Acordo').count().show()

        Py4JJavaError: An error occurred while calling o406.showString.
        : org.apache.spark.SparkException: Job aborted due to stage
        failure: Task 60 in stage 30.0 failed 10 times, most recent
        failure: Lost task 60.9 in stage 30.0 (TID 2556,
        yp-spark-dal09-env5-0039):
        org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException:
        Missing mandatory configuration option:
        fs.swift.service.keystone.auth.url


  In my own code, I'd assume that the value of credentials['name'] didn't
  match that of the URL, assuming you have something like
  swift://bucket.keystone. Failing that: the options were set too late.

  Instead of asking for the Hadoop configuration and editing that, set the
  options on your Spark context, before it is launched, with the prefix
  "hadoop" (a sketch of this follows below).
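
  A minimal sketch of that suggestion in PySpark (assuming the same
  credentials dict as above; note that in a managed notebook a
  SparkContext may already exist, so these options would have to be in
  place before the kernel creates it):

      # Keys with the "spark.hadoop." prefix are copied by Spark into the
      # Hadoop configuration used on both the driver and the executors.
      from pyspark import SparkConf, SparkContext

      prefix = "spark.hadoop.fs.swift.service." + credentials['name']
      conf = (SparkConf()
              .set(prefix + ".auth.url",
                   credentials['auth_url'] + '/v3/auth/tokens')
              .set(prefix + ".auth.endpoint.prefix", "endpoints")
              .set(prefix + ".tenant", credentials['projectId'])
              .set(prefix + ".username", credentials['userId'])
              .set(prefix + ".password", credentials['password'])
              .set(prefix + ".http.port", "8080")
              .set(prefix + ".region", credentials['region'])
              .set(prefix + ".public", "true"))
      sc = SparkContext(conf=conf)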


        at org.apache.hadoop.fs.swift.http.RestClientBindings.copy(RestClientBindings.java:223)
        at org.apache.hadoop.fs.swift.http.RestClientBindings.bind(RestClientBindings.java:147)


        Daniel Lopes
        Chief Data and Analytics Officer | OneMatch
        c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

        www.onematch.com.br


