Hi Freedafeng
The following works for me; df will be a DataFrame. fullPath is a list of
the various part files stored in S3.
fullPath = ['s3n:///json/StreamingKafkaCollector/s1/2016-07-10/146817304/part-r-0-a2121800-fa5b-44b1-a994-67795']
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('json').load(fullPath).select('key1')  # .cache()
My file is raw JSON. You might have to tweak the above statement to work
with gz files.
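One thing to note: Spark's JSON source decompresses .gz files transparently, but it expects line-delimited JSON, i.e. one complete JSON object per line. A quick pure-Python sketch of that format assumption (the file name, keys, and values below are made up for illustration):

```python
import gzip
import json
import os
import tempfile

# Write a small line-delimited JSON file, gzipped -- the layout
# Spark's json reader expects: one JSON object per line.
records = [{"key1": "a", "key2": 1}, {"key1": "b", "key2": 2}]
path = os.path.join(tempfile.mkdtemp(), "part-r-00000.json.gz")
with gzip.open(path, "wt") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read it back the way Spark would: decompress, then parse line by line.
with gzip.open(path, "rt") as f:
    parsed = [json.loads(line) for line in f]

keys = [rec["key1"] for rec in parsed]  # analogous to .select('key1')
print(keys)
```

If your gz files hold one big pretty-printed JSON document instead of one object per line, the reader will not parse them correctly, which is the usual tweak needed.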
Andy
From: freedafeng
Date: Wednesday, July 27, 2016 at 10:36 AM
To: "user @spark"
Subject: spark 1.6.0 read s3 files error.
> cdh 5.7.1. pyspark.
>
> codes: ===
> from pyspark import SparkContext, SparkConf
>
> conf = SparkConf().setAppName('s3 ---')
> sc = SparkContext(conf=conf)
>
> myRdd =
> sc.textFile("s3n:///y=2016/m=5/d=26/h=20/2016.5.26.21.9.52.6d53180a-28b9-4
> e65-b749-b4a2694b9199.json.gz")
>
> count = myRdd.count()
> print "The count is", count
>
> ===
> standalone mode: command line:
>
> AWS_ACCESS_KEY_ID=??? AWS_SECRET_ACCESS_KEY=??? ./bin/spark-submit
> --driver-memory 4G --master spark://master:7077 --conf
> "spark.default.parallelism=70" /root/workspace/test/s3.py
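> One thing worth checking in the path above: "s3n:///y=2016/..." has three
> slashes, i.e. an empty bucket name (possibly just redacted for the list).
> A quick sanity check with urlparse (the bucket name below is hypothetical):

```python
from urllib.parse import urlparse

def s3_bucket(path):
    """Return the bucket (netloc) of an s3/s3n/s3a URL, raising if it is empty."""
    parsed = urlparse(path)
    if parsed.scheme not in ("s3", "s3n", "s3a"):
        raise ValueError("not an S3 URL: %s" % path)
    if not parsed.netloc:
        raise ValueError("empty bucket name in %s" % path)
    return parsed.netloc

# A well-formed path has exactly two slashes before the bucket name.
print(s3_bucket("s3n://my-bucket/y=2016/m=5/d=26/some.json.gz"))
```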
>
> Error:
> 16/07/27 17:27:26 INFO spark.SparkContext: Created broadcast 0 from textFile
> at NativeMethodAccessorImpl.java:-2
> Traceback (most recent call last):
>   File "/root/workspace/test/s3.py", line 12, in <module>
>     count = myRdd.count()
>   File "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1004, in count
>   File "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 995, in sum
>   File "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 869, in fold
>   File "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 771, in collect
>   File "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
>   File "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.lang.VerifyError: Bad type on operand stack
> Exception Details:
> Location:
>     org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.copy(Ljava/lang/String;Ljava/lang/String;)V
>     @155: invokevirtual
> Reason:
> Type 'org/jets3t/service/model/S3Object' (current frame, stack[4]) is
> not assignable to 'org/jets3t/service/model/StorageObject'
> Current Frame:
> bci: @155
> flags: { }
> locals: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore',
> 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object'
> }
> stack: { 'org/jets3t/service/S3Service', 'java/lang/String',
> 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object',
> integer }
> Bytecode:
> 000: b200 fcb9 0190 0100 9900 39b2 00fc bb01
> 010: 5659 b701 5713 0192 b601 5b2b b601 5b13
> 020: 0194 b601 5b2c b601 5b13 0196 b601 5b2a
> 030: b400 7db6 00e7 b601 5bb6 015e b901 9802
> 040: 002a b400 5799 0030 2ab4 0047 2ab4 007d
> 050: 2b01 0101 01b6 019b 4e2a b400 6b09 949e
> 060: 0016 2db6 019c 2ab4 006b 949e 000a 2a2d
> 070: 2cb6 01a0 b1bb 00a0 592c b700 a14e 2d2a
> 080: b400 73b6 00b0 2ab4 0047 2ab4 007d b600
> 090: e72b 2ab4 007d b600 e72d 03b6 01a4 57a7
> 0a0: 000a 4e2a 2d2b b700 c7b1
> Exception Handler Table:
> bci [0, 116] => handler: 162
> bci [117, 159] => handler: 162
> Stackmap Table:
> same_frame_extended(@65)
> same_frame(@117)
> same_locals_1_stack_item_frame(@162,Object[#139])
> same_frame(@169)
>
>   at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:338)
>   at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:328)
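A VerifyError like this one usually means two incompatible jets3t versions are on the classpath: the NativeS3FileSystem in CDH's Hadoop was compiled against a jets3t release whose copyObject takes a StorageObject (0.9.x), while an older jets3t jar (0.6/0.7, which predates StorageObject) is being loaded first. One possible workaround, a sketch only, is to ship the matching jets3t jar explicitly so it wins on the classpath (the jar path below is hypothetical; check which jets3t version your Hadoop build actually expects):

```shell
# Same submit as above, plus an explicit jets3t jar matching the
# Hadoop build (jets3t 0.9.x for Hadoop 2.6 in CDH 5.x).
AWS_ACCESS_KEY_ID=??? AWS_SECRET_ACCESS_KEY=??? ./bin/spark-submit \
  --driver-memory 4G --master spark://master:7077 \
  --jars /opt/jars/jets3t-0.9.0.jar \
  --conf "spark.default.parallelism=70" \
  /root/workspace/test/s3.py
```

Alternatively, if your Hadoop build supports it, the s3a:// connector avoids jets3t entirely.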
>
> .
>
> TIA
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-6-0-read-s3-files-error-tp27417.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>