Hi Jerry, Akhil,

Thanks for your help. With s3n, the entire file is downloaded even when
just creating the RDD with sqlContext.read.parquet().  It seems that even
just opening and closing the InputStream causes the entire object to be
fetched.

As it turned out, I was able to use s3a and avoid this problem.  I was
under the impression that s3a was only meant for use with EMRFS, where the
filesystem metadata is kept separately.  This is not true; s3a maps object
keys directly to file names and directories.
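
For anyone hitting the same issue, the change on my end was just the URI
scheme plus credentials.  A minimal sketch (Spark 1.4-era API; assumes the
hadoop-aws and AWS SDK jars are on the classpath, and the bucket/path names
are placeholders):

```scala
// Configure the S3A filesystem. Credentials can also come from
// environment variables or IAM instance roles instead.
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

// With s3a://, creating the DataFrame only reads the Parquet footer
// and metadata, rather than pulling the whole object as s3n:// did.
val df = sqlContext.read.parquet("s3a://my-bucket/path/to/data.parquet")
```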

On Sun, Aug 9, 2015 at 6:01 AM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Akshat,
>
> Is there a particular reason you don't use s3a? From my experience, s3a
> performs much better than the rest. I believe the inefficiency is in the
> implementation of the s3 interface.
>
> Best Regards,
>
> Jerry
>
> Sent from my iPhone
>
> On 9 Aug, 2015, at 5:48 am, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
> It depends on which operation you are doing. If you do a .count() on a
> Parquet file, it might not download the entire file, I think, but if you
> do a .count() on a plain text file it will likely pull the entire file.
>
> Thanks
> Best Regards
>
> On Sat, Aug 8, 2015 at 3:12 AM, Akshat Aranya <aara...@gmail.com> wrote:
>
>> Hi,
>>
>> I've been trying to track down some problems with Spark reads being very
>> slow with s3n:// URIs (NativeS3FileSystem).  After some digging around, I
>> realized that this filesystem implementation fetches the entire file,
>> which isn't really a Spark problem, but it really slows things down when
>> just reading the footer of a Parquet file or just creating partitions in
>> the RDD.  Is this something that others have observed before, or am I
>> doing something wrong?
>>
>> Thanks,
>> Akshat
>>
>
>