[jira] [Comment Edited] (SPARK-36024) Switch the datasource example due to the depreciation of the dataset

Steve Loughran (Jira) Tue, 06 Jul 2021 02:33:07 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-36024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17375400#comment-17375400
 ]


Steve Loughran edited comment on SPARK-36024 at 7/6/21, 9:32 AM:
-----------------------------------------------------------------

Similar to HADOOP-17784.

I'm "in discussions" with them. Maybe I can persuade them to leave the index 
file up.

And I'd like to move on to a dataset which (a) is stable and (b) has real 
ORC/Parquet data alongside the CSV.

Finally: we need to make sure that this time, no matter how "stable" the source 
is, whoever runs it knows we need it.

Where in the docs is this?

(oh, obviously it'll be something I wrote, won't it...)


> Switch the datasource example due to the depreciation of the dataset
> --------------------------------------------------------------------
>
>                 Key: SPARK-36024
>                 URL: https://issues.apache.org/jira/browse/SPARK-36024
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 3.1.2
>            Reporter: Leona Yoda
>            Priority: Trivial
>
> The S3 bucket that is used as an example in the "Integration with Cloud 
> Infrastructures" document will be deleted on Jul 1, 2021: 
> [https://registry.opendata.aws/landsat-8/]
> The dataset will move to another bucket, but that bucket requires the 
> `--request-payer requester` option, so users have to pay the S3 cost: 
> [https://registry.opendata.aws/usgs-landsat/]
>  
> So I think it's better to change the data source, like this: 
> [https://github.com/yoda-mon/spark/commit/cdb24acdbb57a429e5bf1729502653b91a600022]
>  
> I chose [NYC Taxi data|https://registry.opendata.aws/nyc-tlc-trip-records-pds/] 
> as the example here. 
> Unlike the Landsat data it's not compressed, but it is just an example, and 
> several tutorials use it with Spark (e.g. 
> [https://github.com/aws-samples/amazon-eks-apache-spark-etl-sample]).
>  
> Read test result:
> {code:java}
> scala> sc.textFile("s3a://nyc-tlc/misc/taxi _zone_lookup.csv").take(10).foreach(println)
> "LocationID","Borough","Zone","service_zone"
> 1,"EWR","Newark Airport","EWR"
> 2,"Queens","Jamaica Bay","Boro Zone"
> 3,"Bronx","Allerton/Pelham Gardens","Boro Zone"
> 4,"Manhattan","Alphabet City","Yellow Zone"
> 5,"Staten Island","Arden Heights","Boro Zone"
> 6,"Staten Island","Arrochar/Fort Wadsworth","Boro Zone"
> 7,"Queens","Astoria","Boro Zone"
> 8,"Queens","Astoria Park","Boro Zone"
> 9,"Queens","Auburndale","Boro Zone"
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

