[ 
https://issues.apache.org/jira/browse/HADOOP-17784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373245#comment-17373245
 ] 

Leona Yoda edited comment on HADOOP-17784 at 7/2/21, 6:00 AM:
--------------------------------------------------------------

I checked the Registry of Open Data on AWS ([https://registry.opendata.aws/]); there 
are several datasets in csv.gz format.

 
 * NOAA Global Historical Climatology Network Daily
[https://registry.opendata.aws/noaa-ghcn/]

{code:java}
$ aws s3 ls noaa-ghcn-pds/csv.gz/ --no-sign-request --human-readable
2021-07-02 04:08:17 3.3 KiB 1763.csv.gz 
2021-07-02 04:08:27 3.2 KiB 1764.csv.gz 
... 
2021-07-02 04:09:04 143.1 MiB 2019.csv.gz 
2021-07-02 04:09:04 138.8 MiB 2020.csv.gz 
2021-07-02 04:09:04 66.6 MiB 2021.csv.gz

$ filename="2020.csv.gz"
$ aws s3 cp s3://noaa-ghcn-pds/csv.gz/$filename /tmp --no-sign-request && cat /tmp/$filename | gzip -d | head
AE000041196,20200101,TMIN,168,,,S,
AE000041196,20200101,PRCP,0,D,,S,
AE000041196,20200101,TAVG,211,H,,S,
...

$ wc -l /tmp/$filename
698966 /tmp/2020.csv.gz{code}
The datasets for these years seem large enough.

 * NOAA Integrated Surface Database
 [https://registry.opendata.aws/noaa-isd/]

{code:java}
$ aws s3 ls s3://noaa-isd-pds/ --no-sign-request --human-readable
...
2021-07-02 09:57:30   12.1 MiB isd-inventory.csv.z
2020-07-04 09:24:18  428 Bytes isd-inventory.txt
2021-07-02 09:57:14   13.1 MiB isd-inventory.txt.z
...

$ filename="isd-inventory.csv.z"
$ aws s3 cp s3://noaa-isd-pds/$filename /tmp --no-sign-request && cat /tmp/$filename | gzip -d | head

"USAF","WBAN","YEAR","JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC"
"007018","99999","2011","0","0","2104","2797","2543","2614","382","0","0","0","0","0"
"007018","99999","2013","0","0","0","0","0","0","710","0","0","0","0","0"
...

$ wc -l /tmp/$filename
44296 /tmp/isd-inventory.csv.z{code}
Under the subpath s3://noaa-isd-pds/data/ there are many gzipped files, but 
they are separated by spaces.
 * iNaturalist Licensed Observation Images
 [https://registry.opendata.aws/inaturalist-open-data/]

{code:java}
aws s3 ls s3://inaturalist-open-data/ --no-sign-request --human-readable
                           PRE metadata/
                           PRE photos/
2021-05-20 15:59:08    1.8 GiB observations.csv.gz
2021-05-20 15:54:47    3.8 MiB observers.csv.gz
2021-05-20 16:02:14    3.1 GiB photos.csv.gz
2021-05-20 15:54:52   25.9 MiB taxa.csv.gz

$ filename="taxa.csv.gz"
$ aws s3 cp s3://inaturalist-open-data/$filename /tmp --no-sign-request && cat /tmp/$filename | gzip -d | head
taxon_id ancestry rank_level rank name active
3736 48460/1/2/355675/3/67566/3727/3735 10 species Phimosus infuscatus true
8742 48460/1/2/355675/3/7251/8659/8741 10 species Snowornis cryptolophus true
...
$ wc -l /tmp/$filename
108058 /tmp/taxa.csv.gz

$ filename="observations.csv.gz"
$ aws s3 cp s3://inaturalist-open-data/$filename /tmp --no-sign-request && cat /tmp/$filename | gzip -d | head
observation_uuid observer_id latitude longitude positional_accuracy taxon_id quality_grade observed_on
7d59cfce-7602-4877-a027-80008481466f 354 38.0127535059 -122.5013941526 76553 research 2011-09-03
b5d3c525-2bff-4ab4-ac4d-21c655d0a4d2 505 38.6113711142 -122.7838897705 52854 research 2011-09-04
...
$ wc -l /tmp/$filename
8692639 /tmp/observations.csv.gz{code}
The top-level files look suitable, but they are tab-separated.
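
The delimiter of each candidate file can be read off its header line. A rough sketch of that check follows; `detect_delimiter` is a hypothetical helper name, and the tab, comma, space precedence is an assumption of this sketch, not something taken from the datasets' documentation:

```shell
# Hypothetical helper: guess the field delimiter of a gzip-compressed text
# file from its first line. Heuristic only; a tab wins over a comma, and a
# comma wins over a space.
detect_delimiter() {
    # Decompress to stdout and keep only the header line.
    header=$(gzip -dc "$1" | head -n 1)
    # Literal tab character for the pattern match below.
    tab=$(printf '\t')
    case "$header" in
        *"$tab"*) echo "tab" ;;
        *,*)      echo "comma" ;;
        *" "*)    echo "space" ;;
        *)        echo "unknown" ;;
    esac
}
```

For example, `detect_delimiter /tmp/taxa.csv.gz` could be run against each file downloaded above before deciding whether it can stand in for a comma-separated test file.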

 

cf. LandSat-8
{code:java}
aws s3 ls s3://landsat-pds/ --no-sign-request --human-readable
                           PRE 4ac2fe6f-99c0-4940-81ea-2accba9370b9/
                           PRE L8/
                           PRE a96cb36b-1e0d-4245-854f-399ad968d6d3/
                           PRE c1/
                           PRE e6acf117-1cbf-4e88-af62-2098f464effe/
                           PRE runs/
                           PRE tarq/
                           PRE tarq_corrupt/
                           PRE test/
2017-05-17 22:42:27   23.2 KiB index.html
2016-08-20 02:12:04  105 Bytes robots.txt
2021-07-02 14:52:06   39 Bytes run_info.json
2021-07-02 14:02:06    3.2 KiB run_list.txt
2018-08-29 09:45:15   43.5 MiB scene_list.gz
$ filename="scene_list.gz"
$ aws s3 cp s3://landsat-pds/$filename /tmp --no-sign-request && cat /tmp/$filename | gzip -d | head
entityId,acquisitionDate,cloudCover,processingLevel,path,row,min_lat,min_lon,max_lat,max_lon,download_url
LC80101172015002LGN00,2015-01-02 15:49:05.571384,80.81,L1GT,10,117,-79.09923,-139.66082,-77.7544,-125.09297,https://s3-us-west-2.amazonaws.com/landsat-pds/L8/010/117/LC80101172015002LGN00/index.html
LC80260392015002LGN00,2015-01-02 16:56:51.399666,90.84,L1GT,26,39,29.23106,-97.48576,31.36421,-95.16029,https://s3-us-west-2.amazonaws.com/landsat-pds/L8/026/039/LC80260392015002LGN00/index.html
...


$ wc -l /tmp/$filename
183059 /tmp/scene_list.gz{code}
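One caveat on the `wc -l` figures in the listings above: run directly on a gzipped file, `wc -l` counts newline bytes in the compressed stream, so the numbers do not correspond to the number of records. A minimal sketch of counting decompressed lines instead; `count_rows` is a hypothetical helper name, not part of the original commands:

```shell
# Hypothetical helper: count the lines of a gzip-compressed text file
# after decompression, leaving the file on disk untouched.
count_rows() {
    # gzip -dc writes the decompressed data to stdout;
    # tr strips the padding that some wc implementations add.
    gzip -dc "$1" | wc -l | tr -d ' '
}
```

It would be called as `count_rows /tmp/scene_list.gz`; the result includes the header line.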
 

 


> hadoop-aws landsat-pds test bucket will be deleted after Jul 1, 2021
> --------------------------------------------------------------------
>
>                 Key: HADOOP-17784
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17784
>             Project: Hadoop Common
>          Issue Type: Test
>          Components: fs/s3, test
>            Reporter: Leona Yoda
>            Priority: Major
>
> I found an announcement that the landsat-pds bucket will be deleted on July 1, 2021
> (https://registry.opendata.aws/landsat-8/)
> and I think this bucket is used in the tests of the hadoop-aws module:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/S3ATestConstants.java#L93]
>  
> At this time I can still access the bucket, but we might have to change the test 
> bucket someday.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
