[ https://issues.apache.org/jira/browse/HADOOP-17784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373245#comment-17373245 ]
Leona Yoda edited comment on HADOOP-17784 at 7/2/21, 6:00 AM:
--------------------------------------------------------------

I checked the Registry of Open Data on AWS ([https://registry.opendata.aws/]); there are several datasets in csv.gz format.

* NOAA Global Historical Climatology Network Daily [https://registry.opendata.aws/noaa-ghcn/]
{code:bash}
$ aws s3 ls noaa-ghcn-pds/csv.gz/ --no-sign-request --human-readable
2021-07-02 04:08:17    3.3 KiB 1763.csv.gz
2021-07-02 04:08:27    3.2 KiB 1764.csv.gz
...
2021-07-02 04:09:04  143.1 MiB 2019.csv.gz
2021-07-02 04:09:04  138.8 MiB 2020.csv.gz
2021-07-02 04:09:04   66.6 MiB 2021.csv.gz

$ filename="2020.csv.gz"
$ aws s3 cp s3://noaa-ghcn-pds/csv.gz/$filename /tmp --no-sign-request && cat /tmp/$filename | gzip -d | head
AE000041196,20200101,TMIN,168,,,S,
AE000041196,20200101,PRCP,0,D,,S,
AE000041196,20200101,TAVG,211,H,,S,
...

$ wc -l /tmp/$filename
698966 /tmp/2020.csv.gz
{code}
The datasets for these years seem large enough.

* NOAA Integrated Surface Database [https://registry.opendata.aws/noaa-isd/]
{code:bash}
$ aws s3 ls s3://noaa-isd-pds/ --no-sign-request --human-readable
...
2021-07-02 09:57:30   12.1 MiB isd-inventory.csv.z
2020-07-04 09:24:18  428 Bytes isd-inventory.txt
2021-07-02 09:57:14   13.1 MiB isd-inventory.txt.z
...

$ filename="isd-inventory.csv.z"
$ aws s3 cp s3://noaa-isd-pds/$filename /tmp --no-sign-request && cat /tmp/$filename | gzip -d | head
"USAF","WBAN","YEAR","JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC"
"007018","99999","2011","0","0","2104","2797","2543","2614","382","0","0","0","0","0"
"007018","99999","2013","0","0","0","0","0","0","710","0","0","0","0","0"
...

$ wc -l /tmp/$filename
44296 /tmp/isd-inventory.csv.z
{code}
Under the subpath s3://noaa-isd-pds/data/ there are a lot of gzipped files, but they are space-separated.
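Several of these candidates are space- or tab-delimited rather than comma-delimited. If one of them were adopted anyway, a naive on-the-fly normalization to CSV could look like the sketch below; the file names and sample rows are made up for illustration, and it assumes no field contains an embedded delimiter, quote, or newline:

```shell
# Sketch: normalize a whitespace-delimited file to CSV.
# Naive: assumes no field contains an embedded comma, quote, or newline.
printf 'USAF WBAN YEAR\n007018 99999 2011\n' > /tmp/space_sep.txt

# Assigning $1 to itself forces awk to rebuild each record using OFS,
# so runs of whitespace collapse into single commas.
awk -v OFS=',' '{$1=$1; print}' /tmp/space_sep.txt > /tmp/space_sep.csv

head /tmp/space_sep.csv
```

Real CSV with quoting would need a proper parser, but for the simple numeric tables above this kind of one-liner may be enough.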
* iNaturalist Licensed Observation Images [https://registry.opendata.aws/inaturalist-open-data/]
{code:bash}
$ aws s3 ls s3://inaturalist-open-data/ --no-sign-request --human-readable
                           PRE metadata/
                           PRE photos/
2021-05-20 15:59:08    1.8 GiB observations.csv.gz
2021-05-20 15:54:47    3.8 MiB observers.csv.gz
2021-05-20 16:02:14    3.1 GiB photos.csv.gz
2021-05-20 15:54:52   25.9 MiB taxa.csv.gz

$ filename="taxa.csv.gz"
$ aws s3 cp s3://inaturalist-open-data/$filename /tmp --no-sign-request && cat /tmp/$filename | gzip -d | head
taxon_id  ancestry  rank_level  rank  name  active
3736  48460/1/2/355675/3/67566/3727/3735  10  species  Phimosus infuscatus  true
8742  48460/1/2/355675/3/7251/8659/8741  10  species  Snowornis cryptolophus  true
...

$ wc -l /tmp/$filename
108058 /tmp/taxa.csv.gz

$ filename="observations.csv.gz"
$ aws s3 cp s3://inaturalist-open-data/$filename /tmp --no-sign-request && cat /tmp/$filename | gzip -d | head
observation_uuid  observer_id  latitude  longitude  positional_accuracy  taxon_id  quality_grade  observed_on
7d59cfce-7602-4877-a027-80008481466f  354  38.0127535059  -122.5013941526    76553  research  2011-09-03
b5d3c525-2bff-4ab4-ac4d-21c655d0a4d2  505  38.6113711142  -122.7838897705    52854  research  2011-09-04
...

$ wc -l /tmp/$filename
8692639 /tmp/observations.csv.gz
{code}
The files at the top level look large enough, but they are tab-separated.
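One caveat about the `wc -l` numbers quoted in this survey: run directly on a `.gz` file, `wc -l` counts newline bytes in the compressed stream rather than records, so the figures are only a rough size signal. Decompressing first gives the true record count. A small local illustration (no network or real dataset needed):

```shell
# Build a small gzipped "dataset" locally.
seq 1 1000 > /tmp/sample.csv                    # 1000 one-field records
gzip -c /tmp/sample.csv > /tmp/sample.csv.gz    # compress, keep original

# Counting lines of the compressed bytes yields an arbitrary number...
wc -l < /tmp/sample.csv.gz

# ...while decompressing first yields the real record count of 1000.
gzip -dc /tmp/sample.csv.gz | wc -l
```

So any of the counts above taken on the compressed object would need to be re-measured after decompression before comparing dataset sizes.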
cf. LandSat-8
{code:bash}
$ aws s3 ls s3://landsat-pds/ --no-sign-request --human-readable
                           PRE 4ac2fe6f-99c0-4940-81ea-2accba9370b9/
                           PRE L8/
                           PRE a96cb36b-1e0d-4245-854f-399ad968d6d3/
                           PRE c1/
                           PRE e6acf117-1cbf-4e88-af62-2098f464effe/
                           PRE runs/
                           PRE tarq/
                           PRE tarq_corrupt/
                           PRE test/
2017-05-17 22:42:27   23.2 KiB index.html
2016-08-20 02:12:04  105 Bytes robots.txt
2021-07-02 14:52:06   39 Bytes run_info.json
2021-07-02 14:02:06    3.2 KiB run_list.txt
2018-08-29 09:45:15   43.5 MiB scene_list.gz

$ filename="scene_list.gz"
$ aws s3 cp s3://landsat-pds/$filename /tmp --no-sign-request && cat /tmp/$filename | gzip -d | head
entityId,acquisitionDate,cloudCover,processingLevel,path,row,min_lat,min_lon,max_lat,max_lon,download_url
LC80101172015002LGN00,2015-01-02 15:49:05.571384,80.81,L1GT,10,117,-79.09923,-139.66082,-77.7544,-125.09297,https://s3-us-west-2.amazonaws.com/landsat-pds/L8/010/117/LC80101172015002LGN00/index.html
LC80260392015002LGN00,2015-01-02 16:56:51.399666,90.84,L1GT,26,39,29.23106,-97.48576,31.36421,-95.16029,https://s3-us-west-2.amazonaws.com/landsat-pds/L8/026/039/LC80260392015002LGN00/index.html
...

$ wc -l /tmp/$filename
183059 /tmp/scene_list.gz
{code}
> hadoop-aws landsat-pds test bucket will be deleted after Jul 1, 2021
> --------------------------------------------------------------------
>
> Key: HADOOP-17784
> URL: https://issues.apache.org/jira/browse/HADOOP-17784
> Project: Hadoop Common
> Issue Type: Test
> Components: fs/s3, test
> Reporter: Leona Yoda
> Priority: Major
>
> I found an announcement that the landsat-pds bucket will be deleted on July 1, 2021
> (https://registry.opendata.aws/landsat-8/),
> and this bucket is used in the tests of the hadoop-aws module:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/S3ATestConstants.java#L93]
>
> At this time I can still access the bucket, but we might have to change the test bucket someday.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
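As a postscript on wiring in a replacement: the hadoop-aws tests appear to read the CSV test object from configuration rather than hard-coding it, so a candidate from the survey above could presumably be tried without a code change. The property name below is my reading of `S3ATestConstants` and should be verified against trunk; the target object is just one example candidate:

```xml
<!-- core-site.xml / auth-keys.xml fragment for a hadoop-aws test run.
     Property name assumed from S3ATestConstants (KEY_CSVTEST_FILE);
     verify against trunk before relying on it. -->
<property>
  <name>fs.s3a.scale.test.csv.file</name>
  <value>s3a://noaa-ghcn-pds/csv.gz/2020.csv.gz</value>
</property>
```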