VXQuery was designed to work on large collections of small XML files. Our goal for benchmarking is to find a dataset whose XML files can be used without modification, so that the only setup task is to download the dataset and distribute it across a cluster.
The initial benchmark test will focus on NOAA's National Climatic Data Center (NCDC), which provides the Global Historical Climatology Network (GHCN) data. The dataset includes various daily weather sensor readings from ~90,000 stations across the globe. The amount of data per station varies with how long the station has been active; the oldest stations have data from the 1890s. NCDC offers two methods of accessing the information: dat files and a web service (XML and JSON). I created a script that downloads the dat files and uses them to generate XML files equivalent to those returned by the web service. The web service query being emulated returns one month of data for a single station. The script allows all the historical data to be downloaded once and then processed locally. The station information is offered through a separate web service query and contains more information than is available in the dat files. Thus, I have a second script that downloads the station information separately.
I have also started downloading OpenStreetMap data to evaluate it as another benchmark source. Side question: do you have any ideas for other sources that fit our requirements?
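To make the dat-to-XML conversion described above more concrete, here is a minimal sketch in Python. It is not the actual script. It assumes the dat files follow the fixed-width GHCN-Daily record layout (an 11-character station ID, a 4-digit year, a 2-digit month, a 4-character element code, then 31 daily fields of 8 characters each, with -9999 marking a missing reading). The XML element names (`dataCollection`, `data`, `date`, `station`, `dataType`, `value`) and the function names are hypothetical placeholders and are not taken from the NCDC web service schema.

```python
import xml.etree.ElementTree as ET

MISSING = -9999  # assumed sentinel for "no reading on this day"


def parse_record(line):
    """Split one fixed-width record into its header fields and daily values.

    Assumed layout: station ID (11 chars), year (4), month (2), element (4),
    followed by 31 daily slots of 8 chars (5-char value plus 3 flag chars).
    """
    station = line[0:11]
    year = int(line[11:15])
    month = int(line[15:17])
    element = line[17:21]
    days = []
    for day in range(31):
        start = 21 + day * 8
        field = line[start:start + 5]
        if not field.strip():
            continue  # tolerate short or blank-padded lines
        value = int(field)
        if value != MISSING:
            days.append((day + 1, value))
    return station, year, month, element, days


def records_to_xml(lines):
    """Build one XML document from the records of a single station-month."""
    root = ET.Element("dataCollection")  # hypothetical element names throughout
    for line in lines:
        station, year, month, element, days = parse_record(line)
        for day, value in days:
            data = ET.SubElement(root, "data")
            ET.SubElement(data, "date").text = f"{year:04d}-{month:02d}-{day:02d}"
            ET.SubElement(data, "station").text = station
            ET.SubElement(data, "dataType").text = element
            ET.SubElement(data, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Grouping the records by station and month before calling `records_to_xml` would yield one small XML file per station-month, which matches the granularity of the web service query and the many-small-files workload VXQuery targets.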
