Question: What is our target file size? VXQuery has been designed to work on many small files, but what counts as a small file? Are we talking 64 MB or 64 KB?
Background: This issue came to my attention when I ran out of inodes on one of the nodes while replicating the weather data set. Apparently one node in our cluster has a 2 TB drive and is limited to 132816896 inodes. My naive partitioning method for benchmarking replicated the weather data five times, and that exceeds the number of inodes available.

In researching the issue, we ran the following command to count the number of files:

time find -type f | wc -l

Here are the results:

** I am still waiting after about 4 hours, will update when it's finished **

It seems we have a huge performance hit with my current configuration of weather data. The average file size is probably about 32 KB. The XML documents come from querying a web service provided by NOAA. Each file holds a month's records of sensor data. The concern is how query time is affected by opening and closing so many files.

Options:
1. Treat this as the parameters we defined for our test.
2. Change the amount of data returned by the web service query. Example: query for a year's worth of data, reducing the number of files by a factor of 12.
3. Create a way to store XML files appended together, reducing the number of times a file must be opened and closed.
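To make option 3 more concrete, here is a minimal Python sketch of one way to append several small XML files into a single bundle file. This is only an illustration, not something VXQuery provides: the function name `bundle_xml_files`, the wrapper element names `bundle` and `document`, and the `source` attribute are all made up for this example. It also builds the whole bundle in memory, so a real version would probably need to stream the output and cap the size of each bundle.

```python
import os
import xml.etree.ElementTree as ET


def bundle_xml_files(paths, bundle_path, root_tag="bundle"):
    """Copy the root element of each XML file under one wrapper element.

    Hypothetical helper: element and attribute names are assumptions.
    Returns the number of documents written to the bundle.
    """
    root = ET.Element(root_tag)
    for path in paths:
        doc_root = ET.parse(path).getroot()
        # Keep the original file name so each document stays traceable.
        wrapper = ET.SubElement(
            root, "document", {"source": os.path.basename(path)}
        )
        wrapper.append(doc_root)
    # One output file means one inode and one open/close per bundle.
    ET.ElementTree(root).write(
        bundle_path, encoding="utf-8", xml_declaration=True
    )
    return len(root)
```

Bundling twelve monthly files per sensor this way would give roughly the same file-count reduction as option 2, without changing the web service query. The trade-off is that queries would have to account for the extra wrapper elements in their paths.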
