Question:
What is our target file size? VXQuery has been designed to work on many
small files, but what counts as a small file? Are we talking 64 MB or 64 KB?

Background:
The issue came to my attention when I ran out of inodes on one of the nodes
while replicating the weather data set. Apparently one node in our cluster
has a 2 TB drive and is limited to 132816896 inodes. My naive partitioning
method for benchmarking has replicated the weather data five times, and that
exceeds the number of inodes available.
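
As a rough check on those numbers, the sketch below works out two things:
the average file size below which the drive runs out of inodes before it
runs out of space, and how many files each of the five replicas could hold.
It assumes 2 TB means 2 * 10^12 bytes and that every file uses exactly one
inode, and it ignores inodes taken by directories or by anything else on
the drive. The constant names are made up for illustration.

```python
# Back-of-the-envelope inode budget for the node described above.
DISK_BYTES = 2 * 10**12        # assumption: "2 TB" taken as decimal terabytes
INODE_LIMIT = 132_816_896      # inode limit reported for the drive
REPLICAS = 5                   # copies of the weather data set

# If the average file is smaller than this, inodes run out before disk space.
break_even_bytes = DISK_BYTES / INODE_LIMIT

# Upper bound on files per replica if all five copies must fit (integer division).
max_files_per_replica = INODE_LIMIT // REPLICAS

if __name__ == "__main__":
    print(f"break-even average file size: {break_even_bytes:.0f} bytes")
    print(f"max files per replica: {max_files_per_replica}")
```

Under these assumptions the break-even point is roughly 15 KB per file, and
each replica would have to stay under about 26.5 million files.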

In researching the issue, we ran the following command to count the number
of files:
   time find -type f | wc -l
Here are the results:
  ** I am still waiting after about 4 hours; I will update when it's finished **
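
While the find is running, the filesystem's own counters may give a quicker
answer. Below is a minimal sketch using Python's os.statvfs (Unix only),
which reports inode totals for the filesystem that contains a given path.
The used count includes directories and every other file on that
filesystem, so it is an upper bound for the data set, not an exact file
count, and some filesystems report zero for these fields. The function name
and the path "." are placeholders.

```python
import os


def inode_usage(path):
    """Return (total, used, free) inode counts for the filesystem holding path."""
    stats = os.statvfs(path)
    total = stats.f_files   # total inodes on the filesystem
    free = stats.f_ffree    # free inodes
    return total, total - free, free


if __name__ == "__main__":
    total, used, free = inode_usage(".")
    print(f"inodes total={total} used={used} free={free}")
```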

It seems we take a huge performance hit with my current configuration of
the weather data. The average file size is probably about 32 KB. The XML
documents come from querying a web service provided by NOAA. Each file
holds one month of sensor records.

The concern is how the query time is affected by the act of opening and
closing so many files.
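
One way to get a first feel for that cost is a small experiment that reads
the same bytes twice: once as many small files and once as a single
combined file. The sketch below is only that, a sketch. It uses dummy
payloads instead of real XML, the file names and default sizes are
arbitrary, and because the files were just written they will be served from
the page cache, so it says nothing about cold-disk behavior or about
VXQuery itself.

```python
import os
import tempfile
import time


def write_test_data(root, file_count, file_size):
    """Write the same payload as many small files and as one combined file."""
    payload = b"x" * file_size
    small_paths = []
    for i in range(file_count):
        path = os.path.join(root, f"doc_{i:06d}.xml")
        with open(path, "wb") as f:
            f.write(payload)
        small_paths.append(path)
    combined_path = os.path.join(root, "combined.bin")
    with open(combined_path, "wb") as f:
        for _ in range(file_count):
            f.write(payload)
    return small_paths, combined_path


def read_small_files(paths):
    """Open, read, and close every file; return the total bytes read."""
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            total += len(f.read())
    return total


def read_combined(path, chunk_size):
    """Read one file in fixed-size chunks; return the total bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def compare(file_count=1000, file_size=32 * 1024):
    """Return (small_bytes, small_seconds, combined_bytes, combined_seconds)."""
    with tempfile.TemporaryDirectory() as root:
        small_paths, combined_path = write_test_data(root, file_count, file_size)
        start = time.perf_counter()
        small_bytes = read_small_files(small_paths)
        small_seconds = time.perf_counter() - start
        start = time.perf_counter()
        combined_bytes = read_combined(combined_path, file_size)
        combined_seconds = time.perf_counter() - start
    return small_bytes, small_seconds, combined_bytes, combined_seconds


if __name__ == "__main__":
    sb, ss, cb, cs = compare()
    print(f"many small files: {sb} bytes in {ss:.4f} s")
    print(f"one combined file: {cb} bytes in {cs:.4f} s")
```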

Options:
1. Keep the current layout and treat it as the parameters we defined for
our test.
2. Change the amount of data returned by each web service query. Example:
Query for a year's worth of data, thus reducing the number of files by a
factor of 12.
3. Create a way to store XML files appended together, thus reducing the
number of times a file must be opened and closed.
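
To make option 3 concrete, here is one possible way to append documents
together: write each XML document into a bundle file behind an 8-byte
length prefix, so a reader can walk the documents with a single open and
close. This format is purely hypothetical and is not something VXQuery
reads today; the function names and layout are made up for illustration.

```python
import struct

# Hypothetical record layout: 8-byte big-endian length, then the raw XML bytes.
HEADER = struct.Struct(">Q")


def pack_documents(doc_paths, bundle_path):
    """Append every document in doc_paths to a single bundle file."""
    with open(bundle_path, "wb") as out:
        for path in doc_paths:
            with open(path, "rb") as src:
                data = src.read()
            out.write(HEADER.pack(len(data)))
            out.write(data)


def iter_documents(bundle_path):
    """Yield each document's bytes in order, opening the bundle only once."""
    with open(bundle_path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if not header:
                return  # clean end of bundle
            if len(header) < HEADER.size:
                raise ValueError("truncated record header")
            (length,) = HEADER.unpack(header)
            data = f.read(length)
            if len(data) != length:
                raise ValueError("truncated record body")
            yield data
```

The length prefix avoids having to strip XML declarations or invent a
wrapper root element, at the price of the bundle no longer being a plain
XML file.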
