FYI on the weather data. The Weather Web Service offers weather data through queries to its website. In researching the possible data queries, I have found the way to get the largest amount of real data in a single query. The site limits each data query to a single month, and results are paged when a station has more than four sensors to report. The resulting XML document is at most 32 KB; sizes vary with the number and size of data points for each sensor, but the upper bound is around 32 KB.
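For reference, here is a rough sketch of what a single monthly fetch could look like. The endpoint, station id, and query parameter names are placeholders for illustration only; they are not the service's actual API.

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class FetchMonth {
        public static void main(String[] args) throws Exception {
            // Placeholder station id and endpoint -- not the real service URL.
            String station = "STATION_12345";
            String url = "https://example-weather-service.gov/data"
                    + "?station=" + station
                    + "&start=2003-01-01&end=2003-01-31"; // the site limits each query to one month
            try (InputStream in = new URL(url).openStream()) {
                // Each response is a single XML document of at most ~32 KB.
                Files.copy(in, Paths.get(station + "_2003-01.xml"),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }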
Here are the file size averages for the sections of the larger dataset measured so far:

  Size      # of Files   Average File Size
  ~50 MB    7,476        7 KB
  ~500 MB   30,982       17 KB

I am still setting up the larger ~8.5 GB test section. The average file size differs between sections based on the sensors chosen, and each of the larger datasets includes the smaller versions. I am also working on a new way of partitioning the data with symbolic links to get around my inode issue; for now I am sticking with the real data and working around the inode limit.

On Thu, Jan 16, 2014 at 9:07 PM, Eldon Carman <[email protected]> wrote:

> On Thu, Jan 16, 2014 at 8:41 PM, Vinayak Borkar <[email protected]> wrote:
>
>> On 1/16/14, 3:41 PM, Eldon Carman wrote:
>>
>>> Question:
>>> What is our target file size? VXQuery has been designed to work on many
>>> small files, but what is a small file? Are we talking 64 MB or 64 KB?
>>
>> The restriction is on the size of objects (or documents). In VXQuery, each
>> document has to fit in a frame under the current implementation, and since
>> one xml file contains one XML document, this translates to file sizes. I
>> think we should do Option 3 from your mail below and support files that
>> have multiple documents concatenated and stored in the same file (this
>> should be fine since the collection function returns a collection of items).
>
> OK, let's discuss this more. Some of the new rewrite rules for pushing the
> child steps into the data source scan may also help with processing larger
> files while keeping our frame size relatively small.
>
>>> Background:
>>> The issue came to my attention when I ran out of inodes on one of the
>>> nodes while replicating the weather data set. Apparently one node in our
>>> cluster has a 2 TB drive and is limited to 132,816,896 inodes. My naive
>>> partitioning method
>>
>> Do you mean 2GB?
>
> Let me clarify:
> Most nodes have a 3 TB drive with a limit of 182,591,488 inodes.
> I found that one node had its drive replaced. On that node we have a 2 TB
> drive with a limit of 132,816,896 inodes. The weather data caused that drive
> to exceed its roughly 130 million available inodes.
>
>> Vinayak
>
>>> for benchmarking has replicated the weather data five times, and that
>>> exceeds the number of inodes available.
>>>
>>> In researching the issue, we ran the following command to count the
>>> number of files:
>>>   time find -type f | wc -l
>>> Here are the results:
>>> ** I am still waiting after about 4 hours; I will update when it finishes **
>>>
>>> It seems we take a huge performance hit with my current configuration of
>>> the weather data. The average file size is probably 32 KB. The XML
>>> documents come from querying a web service provided by NOAA. Each file
>>> holds a month's worth of sensor data records.
>>>
>>> The concern is how query time is affected by opening and closing so many
>>> files.
>>>
>>> Options:
>>> 1. Treat this as the parameters we defined for our test.
>>> 2. Change the amount of data returned by a web service query. Example:
>>>    query for a year's worth of data, thus reducing the number of files by
>>>    a factor of 12.
>>> 3. Create a way to store XML files appended together, thus reducing the
>>>    number of times a file must be opened and closed.
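Regarding Option 3 in the quoted thread above, here is a minimal sketch of appending the monthly XML documents into one file per partition. The directory and file names are placeholders, not our actual layout, and the separator handling is only illustrative.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.stream.Stream;

    public class ConcatXmlFiles {
        public static void main(String[] args) throws IOException {
            Path inputDir = Paths.get("/data/weather/monthly");        // placeholder path
            Path outFile = Paths.get("/data/weather/partition_0.xml"); // placeholder path
            try (Stream<Path> monthly = Files.list(inputDir).sorted();
                 OutputStream out = Files.newOutputStream(outFile,
                         StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
                for (Path p : (Iterable<Path>) monthly::iterator) {
                    out.write(Files.readAllBytes(p)); // one small XML document per input file
                    out.write('\n');                  // newline between concatenated documents
                }
            }
        }
    }

A query over one partition file would then open a single file instead of thousands, which is the point of Option 3; the collection function would just need to iterate over the documents stored inside it.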
