On Thu, Jan 16, 2014 at 8:41 PM, Vinayak Borkar <[email protected]> wrote:

> On 1/16/14, 3:41 PM, Eldon Carman wrote:
>
>> Question:
>> What is our target file size? VXQuery has been designed to work on many
>> small files, but what is a small file? Are we talking 64mb or 64kb?
>>
>
> The restriction is on the size of objects (or documents). In VXQuery each
> document has to fit in a frame as per the current implementation and since
> one xml file contains an XML document, this translates to file sizes. I
> think we should do Option 3 in your mail below and support files that have
> multiple documents concatenated and stored in the same file (This should be
> fine since the collection function returns a collection of items).
>
>
OK, let's discuss this more. Some of the new rewrite rules for pushing the child
steps into the data source scan may also help with processing larger files,
while keeping our frame size relatively small.
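
To make Option 3 a bit more concrete, here is a rough sketch of the
data-preparation side only. It is not VXQuery code: pack_documents and
iter_documents are hypothetical helper names, and the layout (root elements
written back to back, XML declarations dropped) is just one assumption about
what "multiple documents concatenated" could mean. The reader wraps the
stream in a synthetic <wrapper> element so a standard parser accepts it.

```python
import xml.etree.ElementTree as ET


def pack_documents(paths, out_path):
    """Write the root element of each XML file into one file, back to back."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            root = ET.parse(path).getroot()
            # encoding="unicode" returns a str without an XML declaration,
            # so the documents can simply follow one another.
            out.write(ET.tostring(root, encoding="unicode"))
            out.write("\n")


def iter_documents(packed_path, chunk_size=64 * 1024):
    """Yield each concatenated document as an Element, reading in chunks."""
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0

    def drain():
        nonlocal depth
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
            else:
                depth -= 1
                # depth == 1 means a direct child of the synthetic wrapper,
                # i.e. one complete original document.
                if depth == 1:
                    yield elem

    # The synthetic wrapper (an assumption of this sketch) makes the
    # sequence of documents look like a single well-formed document.
    parser.feed("<wrapper>")
    with open(packed_path, encoding="utf-8") as f:
        for chunk in iter(lambda: f.read(chunk_size), ""):
            parser.feed(chunk)
            yield from drain()
    parser.feed("</wrapper>")
    yield from drain()
```

One file open then serves many documents, while each yielded element is
still a single small document. A real implementation would also need to
release parsed elements as it goes, which this sketch does not do.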


>
>
>> Background:
>> The issue came to my attention when I ran out of inodes on one of the nodes
>> while replicating the weather data set. Apparently one node in our cluster
>> has a 2 TB drive and is limited to 132816896 inodes. My naive partitioning method
>>
>
> Do you mean 2GB?
>

Let me clarify:
Most nodes have a 3TB drive with a limit of 182,591,488 inodes.
I found one had a drive replaced. On that node we have a 2TB drive with a
limit of 132,816,896 inodes. The weather data had caused the drive to
exceed the 130 million inodes.
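
As an aside, inode totals like these can be read from the filesystem
directly, without walking the directory tree. A minimal sketch, assuming a
Unix-like system where os.statvfs is available; inode_usage and
copies_that_fit are hypothetical helper names, and some filesystems report
0 for the inode total.

```python
import os


def inode_usage(path):
    """Return (total, used, free) inode counts for the filesystem holding path."""
    st = os.statvfs(path)
    total = st.f_files   # total inodes on the filesystem
    free = st.f_ffree    # free inodes (including any reserved for root)
    return total, total - free, free


def copies_that_fit(free_inodes, files_per_copy):
    """How many more copies of a data set fit, counting one inode per file."""
    if files_per_copy <= 0:
        raise ValueError("files_per_copy must be positive")
    return free_inodes // files_per_copy


if __name__ == "__main__":
    total, used, free = inode_usage(".")
    print(f"inodes: total={total} used={used} free={free}")
```

Feeding the free count and the number of files in one replica into
copies_that_fit gives a quick check before replicating again. Directories
also consume inodes, so the result is an upper bound.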


>
> Vinayak
>
>
>  for benchmarking has replicated the weather data five times and that
>> exceeds the number of inodes available.
>>
>> In researching the issue, we ran the following command to count the number
>> of files:
>>     time find -type f | wc -l
>> Here are the results:
>>    ** I am still waiting after about 4 hours; will update when it's finished **
>>
>> It seems we have a huge performance hit for my current configuration of
>> weather data. The average file size is probably 32 KB. The XML documents
>> come from querying a web service provided by NOAA. Each file holds a
>> month's records of sensor data.
>>
>> The concern is how the query time is affected by the act of opening and
>> closing so many files.
>>
>> Options:
>> 1. Treat this as the parameters we defined for our test.
>> 2. Change the amount of data returned by the web service query. Example:
>> Query for a year's worth of data, thus reducing the number of files by a
>> factor of 12.
>> 3. Create a way to store XML files appended together, thus reducing the
>> number of times a file must be opened and closed.
>>
>>
>
