Good questions. I didn't realize anyone would think it through so much :-).
Answers are inline. I appreciate your help.

Pradeep
--------

Are you going to process the strings-and-associated-integers all at once in
a run-it-once task, then distribute an on-disk rendition of the info for
access later?  Or does the data structure need to be one that's able to be
updated on the fly as the strings come in (possibly while the structure is
queried to retrieve data), and it will eventually have to deal with 100
million string-and-ints items?

Answer: The program is an analytics application. The data is fetched just
once from a database and indexed. There is no need to fetch the data again
until the user explicitly requests it.

What does it mean, in your case, to "index...strings...associated with a
list of integers"?  What do you need to be able to do after the strings have
been indexed?
 - display (or print) the strings in alpha order (with or without their
associated integers)
 - count # occurrences of each distinct string (or are there 100 million
distinct strings?)
 - get associated integers based on exact match only, or case-insensitive
match, or either
 - search for individual words or phrases (case-sensitively?) within the
strings
 - find strings that are associated with some particular integer value

Ans: The users will specify regular expressions to search for strings.
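
A minimal sketch of what a regex query over such an index could look like,
assuming the strings are unique and the whole index fits in memory. Python is
used only for brevity; build_index and search are hypothetical names, and the
linear scan is the simplest possible strategy, not a tuned one.

```python
import re

def build_index(rows):
    """Map each unique string to its list of associated integers."""
    index = {}
    for text, numbers in rows:
        index[text] = list(numbers)
    return index

def search(index, pattern):
    """Return {string: integers} for every string the regex matches anywhere."""
    compiled = re.compile(pattern)  # compile once, reuse for every string
    return {text: nums for text, nums in index.items() if compiled.search(text)}

# Made-up sample data, for illustration only.
rows = [("alpha beta", [1, 2]), ("gamma", [3]), ("beta delta", [4, 5, 6])]
index = build_index(rows)
hits = search(index, r"\bbeta\b")
```

A scan like this touches every string on every query, so at 100 million
strings it would probably need a prefilter of some kind in front of it.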

Can there be duplicate strings that have different associated sets of
integers, and you need to be able to get all the integer_sets for a
particular string?

Ans: no

What range are the integers (16-bit, 32-bit; are they signed)?

Ans: Unsigned. 32-bit.

How many integers are associated with the typical string?

Ans: Depends on the input data.

(Do you need variable-length storage of the groups of integers, or can you
decide you'll store e.g. 8 integers for each string and have that be a
rational approach?)

Ans: Variable length.
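
One way to hold variable-length groups of unsigned 32-bit integers compactly
is a single flat array of values plus an array of offsets, so that no group
needs its own fixed-size slot. The sketch below assumes that layout; IntLists
and its methods are hypothetical names, and it also assumes the platform's
array typecode "I" is 4 bytes wide, which is typical but not guaranteed.

```python
from array import array

class IntLists:
    """Variable-length integer groups stored in one flat array."""

    def __init__(self):
        self.values = array("I")        # unsigned int; assumed to be 32-bit here
        self.offsets = array("Q", [0])  # offsets[i]..offsets[i+1] bounds group i

    def append(self, numbers):
        """Store one group and return its id."""
        self.values.extend(numbers)     # raises OverflowError if a value does not fit
        self.offsets.append(len(self.values))
        return len(self.offsets) - 2

    def get(self, group_id):
        """Return the integers of a previously stored group."""
        start = self.offsets[group_id]
        end = self.offsets[group_id + 1]
        return list(self.values[start:end])

store = IntLists()
first = store.append([1, 2, 3])
empty = store.append([])
large = store.append([4294967295])      # largest unsigned 32-bit value
```

Each string in the index would then carry only a group id rather than its own
list object.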

Does the amount of disk space used for the sets of integers matter much?

Ans: Not really.

Multiple tasks / threads querying the data at the same time?

Ans: Yes

Queries coming in while updates take place?

Ans: No
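
Since queries never overlap with updates, the index can be built once and
then shared read-only, so readers need no locking. A minimal sketch of that
pattern, assuming an in-memory map; freeze and search are hypothetical names,
and in CPython the threads mostly take turns rather than run in parallel, so
this shows only the sharing pattern, not a throughput gain.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from types import MappingProxyType

def freeze(index):
    """Return a read-only view with immutable integer groups."""
    return MappingProxyType({text: tuple(nums) for text, nums in index.items()})

def search(index, pattern):
    """Return the sorted strings that the regex matches anywhere."""
    compiled = re.compile(pattern)
    return sorted(text for text in index if compiled.search(text))

# Made-up sample data, for illustration only.
frozen = freeze({"red apple": [1], "green pear": [2, 3], "red pear": [4]})
patterns = ["^red", "pear$", "z"]

# Several queries run against the same frozen index without any locks.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda p: search(frozen, p), patterns))
```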

===================================
This list is hosted by DevelopMentor®  http://www.develop.com

View archives and manage your subscription(s) at http://discuss.develop.com
