[algogeeks] Re: Finding a number from 300million list with constraint on memory.

Sanket Mon, 27 Jun 2011 09:03:56 -0700

That is true Dave. Since the numbers are random, we could very well
have that scenario. To reduce this possibility, we can reverse the
allocation - 1.5MB for the array and 500KB for the input. Since the
input is going to be read sequentially, we would need to swap only
after exhausting all the numbers in the 500KB block. However, even
after this, in the worst worst case, the possibility you bring up
would exist :-)
The other thing to consider is the time it would take to read the
entire input again, and the related I/O swapping for the input as well
as the array (if we are using your approach).


On Jun 26, 2:48 pm, Dave <dave_and_da...@juno.com> wrote:
> @Sankeet: How do you know that you wouldn't do more i/o for swapping
> than just reading the data twice? Since the input numbers are not
> sorted, it seems that you could be swapping a different block in for
> every number.
>
> Dave
>
> On Jun 26, 2:12 pm, Sanket <vasa.san...@gmail.com> wrote:
>
> > A small tweak (and possible optimization) to Dave's algorithm mighe be
> > to use bit based array. That is, have an array of 125000000 byte
> > values where for each byte, the 8 bits represent a flag of whether
> > that particular number is present in the file or not.
> > So, a[0] would indicate whether numbers 100000000 - 100000007 are
> > present in the file or not.
> > Now, we scan the file number by number, and for each number scanned,
> > we convert the number into an index into the array and set the
> > corresponding bit.
> > After the entire file has been scanned, the task of finding a number
> > not present in the file simply is to scan the array and find a bit
> > which has not been set.
> > To overcome the RAM limitation, we can keep 500KB of space for a chunk
> > of the array which would be swapped in/out based on the values read
> > from the file and the remaining 1.5MB would be used for the file
> > contents.
>
> > On Jun 23, 11:24 pm, Douglas Diniz <dgdi...@gmail.com> wrote:
>
> > > Well, I understood that you have a number (say x) and you want if the
> > > number x is not in the file. Atul wrote:
>
> > > "The numbers are all distinct. they may vary from 100,000,000 to
> > > 999,999,999 and there are roughly 300 million (ie. 300 * 1000,000
> > > numbers approx.). So there are 600*1000,000 number which are not in
> > > the file and we are asked to return any one these 600*1000,000 numbers
> > > given the memory constraints. Moreover the numbers are in random
> > > order. "
>
> > > And in the first email:
>
> > > "Find a 9-digit number that isn't present in the file."
>
> > > The description is ambiguous. But I think you are right. My
> > > interpretation was wrong.
>
> > > On Fri, Jun 24, 2011 at 1:06 AM, Dave <dave_and_da...@juno.com> wrote:
> > > > @Douglas: I was assuming that the file contains the numbers in ASCII,
> > > > and that normal file buffering is being used. If you think that I need
> > > > to include the buffers in the given storage, then fine: just use half
> > > > for buffers and half for the array of counters. The only change to the
> > > > algorithm would be to replace 1000000 everywhere with 500000.
>
> > > > I find your description below a bit confusing, where you say that you
> > > > will search for the number sequentially. What number are you searching
> > > > for? The problem is to find a number that is not in the file, not to
> > > > check to see if a given number is not in the file. Suppose that the
> > > > file contains account numbers; the problem is to find an unused
> > > > account number. So how do you search sequentially through a file to
> > > > find a number that is not in the file? If I have misinterpreted what
> > > > you wrote, please present your entire algorithm.
>
> > > > Dave
>
> > > > On Jun 23, 10:18 pm, Douglas Diniz <dgdi...@gmail.com> wrote:
> > > >> Dave, I think you method is very slow. If you have a array of 1000000
> > > >> of 16 bits integers this mean that all the Ram is in use, and you will
> > > >> need at least 300x10^6 disk access, because you will need the read the
> > > >> numbers one by one from the file, and this will take a LOT of time.
> > > >> And your method use the module operation % for every number, which is 
> > > >> slow.
> > > >> I think the basic algorithm of read blocks of 2MB to Ram and searching
> > > >> for the number sequentially until you find the number or reach the end
> > > >> of file, is some order of magnitude better, because you read the
> > > >> blocks linearly from the disk, which is much, much better than read
> > > >> one by one, and you dont have to calculate the %.
>
> > > >> On Thu, Jun 23, 2011 at 11:24 PM, Dave <dave_and_da...@juno.com> wrote:
> > > >> > @Atul: Here is one way to do it:
>
> > > >> > Treat the array as a million 16-bit integers: short a[1000000].
>
> > > >> > Set the array to zero.
>
> > > >> > Read through the file, and for each number k, increment a[k%1000000].
> > > >> > Since the numbers are distinct, there can be at most 900 numbers in
> > > >> > each bin.
>
> > > >> > Find j such that a[j] < 900. You now know that there is a missing
> > > >> > number with the six-digit number j as the final six digits.
>
> > > >> > Set a[100] to a[999] to zero.
>
> > > >> > Read through the array again. Ignore any number k whose last six
> > > >> > digits are not j. For the remaining numbers, set a[k/1000000] = 1.
>
> > > >> > Find i, with 100 <= i <= 999, such that a[i]==0. This gives you the
> > > >> > first three digits of a missing number.
>
> > > >> > Put the first thee digits together with the last six digits: k =
> > > >> > 1000000*i + j to get a number that is not in the file.
>
> > > >> > Dave
>
> > > >> > On Jun 23, 7:59 am, atul purohit <gonewiththe...@gmail.com> wrote:
> > > >> >> Hi,
>
> > > >> >> Can someone explain me how to do this.
>
> > > >> >> Given a file containing roughly 300 million 9-digit numbers. Find a 
> > > >> >> 9-digit
> > > >> >> number that isn't present in the file.You have unlimited drive 
> > > >> >> space but
> > > >> >> only 2MB of RAM available.
>
> > > >> >> I suppose we have to use bit-vectors n all but still couldnt figure 
> > > >> >> out a
> > > >> >> proper way. A few steps or an algo will be great.
>
> > > >> >> Cheers,
> > > >> >> Atul
>
> > > >> > --
> > > >> > You received this message because you are subscribed to the Google 
> > > >> > Groups "Algorithm Geeks" group.
> > > >> > To post to this group, send email to algogeeks@googlegroups.com.
> > > >> > To unsubscribe from this group, send email to 
> > > >> > algogeeks+unsubscr...@googlegroups.com.
> > > >> > For more options, visit this group 
> > > >> > athttp://groups.google.com/group/algogeeks?hl=en.
>
> > > >> --
> > > >> -------------------------------------------------------------------
> > > >> Douglas Gameiro Diniz
> > > >> Engenheiro de Computação - 2003 - UNICAMP
>
> > > >> Mobile: (19) 92158777
> > > >> Gtalk: dgdiniz
> > > >> Msn: thedougdi...@hotmail.com- Hide quoted text -
>
> > > >> - Show quoted text -
>
> > > > --
> > > > You received this message because you are subscribed to the Google 
> > > > Groups "Algorithm Geeks" group.
> > > > To post to this group, send email to algogeeks@googlegroups.com.
> > > > To unsubscribe from this group, send email to 
> > > > algogeeks+unsubscr...@googlegroups.com.
> > > > For more options, visit this group 
> > > > athttp://groups.google.com/group/algogeeks?hl=en.
>
> > > --
> > > -------------------------------------------------------------------
> > > Douglas Gameiro Diniz
> > > Engenheiro de Computação - 2003 - UNICAMP
>
> > > Mobile: (19) 92158777
> > > Gtalk: dgdiniz
> > > Msn: thedougdi...@hotmail.com- Hide quoted text -
>
> > - Show quoted text -

-- 
You received this message because you are subscribed to the Google Groups 
"Algorithm Geeks" group.
To post to this group, send email to algogeeks@googlegroups.com.
To unsubscribe from this group, send email to 
algogeeks+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/algogeeks?hl=en.

[algogeeks] Re: Finding a number from 300million list with constraint on memory.

Reply via email to