reposting since it didn't appear yesterday, apologies
A small variation of Vishal's algorithm that eliminates the need for the
bitmap. I use a hash table of integers indexed by keyword, initialized
to all 0. I also make use of the property that in the shortest summary
the first and the last keyword each appear exactly once.
1. foreach word in document
       if (word is a keyword)
           hash[word]++;   // by the end of this loop we have the
                           // frequency of every keyword
2. Do a preliminary check that every hash[keyword] is > 0. If at least
   one has frequency 0, stop and return null to indicate that no
   summary exists.
3. startp = pointer to first word, endp = pointer to last word
4. for (; startp != endp; startp++)
       if (*startp is not a keyword)
           continue;               // non-keywords can always be trimmed
       if (hash[*startp] == 1)     // stop at a keyword whose remaining
           break;                  // frequency is 1
       hash[*startp]--;
   // by the end of this loop, startp points to the first word of the
   // summary, i.e. a keyword that appears in it exactly once
5. for (; startp != endp; endp--)
       if (*endp is not a keyword)
           continue;
       if (hash[*endp] == 1)       // stop at a keyword whose remaining
           break;                  // frequency is 1
       hash[*endp]--;
   // by the end of this loop, endp points to the last word of the summary
6. return the summary, which runs from startp to endp
Worst case is O(2N) = O(N). Also, if more than one shortest summary is
present, this will return the last one.
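For concreteness, here is a minimal C++ sketch of the trimming steps
above. The function name and the std::unordered_map/std::optional
choices are mine; the description only assumes some O(1) hash table.

    #include <cstddef>
    #include <optional>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <utility>
    #include <vector>

    // Steps 1-6 above: count keyword frequencies, then trim redundant
    // words from both ends until each boundary word is a keyword whose
    // remaining frequency is 1.
    std::optional<std::pair<std::size_t, std::size_t>>
    shortestSummary(const std::vector<std::string>& doc,
                    const std::unordered_set<std::string>& keywords)
    {
        if (doc.empty()) return std::nullopt;

        std::unordered_map<std::string, int> freq;       // step 1
        for (const auto& w : doc)
            if (keywords.count(w))
                ++freq[w];

        if (freq.size() != keywords.size())              // step 2: a keyword
            return std::nullopt;                         // never occurs

        std::size_t start = 0, end = doc.size() - 1;     // step 3
        for (; start != end; ++start) {                  // step 4: trim front
            auto it = freq.find(doc[start]);
            if (it == freq.end()) continue;              // non-keyword: trimmable
            if (it->second == 1) break;                  // essential occurrence
            --it->second;
        }
        for (; start != end; --end) {                    // step 5: trim back
            auto it = freq.find(doc[end]);
            if (it == freq.end()) continue;
            if (it->second == 1) break;
            --it->second;
        }
        return std::make_pair(start, end);               // step 6
    }

Each trim loop visits every word at most once, which is where the
O(2N) bound comes from.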
-Shrenik
On Oct 5, 6:34 am, MartinH [EMAIL PROTECTED] wrote:
{I already posted this yesterday but it did not appear: apologies if
this is a duplicate}
This looks to be a hard-to-solve and nastily realistic problem. By
nastily realistic I mean it looks like the kind of thing we might be
asked to code by lunchtime tomorrow.
As daizisheng wrote, there is nothing wrong with perfect hashing to
identify keywords, but it may be impractical to implement. With just
'a'..'z' allowed in keywords, maximal data compression and a maximum
keyword length of 10, the hash table needs about 2^48 entries. With
full 8-bit or ISO character sets its size becomes completely untenable.
Obviously not all locations represent real words, but any hash function
that takes this into account to compress the range won't be O(1) and/or
can't be guaranteed perfect.
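To see where the 2^48 comes from: a maximally compressed index can
treat each of the up-to-10 positions as one of 27 values ('a'..'z' or
absent), and 27^10 ~ 2^47.6 just fits in 48 bits. A sketch, with the
function name my own:

    #include <cstdint>
    #include <string>

    // Injective index for lowercase words of length <= 10: digits 1..26
    // in base 27, with an implicit 0 for "no character", so e.g. "a" and
    // "aa" never collide. The largest value is 27^10 - 1 < 2^48.
    std::uint64_t compressedIndex(const std::string& word)
    {
        std::uint64_t idx = 0;
        for (char c : word)
            idx = idx * 27 + static_cast<std::uint64_t>(c - 'a' + 1);
        return idx;
    }

A table indexed this way is a perfect hash with O(1) lookups, but it
needs all 2^48 slots, which is exactly the size objection above.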
Similar objections apply to using a hash table to assign integers to
words. If collisions are allowed, it is not O(1); we might just as well
have hashed in the keywords first and accepted worse than O(1). As
Vishal suggested, a trie looks more realistic. Building the trie can be
done in O(m), with m the total number of characters in the keywords.
Identifying whether a document character is part of a keyword candidate
is then O(1).
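As a concrete illustration of the trie idea (the node layout and names
are mine, assuming lowercase keywords):

    #include <memory>
    #include <string>

    // One node per keyword prefix. Building the trie visits each keyword
    // character once, giving the O(m) construction cost.
    struct TrieNode {
        std::unique_ptr<TrieNode> child[26];
        bool isKeyword = false;
    };

    void insert(TrieNode& root, const std::string& keyword)
    {
        TrieNode* n = &root;
        for (char c : keyword) {
            auto& next = n->child[c - 'a'];
            if (!next) next = std::make_unique<TrieNode>();
            n = next.get();
        }
        n->isKeyword = true;
    }

    // While scanning the document, advance one node per character: O(1).
    // A null result means the current run of letters cannot be a keyword.
    const TrieNode* step(const TrieNode* n, char c)
    {
        return n ? n->child[c - 'a'].get() : nullptr;
    }

When a word ends on a node with isKeyword set, a keyword has been
identified; the per-character cost stays constant no matter how many
keywords there are.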
You asked for O(n) in the length of the sequence. I think n can be
taken as the number of characters in the document. That is fair because
we have to iterate through all the characters to find the positions of
words and keywords. Anyway, this is Big O, so we can argue that
O(n) = O(doc_words), i.e. n = K * doc_words with K the average word
length in the document.
Having got O(1) for keyword identification, to get O(n) we need a
scanning algorithm containing a simple loop that advances a pointer to
the next current character in the document - or do we? Andrey's listing
has a nested loop to do trimming that rechecks words already checked in
the main loop. The algorithm also has a tendency to find longer and
longer sequences with the same start word that clearly cannot provide a
new minimum. I'm no expert, but does this give O(n+m)?
A simple loop can be had by maintaining an advancing list of candidate
keyword locations. By advancing I mean that all words referenced from
the list are as far from the start of the document as logically
possible. Say the keywords are a,b,c,d: suppose a,b,c have been
identified and another b is found - the candidate list becomes a,c,b.
Now a is found and the candidate list becomes c,b,a. Any subsequence
starting with the old word a must be longer than one starting with word
c. If d is found next, completing the set, the subsequent candidate
list is b,a,d.
Doing this avoids maintaining a list of all keyword locations from the
first word found in a candidate subsequence and then using a nested
loop to advance a pointer when a complete set is found. Deletion when a
duplicate occurs can be done by maintaining references from the
relevant trie nodes to the list elements. List-to-trie-node references
are also needed to reset a node's reference to null when the oldest
element is deleted after a complete set. A sketch of this bookkeeping
follows.
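A minimal sketch of the advancing list, substituting a map from keyword
to list node for the trie-node references described above; the names
and containers are my own:

    #include <cstddef>
    #include <iterator>
    #include <list>
    #include <optional>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <utility>
    #include <vector>

    // One simple pass. `recent` holds, in arrival order, the most recent
    // position of each keyword seen so far; `where` lets a stale position
    // be erased in O(1) when its keyword reappears, which is what keeps
    // the list "advancing".
    std::optional<std::pair<std::size_t, std::size_t>>
    shortestWindow(const std::vector<std::string>& doc,
                   const std::unordered_set<std::string>& keywords)
    {
        using Entry = std::pair<std::string, std::size_t>; // (keyword, position)
        std::list<Entry> recent;
        std::unordered_map<std::string, std::list<Entry>::iterator> where;
        std::optional<std::pair<std::size_t, std::size_t>> best;

        for (std::size_t i = 0; i < doc.size(); ++i) {
            if (!keywords.count(doc[i])) continue;         // not a keyword
            auto w = where.find(doc[i]);
            if (w != where.end()) recent.erase(w->second); // drop old position
            recent.emplace_back(doc[i], i);
            where[doc[i]] = std::prev(recent.end());
            if (recent.size() == keywords.size()) {        // complete set
                std::size_t start = recent.front().second; // oldest entry
                if (!best || i - start < best->second - best->first)
                    best = std::make_pair(start, i);
            }
        }
        return best;   // word-index window, if any complete set was seen
    }

Every keyword occurrence does O(1) hash and list work, so the scan is a
single loop with no nested trimming; only the keyword identification
cost sits on top of it.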
Sticker wrote:
  May I draw the final conclusion that the time complexity must involve
  the total length of the keywords as well as the length of the input
  text, rather than being linear in the input text's length?
Even when storing all relevant keyword data in a single pass through
the document, the case where a keyword is identified is more complex
than the case where one is not. The actual time complexity is O(n+m+t),
with t the number of keyword occurrences in the document. But we are
not magicians: as far as meeting the O(n) requirement goes, we must
assume that m is O(n).