Hi to all,

I've an algortihm thats given below, can anybody help me implementing it.
Any sort of suggestion will be appreciated. I've finished removing stop
words,
calculating term frequencies with Lucene. The rest of the part is not quite
clear.
I'm working only on the English part.

3.2.3 Extracting Representative Words
After preprocessing the page, the system retains the most
characteristic parts of the contents. Association rules are
used to collect Chinese and English phrases as
representative words. The approach can be outlined as
follows:

  -  Scan the documents to retrieve each Chinese and English word. Then,
  each individual Chinese word

      and English word can be distinguished.


  -        ‧ Discard all stopwords, i.e., neither Chinese nor English
  word. Stopwords are high frequency words

      that are found in all texts and that do not contribute to the
distinctive meaning of the text, for example

      prepositions and articles such as "in", "on" and "the".


  - ‧      Calculate the frequency of occurrence for each remaining
  word.



  - ‧      If the frequency of a word is less than two, it is
  discarded. Next, the association rule is used to

               compute the confidence value as follows.
.

Confidence (A ⇒B) = P(A∩B) / P(A)

If P(A) is 1 and the co-occurrence, P(A∩B), is also 1, then the resulting
confidence value must be 1.

Obviously, it must be greater than the threshold Clearly, P(A) cannot be 0
as this will result in a

division-by-zero [13].


  - ‧ Assign the frequencies of the remaining words to each word in the
  document. For all the words in the

document that already have been eliminated a frequency of 0 is assigned.


  - ‧ After all the inverse term frequencies of the document are
  identified and terms with a frequency

less than 2 are discarded, the document is scanned again to calculate the
confidence value of the cooccurrence.

The word is not incorporated into the current document's phrase array if: 1)
the confidence

value is below the threshold, 2) the same word has already been assigned to
document's phrase array, or

3) the subclass word is already assigned to the current document's phrase
array. For example, assume the

current document's phrase array includes the phrase "operating system". If
the system subsequently

processes the related phrase "operating", then the latter phrase is not
added to the current document's
phrase array.

The steps needed to compute the confidence value of each
word are as follows.

(1) Fetch the previous two words until either term
frequency is not equal to 0.

(2) Calculate the confidence value of this term and the
next word. This step is discussed further in a later
section.

(3) When the confidence value is less than the threshold
(0.7), the system looks up whether the value of
computing words is two. If it is, then skip and ignore
it; otherwise, we will save this phrase, excluding the
current word, into the phrase array and continue
scanning the next word. A check is performed for
each word to determine whether it exists in the phrase
array or not.

(4) If the confidence value of the phrase is larger than the
threshold, the next word is fetched. If its term
frequency is 0, then save the current phrase and the
frequency; otherwise, steps 2 to 4 are repeated until
the end of the document is reached.

Thanx in Advance,
--
சாய் Hari

Reply via email to