My apologies, here are the testing files attached.
Begin forwarded message:
Date: 14 October 2015 at 10:06:33 BST
Subject: AutoDetectParser bug?
Hi
There might be a bug with the AutoDetectParser, which fails to recognise some plain-text files as plain text.
In the attachment are three testing files, as you can see they are all plain text.
The following code is used for my testing:
————————
AutoDetectParser parser = new AutoDetectParser();
for (File f : new File("/Users/-/work/jate/experiment/bugged_corpus").listFiles()) {
    BodyContentHandler handler = new BodyContentHandler(-1);
    Metadata metadata = new Metadata();
    try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
        parser.parse(in, handler, metadata);
        String content = handler.toString();
        System.out.println(metadata); // line A
    } catch (Exception e) {
        e.printStackTrace();
    }
}
————————
For the three test files, I would expect line A to report Content-Type=text/plain; instead it prints the following:
X-Parsed-By=org.apache.tika.parser.EmptyParser Content-Type=image/x-portable-bitmap
X-Parsed-By=org.apache.tika.parser.DefaultParser X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3 Content-Type=audio/mpeg
X-Parsed-By=org.apache.tika.parser.EmptyParser Content-Type=image/x-portable-bitmap
As a result, the variable "content" is always empty.
Any suggestions on this please?
Thanks
ID3, C4.5 and C5.0 produce decision trees; RIPPER is a rule-based learner; and the Naive Bayes algorithm computes conditional probabilities of the classes from the instances.
In all experiments the SVM_Light system outperformed the other learning algorithms, which confirms Yang's (Yang and Liu, 1999) results for SVMs fed with Reuters data.
For LVQ the decrease may be due to the fact that no adaptations to the domain were made. The accuracy of the top five alternatives (Best5) can also be measured.
If a new category is introduced, the accuracy will decline slightly until 30 documents are manually classified and the category is automatically included in a new classifier.
STP may use both general linguistic knowledge and linguistic algorithms or
heuristics adapted to the application in order to extract information from
texts that is relevant for classification .
Obviously , the change of topics can be accommodated by adding new categories
and e-mails and producing a new classifier on the basis of old and new data .
These properties influenced the system architecture, which is presented in Section 3. Various publicly available SML systems have been tested with different methods of STP-based preprocessing.
Call center agents judge the performance of ICC-MAIL most easily in terms of
accuracy : In what percentage of cases does the classifier suggest the correct
text block ?
A client/server solution was built that allows the call center agents to connect as clients to the ICC-MAIL server, which implements the system described in Section 3.
MorphAna: Morphological analysis provided by SMES yields the word stems of nouns, verbs and adjectives, as well as the full forms of unknown words.
Combined : In order to emphasize words found relevant by the STP heuristics
without losing other information retrieved by MorphAna , the previous two
techniques are combined .
If an e-mail contains several questions , the classification process can be
repeated by marking each question and iteratively applying the process to the
marked part .
the domain were made, such as adapting the number of codebook vectors, the initial learning parameters or the number of iterations during training (cf.
We noted that in six trials the accuracy could be improved in Combined compared
to MorphAna , but in four trials , boosting led to deterioration .
This includes heuristics for the identification of multiple requests in a
single e-mail that could be based on key words and key phrases as well as on
the analysis of the document structure .
* A reorganization of the existing three-level category system into a semantically consistent tree structure would allow us to explore the nonterminal nodes of the tree for multi-layered SML.
The whole process brings about high costs in analyzing and modeling the
application domain , especially if it is to take into account the problem of
changing categories in the present application .
The implementation and usage of the system, including the graphical user interface, is presented in Section 5. We conclude by giving an outlook on further expected improvements (Section 6).
In the categorization phase, the new document is preprocessed, and a result vector is built as described above and handed over to the categorizer (cf. Figure 1).
Negations were found to describe a state to be changed or to refer to missing
objects , as in I can not read my email or There is no correct date .
As a result of the tests in our application domain , we identified a favorite
statistical tool and found that task-specific linguistic preprocessing is
encouraging , while general STP is not .
The workflow of the system consists of a learning step carried out off-line (the light gray box) and an online categorization step (the dark gray box).
For a call center agent , identifying the customer 's problem is often
time-consuming , as the problem space changes if new products are launched or
existing regulations are modified .
It has not yet generally been investigated how the type of data influences the learning result (Yang, 1999), or under which circumstances which kind of preprocessing and which learning algorithm is most appropriate.
Several aspects must be considered : Length of the documents , morphological
and syntactic well-formedness , the degree to which a document can be uniquely
classified , and , of course , the language of the documents .
We start out from the corpus of categorized e-mails described in Section 2. In order to normalize the vectors representing the preprocessing results of texts of different length, and to concentrate on relevant material (cf.
While we tried various kinds of linguistic preprocessing, systematic experiments have been carried out with morphological analysis (MorphAna), shallow parsing heuristics (STP-Heuristics), and a combination of both (Combined).
If, on the other hand, no proposed solution is found to be adequate, the ICC-MAIL tool can still be used to manually select any text block and copy it into a backup folder.
* A preliminary test of the throughput achieved by using the STP and SML technology in ICC-MAIL showed that experienced users take about 50-70 seconds on average for one cycle, as described above.
It showed that the surface and the look-and-feel are accepted and the functionality corresponds to the real-time needs of the call center agents, as users were slightly faster than within their usual environment.
decision trees , decision rules or probability weightings .
Figure 1: Architecture of the ICC-MAIL system.
The chunk parser itself is subdivided into three components .
Each entry represents the occurrence of the corresponding feature .
During the learning phase , each document is preprocessed .
The heuristics were implemented in ICC-MAIL using SMES.
3 Integrating Language Technology with Machine Learning
We identified them through negation particles .
Why is this the case?
Would more extensive linguistic preprocessing help ?
We are using a lexicon of approx .
How can I start my email program?
neural networks are rather sensitive to misconfigurations .
In general , SML tools work with a vector representation of data .
The probability increases with the distance of the vector from the hyperplane.
100,000 word stems of German (Neumann et al., 1997).
Moreover, these data did not contain multiple queries in one e-mail.
The distance is measured by computing, e.g., the Euclidean distance between the vectors.
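The nearest-neighbor categorizer described here can be sketched as follows; the tiny vectors, the category labels and k = 3 are illustrative assumptions, not data from the experiments above.

```java
import java.util.*;

// Minimal k-nearest-neighbor sketch: stores labeled vectors and classifies
// a new vector by majority vote among the k closest (Euclidean distance).
class Knn {
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    static String classify(double[][] train, String[] labels, double[] query, int k) {
        // Sort training indices by distance to the query vector.
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> euclidean(train[i], query)));
        // Majority vote among the k nearest neighbors.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++)
            votes.merge(labels[idx[i]], 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        double[][] train = {{1, 0, 0}, {1, 1, 0}, {0, 0, 1}, {0, 1, 1}};
        String[] labels = {"mail-setup", "mail-setup", "billing", "billing"};
        System.out.println(classify(train, labels, new double[]{1, 1, 1}, 3));
    }
}
```

Changing k or the distance function is exactly the knob for controlling the amount of generalization mentioned elsewhere in the text.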
STP tools used for classification tasks promise very high recall/precision or accuracy values.
A document is said to belong to the class with the highest probability .
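This decision rule (pick the class with the highest probability, as the Naive Bayes algorithm does from per-class conditional probabilities) can be sketched as follows; the class names, priors and conditional probabilities are made up for illustration.

```java
import java.util.*;

// Sketch of the argmax decision rule: score each class by
// log P(class) + sum_i count_i * log P(feature_i | class)
// and return the class with the highest score.
class NaiveBayesArgmax {
    static String classify(Map<String, double[]> condProbs,
                           Map<String, Double> priors,
                           int[] doc) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cls : priors.keySet()) {
            double score = Math.log(priors.get(cls));
            double[] p = condProbs.get(cls);
            for (int i = 0; i < doc.length; i++)
                score += doc[i] * Math.log(p[i]);  // log P(feature_i | class) per occurrence
            if (score > bestScore) { bestScore = score; best = cls; }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, double[]> cond = new HashMap<>();
        cond.put("mail-setup", new double[]{0.6, 0.1});
        cond.put("billing",    new double[]{0.1, 0.6});
        Map<String, Double> priors = new HashMap<>();
        priors.put("mail-setup", 0.5);
        priors.put("billing", 0.5);
        System.out.println(classify(cond, priors, new int[]{3, 1}));
    }
}
```

Working in log space avoids numeric underflow when many features are multiplied together.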
They store all documents as vectors during the learning phase .
The relearning step is based on data from this database .
We call this replacement of a classifier "relearning".
The latter most likely refers to the preceding sentence , e.g.
The potential of the technology presented extends beyond call center
applications .
For the application in hand , this was not the case .
The experiments described in Section 4 make use of this feature .
The boosting for RIPPER seems to run into problems of overfitting .
* Further task-specific heuristics aiming at general structural linguistic properties should be defined.
In the first step , phrasal fragments like general nominal expressions and verb
groups are recognized .
Then each single document is translated into a vector of numbers isomorphic to
the defining vector .
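The translation of a document into a vector isomorphic to the defining feature vector can be sketched as follows; the feature list and the sample document are illustrative stand-ins, not the system's actual lexicon.

```java
import java.util.*;

// Sketch of the vector representation used by SML tools: given a fixed
// list of features (here, word forms), a document is mapped to a count
// vector in which position i counts occurrences of feature i.
class DocVector {
    static int[] toVector(List<String> features, String document) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < features.size(); i++) index.put(features.get(i), i);
        int[] vec = new int[features.size()];
        for (String token : document.toLowerCase().split("\\W+")) {
            Integer i = index.get(token);
            if (i != null) vec[i]++;   // tokens outside the feature list are ignored
        }
        return vec;
    }

    public static void main(String[] args) {
        List<String> features = Arrays.asList("email", "start", "password");
        System.out.println(Arrays.toString(
            toVector(features, "How can I start my email? My email hangs.")));
        // [2, 1, 0]
    }
}
```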
SML techniques are used to build a classifier that is used for new , incoming
messages .
A closer look at the data the ICC-MAIL system is processing will clarify the task further.
Support Vector Machines (SVMs): SVMs are described in (Vapnik, 1995).
Neural Networks: Neural networks are a special kind of "non-symbolic" eager learning algorithm.
This figure was gained through experiments with three users over a duration of
about one hour each .
The suggested answer text is associated with the category named "Delete & Reinstall AOL 4.0".
In the categorization phase , a new document vector leads to the activation of
a single category .
We intend to explore its use within an information broking assistant in
document classification .
The accuracy improves with usage , since each relearning step will yield better
classifiers .
Several pruning or specialization heuristics can be used to control the amount
of generalization .
For each class , a categorizer is built by computing such a hyper plane .
STP-Heuristics : Shallow parsing techniques are used to heuristically identify
sentences containing relevant information .
The k-nearest neighbor algorithm IB performed surprisingly badly although different values of k were used.
The system is currently undergoing extensive tests at the call center of AOL
Bertelsmann Online .
Other features of the ICC-MAIL client module include a spell checker and a history view.
In our case the relevant features consist of the user-defined output of the
linguistic preprocessor .
This paper describes a new approach to the classification of e-mail requests
along these lines .
Thus human intervention seems mandatory to allow for individual, customized answers.
A drastic example is shown in Figure 2. The bad conformance to linguistic standards
SVMs are binary learners in that they distinguish positive and negative
examples for each class .
We expected that content words in these constructions should be particularly
influential to the categorization .
All experiments were carried out using 10-fold cross-validation on the data
described in Section 2 .
The nature of these documents will allow us to explore the application of more
sophisticated language technologies during linguistic preprocessing .
In a further industrial project with German Telekom , the ICC-MAIL technology
will be extended to process multi-lingual press releases .
Next , the dependency-based structure of the fragments of each sentence is
computed using a set of specific sentence patterns .
By combining both methodologies in ICC-MAIL, we achieve high accuracy and can still preserve a useful degree of domain-independence.
These types of information can be used to identify the linguistic properties of
a large training set of categorized e-mails .
STP gathers partial information about text such as part of speech , word stems
, negations , or sentence type .
We carried out experiments with unmodified e-mail data accumulated over a
period of three months in the call center database .
From each of these lists, the 100 most frequent results, according to a TF/IDF measure, are selected.
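The TF/IDF-based selection can be sketched as follows. The exact TF/IDF variant is not specified in the text; this sketch uses the common tf * log(N/df) form, and the three-document corpus and cutoff of 2 are illustrative (the system keeps the 100 most frequent results per list).

```java
import java.util.*;

// Sketch of TF/IDF feature selection: score each term over a small corpus
// by total frequency times log(N / document frequency), then keep the
// top-n terms. Terms frequent overall but present in every document
// (like "email" below) score zero and are filtered out.
class TfIdfSelect {
    static List<String> topTerms(List<List<String>> docs, int n) {
        Map<String, Integer> tf = new HashMap<>();   // total term frequency
        Map<String, Integer> df = new HashMap<>();   // document frequency
        for (List<String> doc : docs) {
            for (String t : doc) tf.merge(t, 1, Integer::sum);
            for (String t : new HashSet<>(doc)) df.merge(t, 1, Integer::sum);
        }
        List<String> terms = new ArrayList<>(tf.keySet());
        terms.sort(Comparator.comparingDouble(
            (String t) -> tf.get(t) * Math.log((double) docs.size() / df.get(t)))
            .reversed());
        return terms.subList(0, Math.min(n, terms.size()));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("email", "crash", "crash"),
            Arrays.asList("email", "password"),
            Arrays.asList("email", "install"));
        System.out.println(topTerms(docs, 2));
    }
}
```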
Questions are identified by their word order, i.e. yes-no questions start with a verb and wh-questions with a wh-particle.
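This word-order heuristic can be sketched as follows; the word lists are small English stand-ins for illustration only (the system works on German and uses its own lexicon), so coverage here is deliberately minimal.

```java
import java.util.*;

// Heuristic question typing by word order, as described above:
// wh-questions start with a wh-particle, yes-no questions with a verb.
class QuestionType {
    static final Set<String> WH = new HashSet<>(Arrays.asList(
        "who", "what", "when", "where", "why", "how", "which"));
    static final Set<String> VERBS = new HashSet<>(Arrays.asList(
        "is", "are", "do", "does", "can", "could", "will", "would", "has", "have"));

    static String type(String sentence) {
        String first = sentence.trim().toLowerCase().split("\\s+")[0];
        if (WH.contains(first)) return "wh-question";
        if (VERBS.contains(first)) return "yes-no-question";
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(type("How can I start my email program?")); // wh-question
        System.out.println(type("Can I read my email offline?"));      // yes-no-question
        System.out.println(type("There is no correct date."));         // other
    }
}
```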
Other tests not reported in Table 1 looked at improvements through more general
and sophisticated STP such as chunk parsing .
Thus the preprocessing results will often differ for e-mails expressing the
same problem and hence not be useful for SML .
The definition of new categories must be fed into ICC-MAIL by a "knowledge engineer", who maintains the system.
By changing the number of neighbors k or the kind of distance measure , the
amount of generalization can be controlled .
P1, P2, ..., Pm are all proper representations of U in R, and p1, p2, ..., pn are the parts of them which represent V.
For example , anaphoric references and syntactic functions may be coded by the
same kind of attribute-value pairs , but are usually considered as different
ambiguity types .
For example , syntactic dependencies may be coded geometrically in one
representation system , and with features in another , but disambiguating
questions should be the same .
If "interpreter" is given, it means that an expert system of the generic task at hand could not be expected to solve the ambiguity.
Attempts have also been made on French texts and dialogues , and on monolingual
telephone dialogues for which analysis results produced by automatic analyzers
were available .
Returning to G, we might then say that "the" representation of U is the disjunction of all trees T associated to U via G.
In many contexts , automatic analyzers can not fully disambiguate a sentence or
an utterance reliably , but can produce ambiguous results containing the
correct interpretation .
This may be illustrated by the following diagram, where we take the representations to be tree structures represented by triangles.
When we define ambiguity types , the linguistic intuition should be the main
factor to consider , because it is the basis for any disambiguation method .
Finally, an ambiguity pattern is a schema with variables which can be instantiated to a (usually unbounded) set of ambiguity kernels.
We don't elaborate, as ambiguity patterns are specific to particular representation systems and analyzers, so they should not appear in our labeling.
For example, the kernel header "EMMI10a-2'5.1" identifies kernel #2' in dialogue EMMI 10a, noted here EMMI10a.
That could be different in a context where "state" could be construed as a proper noun ("State"), for example in a dialogue involving the State Department.
For example , what is the use of defining a system of 1000 semantic features if
no system and no lexicographers may assign them to terms in an efficient and
reliable way ?
In such contexts, the automatic analyzer cannot fully and reliably disambiguate a sentence or an utterance, and the best available heuristics don't select the correct results often enough.
Then comes the ambiguity type (structure, comm_act, class, meaning, target language, reference, address, situation, mode) and its value(s).
Ambiguities of segmentation into utterances are frequent , and most annoying ,
as analyzers generally work utterance by utterance , even if they can access
analysis results of the preceding context .
/TURN is optional and should be inserted to close the list of utterances, that is, if the next paragraph contains only one utterance and does not begin with PARAG.
Ambiguities of segmentation into paragraphs may occur in written texts , if ,
for example , there is a separation by a character only , without or .
Although utterance-level ambiguities must be considered in the context of whole utterances, a sequence like "international telephone services" is ambiguous in the same way in utterances (1) and (3) above.
A fragment V presents an ambiguity of multiplicity n (n >= 2) in an utterance U if it has n different proper representations which are part of n or more proper representations of U.
The following example is like the famous one: "Time flies like an arrow". "Linguist's examples" are often derided, but they really appear in texts and dialogues.
It has been first necessary to define formally the very notion of ambiguity
relative to a representation system , as well as associated concepts such as
ambiguity kernel , ambiguity scope , ambiguity occurrence .
- Let P be a proper representation of U and Q be a minimal underspecified part
of P. The scope of the ambiguity of underspecification exhibited by Q is the
fragment V represented by Q.
paragraphs), or a turn (resp.
For instance , text-to-speech requires less detail than translation .
Hence , some ambiguities may remain after extralinguistic disambiguation .
They are much more frequent and problematic in dialogues .
2 Representations, Ambiguities and Associated Notions
Mutsuko Tomokiyo, ATR Interpreting Telecommunications Research Labs
Each ambiguity kernel begins with its header .
Usually , linguists say that U has several representations with reference to G.
"5.1" is the coding of (11).
Part of these collected ambiguities have been used for experiments on
interactive disambiguation .
For example , two decorated trees may differ in their geometry or not .
Which combinations are possible should be determined by the person doing the
labeling .
trees decorated with various types of structures are very popular .
In the case of utterances , the same remarks apply .
Third , the representations should be amenable to efficient computer processing
.
Which class of representation systems do we consider in our labeling ?
a text) can be segmented in at least two different ways into turns (resp.
The second case may never occur in representations where all attributes are present in each decoration.
Theory and practice of ambiguity labeling with a view to interactive
disambiguation in text and speech MT
V is an ambiguity scope of an ambiguity if it is minimal relative to that
ambiguity .
Bracketed numbers are optional and correspond to the turns or paragraphs as
presented in the original .
We have experimented our technique on various kinds of dialogues and on some
texts in several languages .
Consider the utterance: (1) Do you know where the international telephone services are located?
In the case of an anaphoric element , Q will presumably correspond to one word
or term V.
Here is an ambiguity pattern of multiplicity 2 corresponding to the example above (constituent structures).
The idea is to label all ambiguity occurrences , but only the ambiguity kernels
not already labeled .
Their list will be completed in the future as more ambiguity labeling is
performed .
The linguists may define more types and complete the list of values if
necessary .
* What are the possible methods of interactive disambiguation, for each ambiguity type?
If the representations are complex , the difference between two representations
is defined recursively .
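Such a recursive definition of difference can be sketched as follows; the simple labeled `Tree` type is an illustrative stand-in for the richer decorated structures discussed in the text.

```java
import java.util.*;

// Sketch of a recursive difference test for complex representations:
// two labeled trees differ if their root labels differ, if their arity
// differs, or if some pair of corresponding subtrees differs.
class TreeDiff {
    static final class Tree {
        final String label;
        final List<Tree> children;
        Tree(String label, Tree... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    static boolean differ(Tree a, Tree b) {
        if (!a.label.equals(b.label)) return true;          // base case: labels
        if (a.children.size() != b.children.size()) return true;
        for (int i = 0; i < a.children.size(); i++)         // recurse on subtrees
            if (differ(a.children.get(i), b.children.get(i))) return true;
        return false;
    }

    public static void main(String[] args) {
        Tree np1 = new Tree("NP", new Tree("N", new Tree("services")));
        Tree np2 = new Tree("NP", new Tree("N", new Tree("service")));
        System.out.println(differ(np1, np2)); // true
        System.out.println(differ(np1, np1)); // false
    }
}
```

For decorated trees, the base case would compare both geometry and decorations, matching the remark that two decorated trees may differ in their geometry or not.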
A representation in a formal representation system is proper if it contains no
exclusive disjunction .
In practice , however , developers prefer to use hybrid data structures to
represent utterances .
Further refinements can be made only with respect to the intended
interpretation of the representations .
For each paragraph or turn , we then label the ambiguities of each possible
utterance .
The interpretation of "I am to" (obligation or future) is solvable reliably only by the speaker.
A "computable" representation system is a representation system for which a "reasonable" parser can be developed.
But if we use f-structures with disjunctions, U will always have one (or zero!) associated structure S.
It is useful to study various properties of these ambiguities in view of subsequent total or partial interactive disambiguation.
We found many examples of such ambiguities in ATR's transcriptions of Wizard of Oz interpretation dialogues [10].
In the first pair (constituent structures), "international telephone services" is represented by a complete subtree.
In the second pair (dependency structures), the representing subtrees are not complete subtrees of the whole tree.
In the third part, we propose a format for ambiguity labeling, and illustrate it with examples from a transcribed dialogue.
We have proposed a technique for labeling ambiguities in texts and in dialogue transcriptions, and experimented with it on multilingual data.
A 50~60% overall Viterbi consistency then corresponds to a 65~75% individual success rate, which is optimistic.
It is indeed frequent that an ambiguity relative to a fragment appears ,
disappears and reappears as one broadens its context .
In a data base , it suffices to store only the kernels , and references to the
kernels from the utterances .
The mark TURN (or PARAG for a text) must be used if there is more than one utterance.
Then , we would like to say that S is ambiguous if it contains at least one
disjunction .
It should be done at a less specific level , suitable for generating
disambiguation dialogues understandable by non-specialists .
In the case above , for example , we might have the configurations given in the
figure below .
As the usual notion of ambiguity is too vague for our purpose , it is necessary
to refine it .
V is a fragment of U, usually, but not necessarily, connected: the scope of the ambiguity.
(tense {pres past}) ...) ("books" ((lex "book-N") (cat noun) ...))): there would be 2 proper representations, one with (tense pres), and the other with (tense past).
a paragraph) can be segmented in at least two different ways into utterances, or an utterance can be analyzed in at least two different ways, whereby the analysis is performed in view of translation into one or several languages in the context of a certain generic task.
Further extralinguistic and sure disambiguation may be performed (1) by an expert system, if the task is constrained enough; (2) by the users (author or speakers), through interactive disambiguation; and (3) by a (human) expert translator or interpreter, accessible through the network.
The first case often happens in the case of anaphoras: (ref?), or in the case where some information has not been exactly computed, e.g. (taskdomain?), (decade of month?), but is necessary for translating into at least one of the target languages.
An ambiguity occurrence, or simply ambiguity, A, of multiplicity n (n >= 2) relative to a representation system R, may be formally defined as: A = (U, V, P1...Pm, p1...pn), where m >= n, and: U is a complete utterance, called the context of the ambiguity.
However, as soon as they are taken out of context, they look again as artificial as "linguist's examples". Although many studies on ambiguities have been published, the specific goal of studying ambiguities in the context of interactive disambiguation in text and speech translation has led us to explore new ground and to propose the concept of "ambiguity labeling".
It is interesting to compare the intuition of the human labeller with results actually produced: most of the time, differences may be attributed to the fact that available analyzers don't yet match our expectations for "state of the art" analyzers, because they produce spurious, "parasite" ambiguities, and don't yet implement all types of sure linguistic constraints.
For example, an expert interpreter "monitoring" several bilingual conversations could solve some ambiguities from his workstation, either because the system decides to ask him first, or
1 According to a study by Cohen & Oviatt, the combined success rate (SR) is bigger than the product of the individual success rates by about 10% in the middle range.
In the case of a classical context-free grammar G, shall we say that a representation of U is any tree T associated to U via G, or that it is the set of all such trees?
This is done in the second part , where we define formally the notion of
ambiguity relative to a representation system , as well as associated concepts
such as kernel , scope , occurrence and type of ambiguity .
However, the same spoken language analyzers may be able to produce sets of outputs containing the correct analysis in about 90% of the cases ("structural consistency" [2])2.
For example, attachment ambiguities are represented differently in the outputs of various analyzers, but it is always possible to recognize such an ambiguity, and to explain it by using a "skeleton" flat bracketing.
For instance, "Please state your phone number" should not be deemed ambiguous, as no complete analysis should allow "state" to be a noun, or "phone" to be a verb.
This means that any strictly smaller fragment W of U has strictly less than n associated sub-representations or, equivalently, that at least two of the representations of V are equal with respect to W.
We suppose an architecture flexible enough to allow the above three
extralinguistic processes to be optional , and , in the case of interactive
disambiguation , to allow users to control the amount of questions asked by the
system .
For lack of space, we cannot give here the context-free grammar which defines our labeling formally, and illustrate the underlying principles by way of examples from a dialogue transcription taken from [1].
In that case, it is important that the questions asked of the users are the most crucial ones, so that failure of the last step to select the correct interpretation does not result in too damaging translation errors.
Ambiguity labeling may also be considered as part of the specification of
present and future state of the art analyzers , which means that : it should be
compatible with the representation systems used by the actual or intended
analyzers .
Finally , our labeling should only be concerned with the final result of
analysis , not in any intermediate stage , because we want to retain only
ambiguities which would remain unsolved after the complete automatic analysis
process has been performed .
We take it for granted that, for each considered representation system, we know how to define, for each fragment V of an utterance U having a proper representation P, the part of P which represents V.
P4. Vocabulary learning means learning the words and their limitations, probability of occurrences, and syntactic behavior around them (Swartz & Yazdani, 1992).
Answers 7 and 10 are examples of bypassing strategies, i.e., the use of a different verb or another sentence structure as a means of avoiding relative clauses.
Children, who now go to French schools, often switch back to English for their leisure activities because of the scarcity of options open to them.
When the children were asked about the main subject in the picture , the
answers were acceptable in standard French , showing that they had no problems
in using relative clauses with qui .
Additionally, we are studying the state of the art of systems using Artificial Intelligence techniques as well as NLP resources and/or methodologies for teaching language, especially for bilingual and minority groups.
Another novelty is in the pedagogical approach of exposing the learner to the
expert model and to the learner model in a comparative manner , thus helping to
clarify the sources of error .
The following examples give a brief survey of the use of indirect object relative clauses: avec lequel/laquelle, sur lequel/laquelle, à qui, and dont: 11.
However, unexpected "pop-up" activities would come up on the screen from time to time (in the style of "Tip of the day" or a "TV ad").
At around the same period researchers were starting to put also some emphasis
on the teaching strategies adopted in the system such as in WEST , Burton &
Brown -LRB- 1976 -RRB- .
In such a case one cannot help but think about the advantages that technology can offer, especially in an era where language resources are ready for the picking.
In the following sections we summarize an empirical study that helped us. To our knowledge, there are no systems that use machine translation tools for generating two versions of the same language instead of multilingual generation.
The syntactic graph and the lexicon are annotated with probabilities on usually faulty expressions in order to intensify the explanation or the number of examples and exercises on those particular parts (principles P3 and P4).
In an effort to gain some insight into the projected linguistic model, an empirical study on the population of elementary students in the City of Moncton, New Brunswick, Canada was completed.1
Otherwise they use a bypassing strategy, by separating the sentence into two parts as in "C'est une branche et un oiseau", or by using another verb that allows qui as in 18.
This paper presents a project that investigates to what extent computational
linguistic methods and tools used at GETA for machine translation can be used
to implement novel functionalities in intelligent computer assisted language
learning .
double generation of Acadian French and Standard French.
- SYGMOR for the morphological generation sub-agent.
How to implement parsers that can process ungrammatical input ?
Using Language Resources in an Intelligent Tutoring System for French
The lexicon can be augmented similarly .
- ATEF for the morphological analysis subagent .
(Bypassing strategy)
2.2 Object relative clauses
How to represent the linguistic knowledge in the expert and learner models ?
The user chooses where to start by clicking on a hot-button picture .
3. Puzzle playing, where words have assigned shapes according to their functions.
Answer 8 shows a common use of the preposition à instead of de.
Figure 2: Annotated tree for a sentence in standard French.
Fitting the puzzle means placing the words in the correct order .
How to implement teaching strategies that are appropriate for language learning
?
Our intelligent tutoring system project is still in its early phases .
We do not intend to build a fully free learning environment .
In this case only the formalisms at GETA would be exploited not the existing
linguistic data .
The corpus consisted of the sentences collected during the empirical study
-LRB- see section 2 -RRB- .
This allows the teacher to take responsibility of the degree of unstructured or
of focused learning .
(Use of an English verb) 4. C'est une fille qui botte le ballon.
In brief , this setting of language learning is not that of a typical native
speaker .
However , language learning had its own specific difficulties that were not
generalized in other ITS systems .
Answer 9 is also representative of the frequent use of prepositions at the end
of the sentence .
For many years GETA has been working on MT systems from and into french .
(Use of an inappropriate verb) 5. C'est un papa et son garçon.
Many of these children use English syntax as well as borrowed vocabulary quite
frequently .
The same grammar can be used by augmenting its rules to include new/different sentence structures.
In the next sections , we will examine the children 's answers concerning
relative clauses .
Introduction
The project that we have started is intended for the minority French-speaking Acadian community living in Atlantic Canada.
We have also seen , through an empirical study , the kinds of linguistic
difficulties that a minority group is encountering .
Recent systems show how researchers are being more open to psycho linguistic ,
pedagogical and applied linguistic theories .
Conclusion We have presented in this paper an ongoing software development
project that is still in its early phases .
Another alternative would be to consider the non-standard french as a
completely new language from all points of view .
2 . Corpuses or extracts from children stories are equipped with hyperlinks to
word meanings or grammar usage explanations .
By looking at these examples , it is evident that complex relative clauses are
rather unknown to the children .
They show that the easiest particles for them are qui and que even when misused
as in answer 12 .
Figure 3: Annotated tree for a sentence in non-standard French.
For example, the teacher could favor certain activities, such as presenting examples of "non-standard French sentences" and opposing them to English structures in an effort to show the children some Anglicisms; or maybe choose a specific microworld, such as Halloween or Christmas, so that the exercises would be closer to children's real daily experience (principle P1).
An impressive core of linguistic knowledge is available but has not yet been experimented with in building language-learning software, though work is underway on the integration of heterogeneous NLP components, Boitet & Seligman (1994).
Among the first milestones in Intelligent Tutoring Systems (ITS) was Carbonell's system (1970), which used a knowledge base to check the student's answers and to allow him/her to interact in "natural language".
Ariane, for example, uses special-purpose rule-writing formalisms for each of its morphological and lexical modules, both for analysis and for generation, with a strict separation of algorithmic and linguistic knowledge, Hutchins & Somers (1992).
Following are some of the answers with the most frequent errors or bypassing strategies; they are marked with a *; the sentences in italics are the acceptable ones: 6. C'est le livre que le garçon lit.
We begin our presentation with a literature review of related work in Intelligent Tutoring Systems (ITS), particularly on Computer-Assisted Language Learning (CALL and Intelligent CALL), followed by the principles that this community now expects from system builders.
- EXPANSF for lexical expansion.
- TRANSF for translation into standard French.
- ROBRA, with its multi-level analysis, for syntactic tree definitions.
The first series of experiments we carried out using GETA's resources concentrates on double analysis/generation of standard French and non-standard local French.
Then, in the last section, we propose the system's general architecture and an overview of some of its activities, particularly those that counteract Anglicisms by double-generating examples in standard French and in the local dialect, using linguistic resources usually employed in machine translation.
In the introduction and in the first sections, we argued for the positive effects of computers on language learning and then discussed some of the issues that researchers in the field are hoping to see implemented from a computational and a pedagogical point of view.
It is with such works, and many others later, that the architecture of Intelligent Tutoring Systems was more or less separated into four modules: an expert model, a learner model, a teacher model, and an interface, Wenger (1987).
For example, The ICICLE Project is based on L2 learning theory (McCoy et al., 1996); Alexia (Selva et al., 1997) and FLUENT (Hamburger and Hashim, 1992) are based on constructivism; Mr. Collins (Bull et al., 1995) is based on four empirical studies in an effort to "discover" student errors and their learning strategies.
Another tendency, very noticeably parallel to that of NLP, is the development of sophisticated language resources, such as dictionaries for language (lexical) learning, as exemplified by CELINE at Grenoble (Menézo et al., 1996), the SAFRAN project (1997) and The Reader at Princeton University (1997), which uses WordNet, or real corpora as in the European project Camille (Ingraham et al., 1994).
define the learner model.
1 Language Learning and Artificial Intelligence
2.1 Subject relative clauses
2.3 Complex relative clauses
Learner Model
There is a transition from X to F labeled with a and weight w iff X → a, with weight w, is a rule of the grammar.
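To make this construction concrete, here is a minimal Python sketch (not the GRM library's API; the grammar, weights, and names are illustrative). Each nonterminal of a right-linear weighted grammar becomes a state; a rule X → a Y with weight w becomes a transition from X to Y labeled a with weight w, and a terminating rule X → a becomes a transition into a single final state F:

```python
def compile_right_linear(rules, start):
    """rules: dict mapping nonterminal X to a list of
    (terminal, weight, next_nonterminal_or_None) alternatives."""
    transitions = []
    for x, alts in rules.items():
        for a, w, nxt in alts:
            # Terminating rules (next is None) go to the single final state F.
            transitions.append((x, a, w, nxt if nxt is not None else "F"))
    return {"start": start, "final": "F", "transitions": transitions}

def accepts(automaton, word):
    """Return the minimum total weight of an accepting path, or None."""
    best = None
    def walk(state, i, w):
        nonlocal best
        if i == len(word):
            if state == automaton["final"]:
                best = w if best is None else min(best, w)
            return
        for s, a, tw, t in automaton["transitions"]:
            if s == state and a == word[i]:
                walk(t, i + 1, w + tw)
    walk(automaton["start"], 0, 0.0)
    return best

# Hypothetical grammar: X -> a X (weight 0.5) | b (weight 1.0)
g = {"X": [("a", 0.5, "X"), ("b", 1.0, None)]}
m = compile_right_linear(g, "X")
print(accepts(m, ["a", "a", "b"]))  # 0.5 + 0.5 + 1.0 = 2.0
```

The left-linear case is symmetric, reading the rule right to left with F as the start side of the transition.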
Both the grammar compilation algorithms (GRM library) and our automata optimization tools (FSM library) work in the most general case.
If dynamic grammars and lazy expansion are not needed, we can expand the result fully and then apply weighted determinization and minimization algorithms.
In many of those applications, the actual languages described are regular, but context-free representations are much more concise and easier to create.
We thank Bruce Buntschuh and Ted Roycraft for their help with defining the dynamic grammar features and for their comments on this work.
To measure the benefits of this representation, we compared the compilation time and the size of the resulting lazy automata with and without preoptimization.
A dynamic substitution consists of the application of a substitution to a nonterminal during the process of recognition of a word sequence.
Inactive nonterminals, and the rules involving them, are available for use in derivations; they are just not available as start symbols.
For example, Figure 3 shows the weighted automaton for grammar G2, consisting of the last three rules of G1 with start symbol X.
More precisely, define a dependency graph DG for G's nonterminals and examine the set of its strongly-connected components (SCCs).
If each of these subgrammars is either left-linear or right-linear, we shall see that compilation into a single finite automaton is possible.
Thus, M(X) can always be defined in constant time and space by editing the automaton K(S).
This operation does not require any recompilation, since it does not affect the automaton M(X) built for each nonterminal X.
We describe an efficient algorithm for compiling into weighted finite automata an interesting class of weighted context-free grammars that represent regular languages.
In particular, speech understanding applications require appropriate grammars both to constrain speech recognition and to help extract the meaning of utterances.
This replacement is also done on demand, with only the necessary part of the replacing automaton being expanded for a given input string.
Rule X → Y1 ... Yn, with weight w, has a corresponding path that maps X to the sequence Y1 ... Yn with weight w.
The bigram examples also show the advantages of lazy replacement and editing over the full expansion used in previous work (Pereira and Wright, 1997).
Dynamic activation or deactivation of rules: we augment the grammar with a set of active nonterminals, which are those available as start symbols for derivations.
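A toy sketch of this idea (the rule names are hypothetical, and this is not the GRM library's interface): the full rule set is kept compiled, while a separate set of active nonterminals controls which symbols may start a derivation, so toggling membership requires no recompilation:

```python
# Hypothetical compiled rule set; only membership in `active` changes at runtime.
rules = {
    "DATE": [["day", "month"]],
    "CMD":  [["verb", "DATE"]],
}
active = {"CMD"}  # only CMD may currently start a derivation

def derivable_starts():
    """Nonterminals currently usable as start symbols."""
    return sorted(active & rules.keys())

print(derivable_starts())  # ['CMD']
active.add("DATE")         # activate DATE without touching the compiled rules
print(derivable_starts())  # ['CMD', 'DATE']
```

Note that DATE's rules stay usable inside CMD's derivations throughout; deactivation only removes a symbol from the start set.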
The replacement operation is lazy, that is, the states and transitions of the replacing automata are only expanded when needed for a given input string.
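The following toy recognizer illustrates the laziness (it is an assumed illustration, not the FSM library's implementation): a path may contain a nonterminal label, and the corresponding sub-automaton is consulted only when the input actually reaches it, so sub-automata for unreached nonterminals are never expanded:

```python
def match(sym, tokens, i, autos, expanded):
    """Try each path of `sym` starting at tokens[i]; return the end index
    or None. Uppercase labels are nonterminals, expanded only on demand."""
    for path in autos[sym]:
        j, ok = i, True
        for label in path:
            if label.isupper():                    # nonterminal: lazy replacement
                expanded.add(label)                # record the expansion
                j2 = match(label, tokens, j, autos, expanded)
                if j2 is None:
                    ok = False
                    break
                j = j2
            elif j < len(tokens) and tokens[j] == label:
                j += 1                             # terminal: consume one token
            else:
                ok = False
                break
        if ok:
            return j
    return None

# Hypothetical paths: S -> "call" NAME ; NAME -> "alice" | "bob" ; DATE unused here
autos = {"S": [["call", "NAME"]], "NAME": [["alice"], ["bob"]], "DATE": [["monday"]]}
seen = set()
print(match("S", ["call", "bob"], 0, autos, seen))  # 2
print("DATE" in seen)  # False: DATE was never reached, so never expanded
```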
For example, Figure 4 shows the dependency graph for our example grammar G1, with SCCs {Z} and {X, Y}.
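The component structure can be recovered with any standard SCC algorithm. In this sketch the dependency edges are hypothetical (G1's full rule set is not reproduced here), chosen so that X and Y reference each other while Z references X, giving the SCCs {Z} and {X, Y}; Tarjan's algorithm finds them in linear time:

```python
def tarjan_scc(graph):
    """Tarjan's algorithm: return the strongly-connected components of a
    directed graph given as {node: [successor, ...]}."""
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs

# Hypothetical dependency edges for G1's nonterminals.
deps = {"X": ["Y"], "Y": ["X"], "Z": ["X"]}
print(sorted(map(sorted, tarjan_scc(deps))))  # [['X', 'Y'], ['Z']]
```

Each SCC then forms a subgrammar that is tested for left- or right-linearity as described above.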
The GRM library also includes an efficient compilation tool for weighted context-dependent rewrite rules (Mohri and Sproat, 1996) that is used in text-to-speech projects at Lucent Bell Laboratories.
It can be used to compile efficiently an interesting class of grammars representing weighted regular languages, and it allows for dynamic modifications that are crucial in many speech recognition applications.
For example, compilation is about 700 times faster in the optimized case for a fully expanded automaton, even for a 40-word vocabulary model, and the result is about 39 times smaller.
We did experiments with full bigram models with various vocabulary sizes, and with two unweighted grammars derived by feature instantiation from hand-built feature-based grammars (Pereira and Wright, 1997).
Grammar compilation takes as input a weighted CFG represented as a weighted transducer (Salomaa and Soittola, 1978), which may have been optimized prior to compilation (preoptimized).