My apologies, here are the test files, attached.



Begin forwarded message:

From: Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>
Date: 14 October 2015 at 10:06:33 BST
Subject: AutoDetectParser bug?

Hi

There might be a bug with the AutoDetectParser, which fails to recognise some plain-text files as plain text.

Attached are three test files; as you can see, they are all plain text.

The following code is used for my testing:

————————
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

AutoDetectParser parser = new AutoDetectParser();
for (File f : new File("/Users/-/work/jate/experiment/bugged_corpus").listFiles()) {
    BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
    Metadata metadata = new Metadata();
    try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
        parser.parse(in, handler, metadata);
        String content = handler.toString();
        System.out.println(metadata); // line A
    } catch (Exception e) {
        e.printStackTrace();
    }
}
————————
For the three test files, I would expect line A to print a Content-Type of “text/plain”; instead, it prints the following:
X-Parsed-By=org.apache.tika.parser.EmptyParser Content-Type=image/x-portable-bitmap 
X-Parsed-By=org.apache.tika.parser.DefaultParser X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3 Content-Type=audio/mpeg 
X-Parsed-By=org.apache.tika.parser.EmptyParser Content-Type=image/x-portable-bitmap 

As a result, the variable “content” is always empty.
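
If it helps to narrow this down: my guess is that detection is being driven by magic bytes at the start of each file. One of the files happens to begin with the token “ID3” (which is also the magic for MP3 ID3 tags), and the others appear to begin with “P1” and “P4” (which are the portable-bitmap magics), so the byte-level detector may simply be out-ranking the plain-text heuristic. Below is a small sketch I am using to see what the default Detector decides, with and without a file-name hint; the hint part is an assumption on my side, as I am not sure a name hint can override a magic match:

————————
import java.io.File;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class DetectCheck {
    public static void main(String[] args) throws Exception {
        Detector detector = TikaConfig.getDefaultConfig().getDetector();
        for (File f : new File(args[0]).listFiles()) {
            Metadata noHint = new Metadata();
            Metadata withHint = new Metadata();
            // Assumption: giving the detector the file name lets
            // extension-based detection participate alongside magic.
            withHint.set(Metadata.RESOURCE_NAME_KEY, f.getName());
            try (TikaInputStream in = TikaInputStream.get(f)) {
                // Detectors reset the stream, so calling twice is safe.
                MediaType bare = detector.detect(in, noHint);
                MediaType hinted = detector.detect(in, withHint);
                System.out.println(f.getName() + ": " + bare + " vs " + hinted);
            }
        }
    }
}
————————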

Any suggestions on this please?
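
In the meantime, since everything in this corpus is known to be plain text, I am working around it by skipping detection altogether and calling the plain-text parser directly. A minimal sketch, assuming tika-parsers is on the classpath so that org.apache.tika.parser.txt.TXTParser is available:

————————
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;

public class ParseAsText {
    public static void main(String[] args) throws Exception {
        TXTParser parser = new TXTParser();
        for (File f : new File(args[0]).listFiles()) {
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
                // No autodetection: treat every file as text/plain and
                // let TXTParser sniff only the character encoding.
                parser.parse(in, handler, metadata, new ParseContext());
                System.out.println(metadata);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
————————

This should at least produce non-empty content for the three files, although it obviously does not explain the misdetection itself.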

Thanks

————————
[Contents of the three attached test files follow, reproduced verbatim]
————————
ID3 , C4 .5 and C5 .0 produce decision trees , RIPPER isa rule-based learner 
and the Naive Bayes algorithm computes conditional probabilities of the classes 
from the instances .
In all experiments the SVM_Light system outperformed other learning algorithms 
, which confirms Yang 's -LRB- Yang and Liu , 1999 -RRB- results for svms fed 
with Reuters data .
For LVQ the decrease may be due to the fact that no adaptations to results , 
allowing to measure the accuracy of the top five alternatives -LRB- Best5 -RRB- 
.
If a new category is introduced , the accuracy will slightly decline until 30 
documents are manually classified and the category is automatically included 
into a new classifier .
STP may use both general linguistic knowledge and linguistic algorithms or 
heuristics adapted to the application in order to extract information from 
texts that is relevant for classification .
Obviously , the change of topics can be accommodated by adding new categories 
and e-mails and producing a new classifier on the basis of old and new data .
These properties influenced the system architecture , which is presented in 
Section 3 . Various publicly available SML systems have been tested with 
different methods of STP-based preprocessing .
Call center agents judge the performance of ICC-MAIL most easily in terms of 
accuracy : In what percentage of cases does the classifier suggest the correct 
text block ?
A client\/server solution was built that allows the call center agents to 
connect as clients to the ICe-MAIL server , which implements the system 
described in Section 3 .
MorphAna : Morphological Analysis provided by sines yields the word stems of 
nouns , verbs and adjectives , as well as the full forms of unknown words .
Combined : In order to emphasize words found relevant by the STP heuristics 
without losing other information retrieved by MorphAna , the previous two 
techniques are combined .
If an e-mail contains several questions , the classification process can be 
repeated by marking each question and iteratively applying the process to the 
marked part .
the domain were made , such as adapting the number of codebook vectors , the 
initial learning parameters or the number of iterations during training -LRB- 
cf.
We noted that in six trials the accuracy could be improved in Combined compared 
to MorphAna , but in four trials , boosting led to deterioration .
This includes heuristics for the identification of multiple requests in a 
single e-mail that could be based on key words and key phrases as well as on 
the analysis of the document structure .
\* A reorganization of the existing three-level cate - null gory system into a 
semantically consistent tree structure would allow us to explore the 
nonterminal nodes of the tree for multi-layered SML .
The whole process brings about high costs in analyzing and modeling the 
application domain , especially if it is to take into account the problem of 
changing categories in the present application .
The implementation and usage of the system including the graphical user 
interface is presented in Section 5 . We conclude by giving an outlook to 
further expected improvements -LRB- Section 6 -RRB- .
In the categorization phase , the new document is preprocessed , and a result 
vector is built as described above and handed over to the categorizer -LRB- cf. 
Figure 1 -RRB- .
Negations were found to describe a state to be changed or to refer to missing 
objects , as in I can not read my email or There is no correct date .
As a result of the tests in our application domain , we identified a favorite 
statistical tool and found that task-specific linguistic preprocessing is 
encouraging , while general STP is not .
The workflow of the system consists of a learning step carried out off-line 
-LRB- the light gray box -RRB- and an online categorization step -LRB- the dark 
gray box -RRB- .
For a call center agent , identifying the customer 's problem is often 
time-consuming , as the problem space changes if new products are launched or 
existing regulations are modified .
It has not yet generally been investigated how the type of data influences the 
learning result -LRB- Yang , 1999 -RRB- , or under which circumstances which 
kind of preprocessing and which learning algorithm is most appropriate .
Several aspects must be considered : Length of the documents , morphological 
and syntactic well-formedness , the degree to which a document can be uniquely 
classified , and , of course , the language of the documents .
We start out from the corpus of categorized e-mails described in Section 2 . In 
order to normalize the vectors representing the preprocessing results of texts 
of different length , and to concentrate on relevant material -LRB- cf.
While we tried various kinds of linguistic preprocessing , systematic 
experiments have been carried out with morphological analysis -LRB- MorphAna 
-RRB- , shallow parsing heuristics -LRB- STP-Heuristics -RRB- , and a 
combination of both -LRB- Combined -RRB- .
If , on the other hand , no proposed solution is found to be adequate , the 
ICe-MAIL tool can still be used to manually select any text block and copy them 
into a backup folder .
null \* A preliminary test of the throughput achieved by using the STP and SML 
technology in Ice-MAIL showed that experienced users take about 50-70 seconds 
on average for one cycle , as described above .
It showed that the surface and the look-and-feel is accepted and the 
functionality corresponds to the real-time needs of the call center agents , as 
users were slightly faster than within their usual environment .
decision trees , decision rules or probability weightings .
Figure 1 : Architecture of the icc-mail System .
The chunk parser itself is subdivided into three components .
Each entry represents the occurrence of the corresponding feature .
During the learning phase , each document is preprocessed .
5 The heuristics were implemented in icc-mail using sines .
3 Integrating Language Technology With Machine learning
We identified them through negation particles .
Why is this the case ? .
Would more extensive linguistic preprocessing help ?
We are using a lexicon of approx .
How can I start my email program .
neural networks are rather sensitive to misconfigurations .
In general , SML tools work with a vector representation of data .
The probability increases with the distance of thevector from the hyper plane .
100000 word stems of German -LRB- Neumann et al. , 1997 -RRB- .
Moreover , these data did not contain multiple queries in one e-mall .
The distance is measured by computing e.g. the euclidean distance between the 
vectors .
STP tools used for classification tasks promise very high recall\/precision or 
accuracy values .
A document is said to belong to the class with the highest probability .
They store all documents as vectors during the learning phase .
The relearning step is based on data from this database .
We call this replacement of a classifier `` relearning '' .
The latter most likely refers to the preceding sentence , e.g.
The potential of the technology presented extends beyond call center 
applications .
For the application in hand , this was not the case .
The experiments described in Section 4 make use of this feature .
The boosting for RIPPER seems to run into problems of overfitting .
\* Further task-specific heuristics aiming at gen - null eral structural 
linguistic properties should be defined .
In the first step , phrasal fragments like general nominal expressions and verb 
groups are recognized .
Then each single document is translated into a vector of numbers isomorphic to 
the defining vector .
SML techniques are used to build a classifier that is used for new , incoming 
messages .
A closer look at the data the ICe-MAIL system is processing will clarify the 
task further .
Support Vector Machines -LRB- SVMs -RRB- : svms are described in -LRB- Vapnik , 
1995 -RRB- .
Neural Networks : Neural Networks are a special kind of `` non-symbolic '' 
eager learning algo -
This figure was gained through experiments with three users over a duration of 
about one hour each .
The suggested answer text is associated with the category named `` Delete & 
Reinstall AOL 4.0 '' .
In the categorization phase , a new document vector leads to the activation of 
a single category .
We intend to explore its use within an information broking assistant in 
document classification .
The accuracy improves with usage , since each relearning step will yield better 
classifiers .
Several pruning or specialization heuristics can be used to control the amount 
of generalization .
For each class , a categorizer is built by computing such a hyper plane .
STP-Heuristics : Shallow parsing techniques are used to heuristically identify 
sentences containing relevant information .
The k-nearest neighbor algorithm IB performed surprisingly badly although 
different values ofk were used .
The system is currently undergoing extensive tests at the call center of AOL 
Bertelsmann Online .
Other features of the ICe-MAIL client module include a spell checker and a 
history view .
In our case the relevant features consist of the user-defined output of the 
linguistic preprocessor .
This paper describes a new approach to the classification of e-mail requests 
along these lines .
Thus human intervention seems mandatory to allow for individual , customized an 
- null swers .
A drastic example is shown in Figure 2 . The bad conformance to linguistic stan 
-
SVMs are binary learners in that they distinguish positive and negative 
examples for each class .
We expected that content words in these constructions should be particularly 
influential to the categorization .
All experiments were carried out using 10-fold cross-validation on the data 
described in Section 2 .
The nature of these documents will allow us to explore the application of more 
sophisticated language technologies during linguistic preprocessing .
In a further industrial project with German Telekom , the ICC-MAIL technology 
will be extended to process multi-lingual press releases .
Next , the dependency-based structure of the fragments of each sentence is 
computed using a set of specific sentence patterns .
By combining both methodologies in ICe-MAIL , we achieve high accuracy and can 
still preserve a useful degree of domain-independence .
These types of information can be used to identify the linguistic properties of 
a large training set of categorized e-mails .
STP gathers partial information about text such as part of speech , word stems 
, negations , or sentence type .
We carried out experiments with unmodified e-mail data accumulated over a 
period of three months in the call center database .
From each of these lists , the 100 most frequent results - according to a 
TF\/IDF measure - are selected .
Questions are identified by their word order , i.e. yes-no questions start with 
a verb and wh-questions with a wh-particle .
Other tests not reported in Table 1 looked at improvements through more general 
and sophisticated STP such as chunk parsing .
Thus the preprocessing results will often differ for e-mails expressing the 
same problem and hence not be useful for SML .
The definition of new categories must be fed into ICe-MAIL by a `` knowledge 
engineer '' , who maintains the system .
By changing the number of neighbors k or the kind of distance measure , the 
amount of generalization can be controlled .
P1 , P2 ... Pm are all proper representations of U in R , and Pl , P2 ... Pn 
are the parts of them which represent V.
For example , anaphoric references and syntactic functions may be coded by the 
same kind of attribute-value pairs , but are usually considered as different 
ambiguity types .
For example , syntactic dependencies may be coded geometrically in one 
representation system , and with features in another , but disambiguating 
questions should be the same .
If `` interpreter '' is given , it means that an expert system of the generic 
task at hand could not be expected to solve the ambiguity .
Attempts have also been made on French texts and dialogues , and on monolingual 
telephone dialogues for which analysis results produced by automatic analyzers 
were available .
Returning to G , we might then say that `` the '' representation of U is the 
disjunction of all trees T associated to U via G.
In many contexts , automatic analyzers can not fully disambiguate a sentence or 
an utterance reliably , but can produce ambiguous results containing the 
correct interpretation .
This may be illustrated by the following diagram , A P2 ` p3 _ where we take 
the representations to be tree structures represented by triangles .
When we define ambiguity types , the linguistic intuition should be the main 
factor to consider , because it is the basis for any disambiguation method .
Finally , An ambiguity pattern is a schem ~ i wfth variables which can be 
instantiated to a -LRB- usually unbounded -RRB- set of ambiguity kernels .
We do n't elaborate , as ambiguity patterns are specific to particular 
representation systems and analyzers , so that they should not appear in our 
labeling .
For example , the kernel header `` ambiguity ~ I10a-2 ' 5.1 '' identifies 
kernel # 2 ' in dialogue EMMI 10a , noted here EMMI10a .
That could be different in a context where `` state '' could be construed as a 
proper noun -LRB- `` State '' -RRB- , for example in a dialogue involving the 
State Department .
For example , what is the use of defining a system of 1000 semantic features if 
no system and no lexicographers may assign them to terms in an efficient and 
reliable way ?
In such contexts , the automatic analyzer can not fully and reliably 
disambiguate a sentence or an utterance , and the best available heuristics do 
n't select the correct results often enough .
Then comes the ambiguity type -LRB- structure , comm_act , class , meaning , 
target language , reference , address , situation , mode -RRB- and its value 
-LRB- s -RRB- .
Ambiguities of segmentation into utterances are frequent , and most annoying , 
as analyzers generally work utterance by utterance , even if they can access 
analysis results of the preceding context .
\/ TURN is optional and should be inserted to close the list of utterances , 
that is if the next paragraph contains only one utterance and does not begin 
with PARAG .
Ambiguities of segmentation into paragraphs may occur in written texts , if , 
for example , there is a separation by a  character only , without  or  .
Although utterance-level ambiguities must be considered in tile context of 
whole utterances , a sequence like `` international telephone services '' is 
ambiguous in the same way in utterances -LRB- l -RRB- and -LRB- 3 -RRB- above .
I A fragment V presents an ambiguity of multiplicity n -LRB- n -RRB- 2 -RRB- in 
an utterance U if it has n different proper representations which are part of n 
or more proper representations of U.
The following example is like the famous one : `` Time flies like an arrow '' 
\/ `` Linguist 's examples '' are often derided , but they really appear in 
texts and dialogues .
It has been first necessary to define formally the very notion of ambiguity 
relative to a representation system , as well as associated concepts such as 
ambiguity kernel , ambiguity scope , ambiguity occurrence .
- Let P be a proper representation of U and Q be a minimal underspecified part 
of P. The scope of the ambiguity of underspecification exhibited by Q is the 
fragment V represented by Q.
paragraphs -RRB- , or a turn -LRB- rcsp .
For instance , text-to-speech requires less detail than translation .
Hence , some ambiguities may remain after extralinguistic disambiguation .
They are much more frequent and problematic in dialogues .
2 representations , Ambiguities and Associated Notions
Mutsuko Tomokiyo ATR Interpreting telecommunications research Labs
Each ambiguity kernel begins with its header .
Usually , linguists say that U has several representations with reference to G.
`` 5.1 '' is the coding of \ -LRB- 11 \ -RRB- .
Part of these collected ambiguities have been used for experiments on 
interactive disambiguation .
For example , two decorated trees may differ in their geometry or not .
Which combinations are possible should be determined by the person doing the 
labeling .
trees decorated with various types of structures are very popular .
In the case of utterances , the same remarks apply .
Third , the representations should be amenable to efficient computer processing 
.
Which class of representation systems do we consider in our labeling ?
a text -RRB- can be segmented in at least two different ways into turns -LRB- 
resp .
The second case nmy never occur in representations where all attributes are 
present in each decoration .
Theory and practice of ambiguity labeling with a view to interactive 
disambiguation in text and speech MT
V is an ambiguity scope of an ambiguity if it is minimal relative to that 
ambiguity .
Bracketed numbers are optional and correspond to the turns or paragraphs as 
presented in the original .
We have experimented our technique on various kinds of dialogues and on some 
texts in several languages .
Consider the utterance : -LRB- i -RRB- Do you know where the international 
telephone services are located ?
In the case of an anaphoric element , Q will presumably correspond to one word 
or term V.
Here is an ambiguity pattern of multiplicity 2 corresponding to the example 
above -LRB- constituent structures -RRB- .
The idea is to label all ambiguity occurrences , but only the ambiguity kernels 
not already labeled .
Their list will be completed in the future as more ambiguity labeling is 
performed .
The linguists may define more types and complete the list of values if 
necessary .
\* what are the possible methods of interactive disambiguation , for each 
ambiguity type ?
If the representations are complex , the difference between two representations 
is defined recursively .
A representation in a formal representation system is proper if it contains no 
exclusive disjunction .
In practice , however , developers prefer to use hybrid data structures to 
represent utterances .
Further refinements can be made only with respect to the intended 
interpretation of the representations .
For each paragraph or turn , we then label the ambiguities of each possible 
utterance .
The interpretation of `` 1 am to '' -LRB- obligation or future -RRB- is 
solvable reliably only by the speaker .
A `` computable '' representation system is a representation system for which a 
`` reasonable '' parser can be developed .
But if we use f-structures with disjunctions , U will always have one -LRB- or 
zero ! -RRB- associated structure S.
It is useful to study vatious properties of these ambiguities in the view of 
subsequent total or partial interactive disambiguation .
We found many examples of such ambiguities in ATR 's transcriptions of Wizard 
of Oz interpretations dialogues \ -LRB- 101 .
In the first pair -LRB- constituent structures -RRB- , `` international 
telephone services '' is represented by a complete subtree .
In the second pair -LRB- dependency structures -RRB- , the representing 
subtrees are not complete subtrees of the whole tree .
In the third part , we propose a format for ambiguity labeling , and illustrate 
it examples from a transcribed dialogue .
We have proposed a technique for labeling ambiguities in texts and in dialogue 
transcriptions , and experimented it on multilingual data .
50 ~ 60 % overall viterbi constitency corresponds then to 65 ~ 75 % individual 
success rate , which is optimistic .
It is indeed frequent that an ambiguity relative to a fragment appears , 
disappears and reappears as one broadens its context .
In a data base , it suffices to store only the kernels , and references to the 
kernels from the utterances .
The mark TUm ~ -LRB- or PARAG for a text -RRB- must be used if there is more 
than one utterance .
Then , we would like to say that S is ambiguous if it contains at least one 
disjunction .
It should be done at a less specific level , suitable for generating 
disambiguation dialogues understandable by non-specialists .
In the case above , for example , we might have the configurations given in the 
figure below .
As the usual notion of ambiguity is too vague for our purpose , it is necessary 
to refine it .
V is a fragment of U , usually , but not necessarily connex , the scope of the 
ambiguity .
-LRB- tense -LCB- pres past \ -RRB- -RRB- ... -RRB- \ -LRB- `` books '' -LRB- 
-LRB- lex `` book-N '' -RRB- -LRB- cat noun -RRB- ... -RRB- \ -RRB- \ -RRB- 
there would be 2 proper representations , one with -LRB- tense pres -RRB- , and 
the other with -LRB- tense past -RRB- .
a paragraph -RRB- can be segmented in at least two different ways into 
utterances , or an utterance can be analyzed in at least two different ways , 
whereby the analysis is performed in view of translation into one or several _ 
l % ngugges inthe context o ~ i a certifin generic task .
Further extralinguistic and sure disambiguation may be performed -LRB- 1 -RRB- 
by an expert system , if the task is constrained enough ; -LRB- 2 -RRB- by the 
users -LRB- author or speakers -RRB- , through interactive disambiguation ; and 
-LRB- 3 -RRB- by a -LRB- human -RRB- expert translator or interpreter , 
accessible through the network .
The first case often happens in the case of anaphoras : -LRB- ref ? -RRB- , or 
in the case where some information has not been exactly computed , e.g. -LRB- 
taskdomain ? -RRB- , \ -LRB- decade of month ? -RRB- , but is necessary for 
translating in at least one of tile target languages .
I An ambiguity occurrence , or simply ambiguity , A , of multiplicity n -LRB- n 
-RRB- 2 -RRB- relative to a representation system R , may be formally defined 
as : A = -LRB- U , V ,  ,  -RRB- , where m -RRB- n and : U is a complete 
utterance , called the context of the ambiguity .
However , as soon as they are taken out of context , they look again as 
artificial as `` linguist 's examples '' \/ Although many studies on 
ambiguities have been published , the specific goal of studying ambiguities in 
the context of interactive disambiguation in text and speech translation has 
led us to explore new ground and to propose the concept of `` ambiguity 
labeling '' .
It is interesting to compare the intuition of the human labeller with results 
actually produced : most of the time , differences may be attributed to the 
fact that available analyzers do n't yet match our expectations for `` state of 
the art '' analyzers , because they produce spurious , `` parasite '' 
ambiguities , and do n't yet implement all types of sure linguistic constraints 
.
For example , an expert interpreter `` monitoring '' several bilingual 
conversations could solve some ambiguities from his workstation , either 
because the system decides to ask him first , or 1 According to a study by 
Cohen & Oviatt , the combined success rate -LRB- SR -RRB- is bigger than the 
product of the individual success rates by about 10 % in the middle range .
In the case of a classical context-free grammar G , shall we say that a 
representation of U is any tree T associated to U via G , or that it is the set 
of all such trees ?
This is done in the second part , where we define formally the notion of 
ambiguity relative to a representation system , as well as associated concepts 
such as kernel , scope , occurrence and type of ambiguity .
However , the same spoken language analyzers may be able to produce sets of 
outputs containing the correct analysis in about 90 % of the cases -LRB- `` 
structural consistency '' \ -LRB- 2 \ -RRB- -RRB- 2 .
For example , attachment ambiguities are represented differently in the outputs 
of various analyzers , but it is always possible to recognize such an ambiguity 
, and to explain it by using a `` skeleton '' flat bracketing .
For instance , `` Please state your phone number '' shoukl not be deemed 
ambiguous , as no complete analysis should allow `` state '' to be a noun , or 
`` phone '' to be a verb .
This means that any strictly smaller fragment W of U has strictly less than n 
associated sub-representations or , equivalently , that at least two of the 
representations of V are be \ -RRB- equal with respect to W.
We suppose an architecture flexible enough to allow the above three 
extralinguistic processes to be optional , and , in the case of interactive 
disambiguation , to allow users to control the amount of questions asked by the 
system .
For lack of space , we can not give here the context-free grammar which defines 
our labeling formally , and illustrate the underlying principles by way of 
examples from a dialogue transcription taken from \ -LRB- 1 \ -RRB- .
In that case , it is important that the questions asked from the users are the 
most crucial ones , so that failure of the last step to select the correct 
interpretation does not result in too damaging translation errors .
Ambiguity labeling may also be considered as part of the specification of 
present and future state of the art analyzers , which means that : it should be 
compatible with the representation systems used by the actual or intended 
analyzers .
Finally , our labeling should only be concerned with the final result of 
analysis , not in any intermediate stage , because we want to retain only 
ambiguities which would remain unsolved after the complete automatic analysis 
process has been performed .
We take it for granted that , for each considered representation system , we 
know how to define , R ~ r each fragment V of an utterance U having a proper 
representation P , tile part of P which represents V.
P4 . Vocabulary learning means learning the words and their limitations , 
probability of occurrences , and syntactic behavior around them , Swartz & 
Yazdani -LRB- 1992 -RRB- .
Answers 7 and 10 are examples of bypassing strategies i.e. ; the use of a 
different verb or another sentence structure as a means for avoiding relative 
clauses .
Children , who now go to french schools , often switch back to English for 
their leisure activities because of the scarcity of options open to them .
When the children were asked about the main subject in the picture , the 
answers were acceptable in standard French , showing that they had no problems 
in using relative clauses with qui .
Additionally , we are studying the state of the art of systems using Artificial 
Intelligence techniques as well as NLP resources and\/or methodologies for 
teaching language , especially for bilingual and minority groups .
Another novelty is in the pedagogical approach of exposing the learner to the 
expert model and to the learner model in a comparative manner , thus helping to 
clarify the sources of error .
The following examples give a brief survey of the use of indirect object 
relative clauses : avec lequel \/ laquelle , sur lequel \/ laquelle , ~ qui , 
and dont : 11 .
However , unexpected `` pop-up '' activities would come up on the screen from 
time to time -LRB- style '' Tip of the day '' or `` TV ad . '' -RRB- .
At around the same period researchers were starting to put also some emphasis 
on the teaching strategies adopted in the system such as in WEST , Burton & 
Brown -LRB- 1976 -RRB- .
In such a case one can not help but to think about the advantages that 
technology can offer , especially in an era where language resources are ready 
for the pick .
In the following sections we summarize an empirical study that helped us To our 
knowledge , there are no systems that use machine translation tools for 
generating two versions of the same language instead of multilingual generation 
.
The syntactic graph and the lexicon are annotated with probabilities on usually 
faulty expressions in order to intensify the explanation or the number of 
examples and exercises on those particular parts -LRB- principles P3 and P4 
-RRB- .
In an effort to gain some insight into the projected linguistic model , an 
empirical study on the population of elementary students in the City of Moncton 
, New Brunswick , Canada was completed 1 .
Otherwise they use a bypassing strategy by separating the sentence into two 
parts as in `` C'est une branche et un oiseau '' , or by using another verb 
that allows qui as in 18 .
This paper presents a project that investigates to what extent computational 
linguistic methods and tools used at GETA for machine translation can be used 
to implement novel functionalities in intelligent computer assisted language 
learning .
double generation of Acadian French and Standard french .
null - SYGMOR for the morphological generation sub-agent .
How to implement parsers that can process ungrammatical input ?
Using Language Resources in an Intelligent Tutoring System for french
The lexicon can be augmented similarly .
- ATEF for the morphological analysis subagent .
-LRB- Bypassing strategy -RRB- 2.2 Object relative clauses
How to represent the linguistic knowledge in the expert and learner models ?
The user chooses where to start by clicking on a hot-button picture .
3 . Puzzle playing where words have assigned shapes according to their 
functions .
Answer 8 shows a common use of the preposition h instead of de .
Figure 2 : Annotated tree for a sentence in standard french .
Fitting the puzzle means placing the words in the correct order .
How to implement teaching strategies that are appropriate for language learning 
?
Our intelligent tutoring system project is still in its early phases .
We do not intend to build a fully free learning environment .
In this case only the formalisms at GETA would be exploited not the existing 
linguistic data .
The corpus consisted of the sentences collected during the empirical study 
-LRB- see section 2 -RRB- .
This allows the teacher to take responsibility of the degree of unstructured or 
of focused learning .
-LRB- Use of an English verb -RRB- 4 . C'est une fiile qui botte le ballon .
In brief , this setting of language learning is not that of a typical native 
speaker .
However , language learning had its own specific difficulties that were not 
generalized in other ITS systems .
Answer 9 is also representative of the frequent use of prepositions at the end 
of the sentence .
For many years GETA has been working on MT systems from and into french .
-LRB- Use of an inappropriate verb -RRB- 5 . C'est un papa etson garqon .
Many of these children use English syntax as well as borrowed vocabulary quite 
frequently .
Tha same grammar can be used by incrementing its rules to include 
new\/different sentence structures .
In the next sections , we will examine the children 's answers concerning 
relative clauses .
Introduction The project that we have started is intended for the minority 
French speaking Acadian community living in Atlantic Canada .
We have also seen , through an empirical study , the kinds of linguistic 
difficulties that a minority group is encountering .
Recent systems show how researchers are being more open to psycho linguistic , 
pedagogical and applied linguistic theories .
Conclusion We have presented in this paper an ongoing software development 
project that is still in its early phases .
Another alternative would be to consider the non-standard french as a 
completely new language from all points of view .
2 . Corpuses or extracts from children stories are equipped with hyperlinks to 
word meanings or grammar usage explanations .
By looking at these examples , it is evident that complex relative clauses are 
rather unknown to the children .
They show that the easiest particles for them are qui and que even when misused 
as in answer 12 .
fs -LRB- gov -RRB- cat -LRB- d ~ ~ -RRB- fs -LRB- des -RRB- cat -LRB- n -RRB- 
fs -LRB- gov -RRB- cat v ~ . ~ , -LRB- ~ , ~ fs -LRB- gov -RRB- ~ fs -LRB- reg 
-RRB- -RRB- cat -LRB- s -RRB- Figure \ -RRB- : Annotated tree for a sentence in 
non-standard french .
For example , the teacher could favor certain activities such as presenting 
examples of `` non standard French sentences '' and opposing them to English 
structures in a effort to show the children some Anglicisms ; or maybe choose a 
specific microworld , such as Holloween or Christmas so that the exercises 
would be closer to children 's real daily experience -LRB- principle P1 -RRB- .
An impressive core of linguistic knowledge is available but has not yet been 
experimented on in building language learning software , though work is 
underway for integration of heterogeneous nlp components , Boitet & Seligman 
-LRB- 1994 -RRB- .
Among the first milestones in Intelligent Tutoring Systems -LRB- ITS -RRB- was 
Carbonell 's system -LRB- 1970 -RRB- that used a knowledge-base to check the 
student 's answers and to allow him\/her to interact in `` natural language '' .
Ariane for example , uses special purpose rule-writing formalisms for each of 
its morphological and lexical modules both for analysis and for generation , 
with a strict separation of algorithmic and linguistic knowledge , Hutchins & 
Somers -LRB- 1992 -RRB- .
Following are some of the answers with the most frequent errors or bypassing 
strategies , they are marked with a \* ; the sentences with italics are the 
acceptable ones : 6 . C'est le livre que le garcon lit .
We begin our presentation with a literature review of related work in 
Intelligent Tutoring Systems -LRB- ITS -RRB- particularly on Computer Assisted 
Language Learning -LRB- CALL and Intelligent CALL -RRB- followed by the 
principles that this community is now expecting from system builders .
- EXPANSF for lexical expansion - TRANSF for translation into standard French 
C. ROBRA in its multi-level analysis - for syntactic tree definitions and The 
first series of experiments we realized using GETA 's resources concentrate on 
double analysis\/generation of standard French and non-standard local french .
Then , in the last section we propose the system 's general architecture and an 
overview some of its activities ; particularly those that counteract Anglicisms 
by double generating examples in standard French and in the local dialect using 
linguistic resources usually used in machine translation .
In the introduction and in the first sections , we have argued for the positive 
effects of computers on language learning and then on some of the issues that 
researchers in the field are hoping to see implemented from a computational and 
a pedagogical point of view .
It 's with such works and many others later , that Intelligent Tutoring Systems 
' architecture was more or less separated into four modules : an expert 's 
model , a learner 's model , a teacher 's model , and an interface , Wengers 
-LRB- 1987 -RRB- .
For example , The ICICLE Project is based on L2 learning theory -LRB- McCoy et 
al. , 1996 -RRB- ; Alexia -LRB- Selva et al. , 1997 -RRB- and FLUENT -LRB- 
Hamburger and Hashim , 1992 -RRB- are based on constructivism , Mr. Collins 
-LRB- Bull et al. , 1995 -RRB- is based on four empirical studies in an effort 
to `` discover '' student errors and their learning strategies .
Another tendency , that is very noticeably parallel to that of NLP , is the 
development of sophisticated language resources such as dictionaries for 
language -LRB- lexical -RRB- learning as exemplified by CELINE at Grenoble 
-LRB- Men6zo et al. , 1996 -RRB- , the SAFRAN project -LRB- 1997 -RRB- and The 
Reader at Princeton University -LRB- 1997 -RRB- which uses wordnet , or real 
corpuses as in the European project Camille -LRB- Ingraham et al. , 1994 -RRB- .
define the learner model .
1 Artificial intelligence
Language learning and
2.1 Subject relative clauses
2.3 Complex relative clauses
learner Model
There is a transition from X to F labeled with a and weight a iff Xa -- ~ a is 
a rule of the grammar .
Both the grammar compilation algorithms -LRB- GRM library -RRB- and our 
automata optimization tools -LRB- FSM library -RRB- work in the most general 
case .
If dynamic grammars and lazy expansion are not needed , we can expand the 
result fully and then apply weighted determinization and minimization 
algorithms .
In many of those applications , the actual languages described are regular , 
but context-free representations are much more concise and easier to create .
We thank Bruce Buntschuh and Ted Roycraft for their help with defining the 
dynamic grammar features and for their comments on this work .
the benefits of this representation , we compared the compilation time and the 
size of the resulting lazy automata with and without preoptimization .
A dynamic substitution consists of the application of the substitution a to ~ , 
during the process of recognition of a word sequence .
minals , and the rules involving them , are available for use in derivations ; 
they are just not available as start symbols .
For example , Figure 3 shows the weighted automaton for grammar G2 consisting 
of the last three rules of G1 with start symbol X.
More precisely , define a dependency graph Dc for G 's nonterminals and examine 
the set of its strongly-connected components -LRB- SCCs -RRB- .
If each of these subgrammars is either left-linear or rightlinear , we shall 
see that compilation into a single finite automaton is possible .
Thus , M -LRB- X -RRB- can always be defined in constant time and space by 
editing the automaton K -LRB- S -RRB- .
This operation does not require any recompilation , since it does not affect 
the automaton M -LRB- X -RRB- built for each nonterminal X.
We describe an efficient algorithm for compiling into weighted finite automata 
an interesting class of weighted context-free grammars that represent regular 
languages .
In particular , speech understanding applications require appropriate grammars 
both to constrain speech recognition and to help extract the meaning of 
utterances .
This replacement is also done on demand , with only the necessary part of aa 
being expanded for a given input string .
Rule X -LRB- ~ - + Y1 -- . Y ~ has a corresponding path that maps X to the 
sequence I\/1 ... Y ~ with weight ~ .
The bigram examples also show the advantages of lazy replacement and editing 
over the full expansion used in previous work -LRB- Pereira and Wright , 1997 
-RRB- .
Dynamic activation or deactivation of rules 2 We augment the grammar with a set 
of active nonterminals , which are those available as start symbols for 
derivations .
The replacement operation is lazy , that is , the states and transitions of the 
replacing automata are only expanded when needed for a given input string .
For example , Figure 4 shows the dependency graph for our example grammar G1 , 
with SCCs -LCB- Z -RCB- and -LRB- X , Y -RCB- .
The GRM Library also includes an efficient compilation too \ -RRB- for weighted 
context-dependent rewrite rules -LRB- Mohri and Sproat , 1996 -RRB- that is 
used in textto-speech projects at Lucent Bell Laboratories .
It can be used to compile effi - null ciently an interesting class of grammars 
representing weighted regular languages and allows for dynamic modifications 
that are crucial in many speech recognition applications .
For example , compilation is about 700 times faster in the optimized case for a 
fully expanded automaton even for a 40-word vocabulary model , and the result 
about 39 times smaller .
We did experiments with full bigram models with various vocabulary sizes , and 
with two unweighted grammars derived by feature instantiation from hand-built 
feature-based grammars -LRB- Pereira and Wright , 1997 -RRB- .
Grammar compilation takes as input a weighted CFG represented as a weighted 
transducer -LRB- Salomaa and Soittola , 1978 -RRB- , which may have been 
optimized prior to compilation -LRB- preoptimized -RRB- .
