Good luck with quoted speech that contains more than one sentence. E.g.

William Faulkner said, “Never be afraid to raise your voice for honesty and 
truth and compassion against injustice and lying and greed. If people all over 
the world...would do this, it would change the earth.”
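
A minimal Python sketch of the problem, assuming the full-stop+space rule from the quoted message is implemented as a single regular expression (the pattern is my own reading of that rule, not code from the original post):

```python
import re

quote = (
    "William Faulkner said, \u201cNever be afraid to raise your voice for honesty and "
    "truth and compassion against injustice and lying and greed. If people all over "
    "the world...would do this, it would change the earth.\u201d"
)

# Naive rule: a sentence ends at a full stop followed by whitespace.
pieces = re.split(r"(?<=\.)\s+", quote)

for piece in pieces:
    print(repr(piece))
```

The rule cuts the quotation after "greed." and so separates the attribution and the opening quote mark from the closing one.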

-----Original Message-----
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On 
Behalf Of R Smith
Sent: Monday, 06 August 2018 16:20
To: sqlite-users@mailinglists.sqlite.org
Subject: Re: [sqlite] [EXTERNAL] Save text file content in db: lines or whole 
file?

On 2018/08/06 12:00 PM, R Smith wrote:
>
>> I need to save text files (let say between 1 KB to 20 MB) in a SQLite
>> DB.
>>
> Why not do both?
>
> If it was me, I would write some code to split the text into sentences
> (not lines - which is rather easy in English, but might be harder in
> some other languages).
//...

I've received two off-line questions as to how I could parse text into 
sentences in "English" even, and thought I would reply here since it might 
clear up the confusion for others too.

Those questions suggested that the authors imagined me possessing some 
fancy AI that parses the language into notional sentences 
(Subject+Predicate) or such, but I fear my meaning was much more 
arbitrary, based on the common syntax of written English. As William 
Faulkner wrote in a letter to Malcolm Cowley:

*"I am trying to say it all in one sentence, between one Cap and one
period."*


Think of paragraphs in English as large records delimited by 2 or more
Line-break characters (#13+#10, or perhaps only #10 if on a *nix
platform) between texts.

Each paragraph record could be comprised of one or more sentences (in
English) as records delimited by a full-stop+Space or
full-stop+linebreak, or even simply the paragraph end.

By these simple rules, the following can easily be parsed into 1 paragraph
with 2 sentences and a second paragraph with 1 sentence (lines here used
as formatting only, actual line-breaks indicated with "<-" marker):
<-
The quick brown fox jumps over the
lazy dog.  My grandma said to your
grandma, I'm gonna set your flag
on fire.<-
<-
Next paragraph here...<-
<-
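
The two rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_paragraphs` and `split_sentences` are hypothetical, and line breaks inside a paragraph are replaced with spaces before splitting.

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs on two or more consecutive line breaks."""
    # Normalise CR+LF to LF first so one rule covers both platforms.
    text = text.replace("\r\n", "\n")
    return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]

def split_sentences(paragraph):
    """Split on full stop + whitespace; the paragraph end closes the last sentence."""
    # Line breaks inside the paragraph become single spaces.
    flat = re.sub(r"\s+", " ", paragraph).strip()
    return [s for s in re.split(r"(?<=\.)\s+", flat) if s]

# The example above, with real line breaks only where the "<-" markers are.
sample = (
    "\n"
    "The quick brown fox jumps over the lazy dog.  My grandma said to your "
    "grandma, I'm gonna set your flag on fire.\n"
    "\n"
    "Next paragraph here...\n"
    "\n"
)

result = [split_sentences(p) for p in split_paragraphs(sample)]
for par_no, sentences in enumerate(result, start=1):
    for sentence in sentences:
        print(par_no, sentence)
```

This gives one paragraph with two sentences and a second paragraph with one sentence. The trailing "..." does not split anything because no whitespace follows it inside the paragraph.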

Now a more difficult paragraph would be the following, all of which
would translate into 1 single sentence if only the above rules are
catered for:
<-
I have three wishes:<-
   - to be outlived by my children<-
   - to fly in space once before I die<-
   - to see Halley's comet once more<-
<-

That will be a single-sentence paragraph.  It's up to the
end-implementation to gauge whether that would be a sufficient split or
not.

To put this into a DB, I would strip out the line-breaks inside
sentences (perhaps not strip out, but replace with space characters,
much like HTML does) to make them more easily handled as "lines". The
final DB table might then look like this:

ID | fileID | parNo | parLineNo | docLineNo | txtLine
 1 |   1    |   1   |     1     |     1     | The quick brown fox jumps over the lazy dog.
 2 |   1    |   1   |     2     |     2     | My grandma said to your grandma, I'm gonna set your flag on fire.
 3 |   1    |   2   |     1     |     3     | Next paragraph here...
 4 |   1    |   3   |     1     |     4     | I have three wishes: - to be outlived by my children - to fly in space once before I die - to see Halley's comet once more
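
As a rough sketch, such a table could be filled from Python's built-in sqlite3 module. The table name `doc_lines`, the helper `rows_for` and the column types are assumptions for illustration; only the column names come from the layout above.

```python
import re
import sqlite3

def rows_for(file_id, text):
    """Yield (fileID, parNo, parLineNo, docLineNo, txtLine) for one text."""
    text = text.replace("\r\n", "\n")
    doc_line = 0
    paragraphs = [p for p in re.split(r"\n{2,}", text) if p.strip()]
    for par_no, par in enumerate(paragraphs, start=1):
        # Replace line breaks inside the paragraph with spaces, as HTML does.
        flat = re.sub(r"\s+", " ", par).strip()
        sentences = re.split(r"(?<=\.)\s+", flat)
        for par_line, sentence in enumerate(sentences, start=1):
            doc_line += 1
            yield (file_id, par_no, par_line, doc_line, sentence)

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE doc_lines (
        ID        INTEGER PRIMARY KEY,
        fileID    INTEGER NOT NULL,
        parNo     INTEGER NOT NULL,
        parLineNo INTEGER NOT NULL,
        docLineNo INTEGER NOT NULL,
        txtLine   TEXT    NOT NULL
    )
""")

text = (
    "The quick brown fox jumps over the lazy dog.  My grandma said to your "
    "grandma, I'm gonna set your flag on fire.\n"
    "\n"
    "Next paragraph here...\n"
    "\n"
    "I have three wishes:\n"
    "   - to be outlived by my children\n"
    "   - to fly in space once before I die\n"
    "   - to see Halley's comet once more\n"
)

con.executemany(
    "INSERT INTO doc_lines (fileID, parNo, parLineNo, docLineNo, txtLine) "
    "VALUES (?, ?, ?, ?, ?)",
    rows_for(1, text),
)

rows = con.execute(
    "SELECT ID, fileID, parNo, parLineNo, docLineNo, txtLine "
    "FROM doc_lines ORDER BY docLineNo"
).fetchall()
for row in rows:
    print(row)
```

Keeping both parLineNo and docLineNo is redundant, but it lets a query address a sentence either within its paragraph or within the whole document without a window function.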

So yes, not a perfect walk-in-the-park, but easy to do for basic text
parsing.
Stating the obvious: If the intent is to re-construct the file 100%
exact (so that it yields the same hash) then you cannot strip out
line-breaks and you need to carefully include each and every character,
byte-for-byte, used to split paragraphs and the like. It all depends on
the implementation requirements.
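
One way to keep every byte, sketched here under my own assumptions (the delimiter pattern and the idea of storing each delimiter next to its sentence are illustrative, not a prescribed schema), is to split with a capturing group so that the delimiters are returned too:

```python
import hashlib
import re

original = (
    b"The quick brown fox jumps over the\r\n"
    b"lazy dog.  My grandma said to your\r\n"
    b"grandma, I'm gonna set your flag\r\n"
    b"on fire.\r\n"
    b"\r\n"
    b"Next paragraph here...\r\n"
)

# The capturing group makes re.split return the delimiters as well:
# [text, delimiter, text, delimiter, ..., text]
parts = re.split(rb"((?<=\.)\s+|(?:\r?\n){2,})", original)
texts = parts[0::2]
delims = parts[1::2] + [b""]  # the final text has no delimiter after it

# Each row pairs a sentence with the exact bytes that followed it.
rows = [(t, d) for t, d in zip(texts, delims) if t or d]

rebuilt = b"".join(t + d for t, d in rows)
same_hash = (
    hashlib.sha256(rebuilt).hexdigest() == hashlib.sha256(original).hexdigest()
)
print(len(rows), same_hash)
```

The line breaks inside a sentence stay in the stored text here, so nothing is normalised away; a space-joined version for display would have to be derived at query time.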

The above text format should hold for 99.9% of English literature text
that can be had in text files (i.e. no images, tables, etc.). Not so
easy for scientific papers, research material, movie scripts and a few
others.

Sorry for not presenting that great AI solution.  :)
Ryan

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___________________________________________
 Gunter Hick | Software Engineer | Scientific Games International GmbH | 
Klitschgasse 2-4, A-1130 Vienna | FN 157284 a, HG Wien, DVR: 0430013 | (O) +43 
1 80100 - 0

May be privileged. May be confidential. Please delete if not the addressee.