Re: UTF-8 support

2018-05-06 Thread Martin Blais
See here for the list of accepted uppercase characters:
https://www.fileformat.info/info/unicode/category/Lu/list.htm

They're unicode category "Letter, Uppercase" (Lu).
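
To check by hand whether a given character falls in that category, Python's
standard unicodedata module can report it. This is only a minimal sketch; the
helper name is_uppercase_letter is made up for illustration and is not
something Beancount provides.

```python
import unicodedata


def is_uppercase_letter(ch: str) -> bool:
    """Return True if ch is in Unicode category Lu (Letter, Uppercase)."""
    return unicodedata.category(ch) == "Lu"


# "\u00c9" is É and "\u00e9" is é, written as escapes to avoid any
# ambiguity between precomposed and decomposed forms.
for ch in ["E", "\u00c9", "\u00e9", "1"]:
    print(repr(ch), unicodedata.category(ch), is_uppercase_letter(ch))
```

Only the first two characters in the loop are reported as Lu; the lowercase
letter is Ll and the digit is Nd.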


On Sun, May 6, 2018 at 10:16 PM, Martin Blais  wrote:

> On Sun, May 6, 2018 at 7:52 PM, Martin Michlmayr  wrote:
>
>> * Martin Blais  [2018-05-03 21:56]:
>> > I just merged PR14 from Adrián Medraño Calvo which adds support for
>> > UTF-8 in the lexer.  This is a significant change many people have
>> > been asking about for a long time.  Thanks to Adrian for working
>> > with me very patiently on a time-consuming patch!
>>
>> Thanks for merging this.  While I don't need it myself, I know a lot
>> of users need this.
>>
>> However, it doesn't seem to work as expected.  Adrián's patch said:
>>
>> "The capitalization requirements have been dropped, as they make no
>> sense in many alphabets"
>>
>> Looking at Adrián's patch at
>> https://bitbucket.org/blais/beancount/commits/1b29c8d9efb0caf62baa387a27f22c907cb38c23
>> and the merge at
>> https://bitbucket.org/blais/beancount/commits/e416767e0a3738ab2e23b477b731c2460d891064
>> I'm also quite confused how this got merged.
>>
>
> Adrian did two versions: a first, simple one where the character
> categories are checked loosely by the lexer but further verified by a
> regexp in Python, and a complex one where they are baked in the lexer. The
> complex one didn't work in my testing and increased the size of the state
> tables considerably, while the simpler one didn't slow down parsing time,
> so I merged the first, simpler one. However, I first merged in the entire
> history of his work so that it's there for keeps in the repo history, in
> case it's ever needed to reimplement the lexer-based one. Then I rolled
> back to the simpler version, and amended a few things.
>
>
>> For example, Adrián's patch removes the definitions for ASCII,
>> UTF-8-1, UTF-8-2, etc from beancount/parser/lexer.l  and uses:
>>
>> ACCOUNTNAME {UTF-8-L}({UTF-8-L}|{UTF-8-N})+
>> SUBACCOUNTNAME  ({UTF-8-L}|{UTF-8-N})+
>>
>> But in the final merge this change never happens and the result is
>> quite different.
>>
>> Based on looking at both diffs, I came up with the attached patch but
>> that doesn't build.
>>
>> Anyway, I see the following errors I wasn't expecting:
>>
>> /home/tbm/scratch/cvs/tbm/ledger2beancount/t.bean:2:   ValueError:
>> Invalid account name: Expenses:école
>>
>> /home/tbm/scratch/cvs/tbm/ledger2beancount/t.bean:6:   Invalid
>> token: 'Assets:test'
>>
>> /home/tbm/scratch/cvs/tbm/ledger2beancount/t.bean:10:  Invalid
>> token: 'purchase'
>>
>> /home/tbm/scratch/cvs/tbm/ledger2beancount/t.bean:11:  Invalid
>> token: 'test'
>>
>> Test case:
>>
>> 2018-03-26 * "Lower case should work now"
>>   Expenses:école 10.00 EUR
>>
>
> This is expected, should not work.
> The idea is to preserve the current semantics and require the first
> character to be an uppercase letter (even with an accent). French does
> include uppercase accented letters, so "Assets:École" works.
>
>
>
>>   Equity:Opening-Balance  -10.00 EUR
>>
>> 2018-03-27 * "Lower case should work now"
>>   Assets:test  10.00 EUR
>>
>
> Nope. Not allowed. Just as before.
>
>
>
>>   Assets:Commodity-Test123
>>
>> 2018-03-28 * "all lower case account names"
>>   expenses:purchase  10.00 EUR
>>   assets:test
>>
>
> Also not allowed.
>
> Let me know if I'm still missing something,
>
>
>
>
> --
>> Martin Michlmayr
>> http://www.cyrius.com/
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Beancount" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to beancount+unsubscr...@googlegroups.com.
>> To post to this group, send email to beancount@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/beancount/20180506235258.qs47yhvxlre6ycm5%40jirafa.cyrius.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Beancount" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to beancount+unsubscr...@googlegroups.com.
To post to this group, send email to beancount@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/beancount/CAK21%2BhPG%3DUS_1w3_AX%2BGYCT2nh%3DVxaAQEcDbSu7ZmtGRyxCDjg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: UTF-8 support

2018-05-06 Thread Martin Blais
On Sun, May 6, 2018 at 7:52 PM, Martin Michlmayr  wrote:

> * Martin Blais  [2018-05-03 21:56]:
> > I just merged PR14 from Adrián Medraño Calvo which adds support for
> > UTF-8 in the lexer.  This is a significant change many people have
> > been asking about for a long time.  Thanks to Adrian for working
> > with me very patiently on a time-consuming patch!
>
> Thanks for merging this.  While I don't need it myself, I know a lot
> of users need this.
>
> However, it doesn't seem to work as expected.  Adrián's patch said:
>
> "The capitalization requirements have been dropped, as they make no
> sense in many alphabets"
>
> Looking at Adrián's patch at
> https://bitbucket.org/blais/beancount/commits/1b29c8d9efb0caf62baa387a27f22c907cb38c23
> and the merge at
> https://bitbucket.org/blais/beancount/commits/e416767e0a3738ab2e23b477b731c2460d891064
> I'm also quite confused how this got merged.
>

Adrian did two versions: a first, simple one where the character categories
are checked loosely by the lexer but further verified by a regexp in
Python, and a complex one where they are baked in the lexer. The complex
one didn't work in my testing and increased the size of the state tables
considerably, while the simpler one didn't slow down parsing time, so I
merged the first, simpler one. However, I first merged in the entire
history of his work so that it's there for keeps in the repo history, in
case it's ever needed to reimplement the lexer-based one. Then I rolled
back to the simpler version, and amended a few things.
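
To illustrate the split between a loose lexer and a stricter check in Python,
here is a rough sketch of the two-stage idea. It is not the merged code: the
names LOOSE_ACCOUNT, is_valid_component and is_valid_account are invented for
the example, the loose pattern is only a stand-in for the lexer rule, and the
strict rules are inferred from the Lu / L / Nd comments in
beancount/core/account.py.

```python
import re
import unicodedata

# Stage 1 (stand-in for the lexer): accept any colon-separated run of
# non-space characters without looking at character categories.
LOOSE_ACCOUNT = re.compile(r"[^\s:]+(?::[^\s:]+)+")


def is_valid_component(comp: str, allow_leading_digit: bool) -> bool:
    """Stage 2: strict per-character check on Unicode categories."""
    first_ok = {"Lu", "Nd"} if allow_leading_digit else {"Lu"}
    if not comp or unicodedata.category(comp[0]) not in first_ok:
        return False
    for ch in comp[1:]:
        cat = unicodedata.category(ch)
        # Any letter (L*), a decimal digit (Nd) or a dash may follow.
        if not (cat.startswith("L") or cat == "Nd" or ch == "-"):
            return False
    return True


def is_valid_account(name: str) -> bool:
    """Run the loose match first, then the strict check on each component."""
    if not LOOSE_ACCOUNT.fullmatch(name):
        return False
    root, *rest = name.split(":")
    return is_valid_component(root, False) and all(
        is_valid_component(comp, True) for comp in rest
    )
```

With these assumed rules, "Assets:École" passes both stages, while
"Expenses:école" gets through the loose match and is rejected by the strict
check.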


> For example, Adrián's patch removes the definitions for ASCII,
> UTF-8-1, UTF-8-2, etc from beancount/parser/lexer.l  and uses:
>
> ACCOUNTNAME {UTF-8-L}({UTF-8-L}|{UTF-8-N})+
> SUBACCOUNTNAME  ({UTF-8-L}|{UTF-8-N})+
>
> But in the final merge this change never happens and the result is
> quite different.
>
> Based on looking at both diffs, I came up with the attached patch but
> that doesn't build.
>
> Anyway, I see the following errors I wasn't expecting:
>
> /home/tbm/scratch/cvs/tbm/ledger2beancount/t.bean:2:   ValueError:
> Invalid account name: Expenses:école
>
> /home/tbm/scratch/cvs/tbm/ledger2beancount/t.bean:6:   Invalid token:
> 'Assets:test'
>
> /home/tbm/scratch/cvs/tbm/ledger2beancount/t.bean:10:  Invalid token:
> 'purchase'
>
> /home/tbm/scratch/cvs/tbm/ledger2beancount/t.bean:11:  Invalid token:
> 'test'
>
> Test case:
>
> 2018-03-26 * "Lower case should work now"
>   Expenses:école 10.00 EUR
>

This is expected, should not work.
The idea is to preserve the current semantics and require the first
character to be an uppercase letter (even with an accent). French does
include uppercase accented letters, so "Assets:École" works.



>   Equity:Opening-Balance  -10.00 EUR
>
> 2018-03-27 * "Lower case should work now"
>   Assets:test  10.00 EUR
>

Nope. Not allowed. Just as before.



>   Assets:Commodity-Test123
>
> 2018-03-28 * "all lower case account names"
>   expenses:purchase  10.00 EUR
>   assets:test
>

Also not allowed.

Let me know if I'm still missing something,




-- 
> Martin Michlmayr
> http://www.cyrius.com/
>


Re: UTF-8 support

2018-05-06 Thread Martin Michlmayr
* Martin Blais  [2018-05-03 21:56]:
> I just merged PR14 from Adrián Medraño Calvo which adds support for
> UTF-8 in the lexer.  This is a significant change many people have
> been asking about for a long time.  Thanks to Adrian for working
> with me very patiently on a time-consuming patch!

Thanks for merging this.  While I don't need it myself, I know a lot
of users need this.

However, it doesn't seem to work as expected.  Adrián's patch said:

"The capitalization requirements have been dropped, as they make no
sense in many alphabets"

Looking at Adrián's patch at
https://bitbucket.org/blais/beancount/commits/1b29c8d9efb0caf62baa387a27f22c907cb38c23
and the merge at
https://bitbucket.org/blais/beancount/commits/e416767e0a3738ab2e23b477b731c2460d891064
I'm also quite confused how this got merged.

For example, Adrián's patch removes the definitions for ASCII,
UTF-8-1, UTF-8-2, etc from beancount/parser/lexer.l  and uses:

ACCOUNTNAME {UTF-8-L}({UTF-8-L}|{UTF-8-N})+
SUBACCOUNTNAME  ({UTF-8-L}|{UTF-8-N})+

But in the final merge this change never happens and the result is
quite different.

Based on looking at both diffs, I came up with the attached patch but
that doesn't build.

Anyway, I see the following errors I wasn't expecting:

/home/tbm/scratch/cvs/tbm/ledger2beancount/t.bean:2:   ValueError: Invalid 
account name: Expenses:école

/home/tbm/scratch/cvs/tbm/ledger2beancount/t.bean:6:   Invalid token: 
'Assets:test'

/home/tbm/scratch/cvs/tbm/ledger2beancount/t.bean:10:  Invalid token: 
'purchase'

/home/tbm/scratch/cvs/tbm/ledger2beancount/t.bean:11:  Invalid token: 'test'

Test case:

2018-03-26 * "Lower case should work now"
  Expenses:école 10.00 EUR
  Equity:Opening-Balance  -10.00 EUR

2018-03-27 * "Lower case should work now"
  Assets:test  10.00 EUR
  Assets:Commodity-Test123

2018-03-28 * "all lower case account names"
  expenses:purchase  10.00 EUR
  assets:test

-- 
Martin Michlmayr
http://www.cyrius.com/

diff -r 743bc6d3f432 Makefile
--- a/Makefile	Sun May 06 10:04:16 2018 -0400
+++ b/Makefile	Mon May 07 01:42:46 2018 +0200
@@ -52,7 +52,8 @@
 #$(CROOT)/lexer.c $(CROOT)/lexer.h: $(LEXER_SOURCES) $(CROOT)/grammar.h
 #	$(LEX) --outfile=$(CROOT)/lexer.c --header-file=$(CROOT)/lexer.h $(LEXER_SOURCES)
 #	patch -p1 < $(CROOT)/lexer.patch
-$(CROOT)/lexer.c $(CROOT)/lexer.h: $(CROOT)/lexer.l $(CROOT)/grammar.h
+LEXER_SOURCES = $(UNICODE_CATEGORY_SOURCES) $(CROOT)/lexer.l
+$(CROOT)/lexer.c $(CROOT)/lexer.h: $(LEXER_SOURCES) $(CROOT)/grammar.h
 	$(LEX) --outfile=$(CROOT)/lexer.c --header-file=$(CROOT)/lexer.h $<
 	patch --no-backup-if-mismatch -p1 < $(CROOT)/lexer.patch
 
diff -r 743bc6d3f432 beancount/core/account.py
--- a/beancount/core/account.py	Sun May 06 10:04:16 2018 -0400
+++ b/beancount/core/account.py	Mon May 07 01:42:46 2018 +0200
@@ -21,11 +21,10 @@
 
 # Regular expression string that matches valid account name components.
 # Categories are:
-#   Lu: Uppercase letters.
 #   L: All letters.
 #   Nd: Decimal numbers.
-ACC_COMP_TYPE_RE = regexp_utils.re_replace_unicode(r"[\p{Lu}][\p{L}\p{Nd}\-]*")
-ACC_COMP_NAME_RE = regexp_utils.re_replace_unicode(r"[\p{Lu}\p{Nd}][\p{L}\p{Nd}\-]*")
+ACC_COMP_TYPE_RE = regexp_utils.re_replace_unicode(r"[\p{L}][\p{L}\p{Nd}\-]*")
+ACC_COMP_NAME_RE = regexp_utils.re_replace_unicode(r"[\p{L}\p{Nd}][\p{L}\p{Nd}\-]*")
 
 # Regular expression string that matches a valid account. {5672c7270e1e}
 ACCOUNT_RE = "(?:{})(?:{}{})+".format(ACC_COMP_TYPE_RE, sep, ACC_COMP_NAME_RE)
diff -r 743bc6d3f432 beancount/parser/lexer.l
--- a/beancount/parser/lexer.l	Sun May 06 10:04:16 2018 -0400
+++ b/beancount/parser/lexer.l	Mon May 07 01:42:46 2018 +0200
@@ -111,16 +111,11 @@
 %x STRLIT
 
 
-ASCII   [\x00-\x7f]
-UTF-8-1 [\x80-\xbf]
-UTF-8-2 [\xc2-\xdf]{UTF-8-1}
-UTF-8-3 \xe0[\xa0-\xbf]{UTF-8-1}|[\xe1-\xec]{UTF-8-1}{UTF-8-1}|\xed[\x80-\x9f]{UTF-8-1}|[\xee-\xef]{UTF-8-1}{UTF-8-1}
-UTF-8-4 \xf0[\x90-\xbf]{UTF-8-1}{UTF-8-1}|[\xf1-\xf3]{UTF-8-1}{UTF-8-1}{UTF-8-1}|\xf4[\x80-\x8f]{UTF-8-1}{UTF-8-1}
-UTF-8-ONLY  {UTF-8-2}|{UTF-8-3}|{UTF-8-4}
-UTF-8   {ASCII}|{UTF-8-ONLY}
+UTF-8-L {UTF-8-Lu}|{UTF-8-Ll}|{UTF-8-Lt}|{UTF-8-Lo}
+UTF-8-N {UTF-8-Nd}|{UTF-8-Nl}|{UTF-8-No}
 
-ACCOUNTTYPE ([A-Z]|{UTF-8-ONLY})([A-Za-z0-9\-]|{UTF-8-ONLY})*
-ACCOUNTNAME ([A-Z0-9]|{UTF-8-ONLY})([A-Za-z0-9\-]|{UTF-8-ONLY})*
+ACCOUNTTYPE 

Re: document organization: unifying statements with everything else

2018-05-06 Thread Martin Blais
On Sun, May 6, 2018 at 1:01 PM, Stefano Zacchiroli  wrote:

> On Sun, May 06, 2018 at 12:34:17PM -0400, Martin Blais wrote:
> > It would be possible to not list the documents that are associated with
> > transactions in the journals, if that's what you mean.
> > That's a web UI option.
> [...]
> > Same as I describe above. Imagine a plugin that runs the join I'm proposing
> > and removes matched Document directives from the list and moves them to
> > metadata.
>
> This is the part that wasn't entirely clear to me. I thought that the
> intended meaning of document directives was to stick a document to a
> specific point in time in the transaction journal.


> While the interpretation you're suggesting here is that it just makes
> Beancount aware of the existence of documents, the date is just an
> attribute documents will have (because it's required for all Beancount
> directives); but documents have no special meanings other than what
> plugins / UIs make of them.
>

That's correct.
The Documents directive creates a list of documents.
Documents have a date, and currently are required by the grammar to have an
associated account (though that could be changed).




> This addresses my concern, thank you. I will stop worrying about mixing
> transaction-specific documents and statements.
>

Oh yes, you should definitely not have to worry about that.


> > BTW, I'm happy to support something like this in Beancount itself,
> > that functionality could move out of Fava, it's not web-specific.
>
> That would be helpful and make more users use documents associated to
> transactions, I think. But it is of course up to the Fava people to decide
> if they want to move the plugin over or not.
>
> Cheers
> --
> Stefano Zacchiroli . z...@upsilon.cc . upsilon.cc/zack . . o . . . o . o
> Computer Science Professor . CTO Software Heritage . . . . . o . . . o o
> Former Debian Project Leader & OSI Board Director  . . . o o o . . . o .
> « the first rule of tautology club is the first rule of tautology club »
>


Re: document organization: unifying statements with everything else

2018-05-06 Thread Stefano Zacchiroli
On Sun, May 06, 2018 at 12:34:17PM -0400, Martin Blais wrote:
> It would be possible to not list the documents that are associated with
> transactions in the journals, if that's what you mean.
> That's a web UI option.
[...]
> Same as I describe above. Imagine a plugin that runs the join I'm proposing
> and removes matched Document directives from the list and moves them to
> metadata.

This is the part that wasn't entirely clear to me. I thought that the
intended meaning of document directives was to stick a document to a
specific point in time in the transaction journal.

While the interpretation you're suggesting here is that it just makes
Beancount aware of the existence of documents, the date is just an
attribute documents will have (because it's required for all Beancount
directives); but documents have no special meanings other than what
plugins / UIs make of them.

This addresses my concern, thank you. I will stop worrying about mixing
transaction-specific documents and statements.

> BTW, I'm happy to support something like this in Beancount itself,
> that functionality could move out of Fava, it's not web-specific.

That would be helpful and make more users use documents associated to
transactions, I think. But it is of course up to the Fava people to decide
if they want to move the plugin over or not.

Cheers
-- 
Stefano Zacchiroli . z...@upsilon.cc . upsilon.cc/zack . . o . . . o . o
Computer Science Professor . CTO Software Heritage . . . . . o . . . o o
Former Debian Project Leader & OSI Board Director  . . . o o o . . . o .
« the first rule of tautology club is the first rule of tautology club »



Re: document organization: unifying statements with everything else

2018-05-06 Thread Martin Blais
On Sat, May 5, 2018 at 4:18 AM, Stefano Zacchiroli  wrote:

> On Fri, May 04, 2018 at 09:12:55AM -0400, Martin Blais wrote:
> > > 1) via the "documents" option beancount has great support for bank
> > In relational terms, this is a join of the list of accounts and the
> > documents.
> [...]
> > > 2) but I also want to store other documents and associate them to either
> > >    transactions as a whole or even individual transaction postings.
> > >    Examples are: receipts (for payments, donations, etc.), invoices,
> > >    paychecks, etc.
> > In relational terms, this is a join of the transactions and the documents.
> > You want to obtain an association list of
> >   (transaction, document)
> > Where each document belongs to at most one transaction.
>
> Ack on (1).
>
> On (2) you also need to support associating documents to individual
> postings --- you can do it in various ways in a relational model, I'm
> not sure which one you prefer.
>

I would interpret one of the metadata fields as a substring of the set of
existing documents (the list of which is provided by the full set of
Document directives, or perhaps the union of those associated with the
accounts of the transactions), possibly matching multiple documents.



>
> > > As far as I can tell Beancount itself has no direct support for (2),
> > > please correct me if I'm wrong.
> >
> > It supports neither. The web interface performs the first join implicitly
> > by grouping all the directives by account and then rendering journals for
> > any account.
>
> Well, no. The fact that the "documents" option exists makes Beancount de
> facto support use case (1). You can just drop documents in dirs, and
> they will show up in the transaction flow. I'm wondering whether we can
> have something similar for associating documents to individual
> transactions / postings. I'm not clear on how/if the data model can
> support that.
>

As above, it's a join. The stream of Document directives provides the set
of available documents, and a metadata field on the transaction associates
the set of matching documents with the transactions, providing a new data
structure for the join, or perhaps just updating metadata fields with the
list of documents (*).
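
The join could be sketched in plain Python roughly as follows. This is only an
illustration under assumptions: the metadata key "document", the dictionaries
standing in for transactions, and the name join_documents are all hypothetical,
and no Beancount API is used.

```python
from typing import List, Tuple


def join_documents(
    transactions: List[dict], filenames: List[str], key: str = "document"
) -> Tuple[List[Tuple[dict, str]], List[dict]]:
    """Pair each transaction with every document whose filename contains the
    substring stored under `key`. Transactions whose substring matches no
    document are returned separately as dangling links."""
    pairs, dangling = [], []
    for txn in transactions:
        needle = txn.get("meta", {}).get(key)
        if needle is None:
            continue  # this transaction does not reference a document
        matches = [name for name in filenames if needle in name]
        if matches:
            pairs.extend((txn, name) for name in matches)
        else:
            dangling.append(txn)
    return pairs, dangling


# Hypothetical data: the filenames would come from the Document directives.
filenames = ["/docs/2018-05-01.statement.pdf", "/docs/receipt-123.pdf"]
transactions = [
    {"narration": "Lunch", "meta": {"document": "receipt-123"}},
    {"narration": "Taxi", "meta": {"document": "receipt-999"}},
    {"narration": "Rent", "meta": {}},
]
pairs, dangling = join_documents(transactions, filenames)
```

Here the "Lunch" transaction is paired with the receipt file, and "Taxi" ends
up in the dangling list because nothing matches its substring.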


> What is suboptimal right now is that people are using link_statements,
> making them have document entries appear in the global ledger flow for
> something that is txn/posting-specific.
>

It would be possible to not list the documents that are associated with
transactions in the journals, if that's what you mean.
That's a web UI option.


> > I could add clean APIs to perform either of these joins, given a particular
> > meta-data field.
> > For the transactions/documents join, the match could be partial (e.g.
> > unique substring on the document filenames).
> > Let me know.
> >
> > Also, how do you need to query this?
> > What do you want to produce?
>
> Ideally, I'd like to:
>
> 1) "type" (in the sense of type systems) metadata entries, letting
>    Beancount know that a specific metadata key should point to a
>    document (via a URI, or a path, I don't particularly care).


Same as I describe above. Imagine a plugin that runs the join I'm proposing
and removes matched Document directives from the list and moves them to
metadata.



>    This will already allow a number of nice checks:
>
>    - return an error if the link is dangling
>
>    - query the txn and ask: do you link to any document? <- this will in
>      turn allow fava to render document links associated to txn /
>      postings even if there are no matching document entries in the
>      global ledger flow
>

What do you mean by "query"? The web interface can do whatever it wants.
It could run the suggested plugin and if the metadata field is available
render those "transaction documents" differently.



> 2) understand where to put the actual documents on disk, in a way that
>    doesn't get in the way of the "documents" option. This looks
>    complicated because:
>
>    - on the one hand I want a single dir hierarchy where to put
>      documents, that works for both global documents (that should appear
>      in the global ledger flow) and transaction-specific documents
>
>    - on the other hand I don't want to have document entries generated
>      for transaction-specific documents
>
>    And the two look incompatible.
>

I don't see why not. The purpose of the Document directive is to declare
the existence of a document associated with the ledger. The plugin could
take some of those out of the stream to move them into the transaction's
metadata.
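
As a rough idea of what such a plugin would do, here is a sketch on stand-in
data structures. The namedtuples below are simplified placeholders rather than
Beancount's real directive types, the metadata keys "document" and "documents"
are assumptions, and a real plugin would have to follow Beancount's plugin
interface instead of this bare function.

```python
from collections import namedtuple

# Simplified stand-ins for directives, for illustration only.
Document = namedtuple("Document", "date account filename")
Transaction = namedtuple("Transaction", "date narration meta")


def move_matched_documents(entries, key="document"):
    """Attach matching documents to transactions (under meta["documents"])
    and drop those Document entries from the returned stream."""
    docs = [e for e in entries if isinstance(e, Document)]
    matched = set()
    out = []
    for entry in entries:
        if isinstance(entry, Transaction) and key in entry.meta:
            hits = [d for d in docs if entry.meta[key] in d.filename]
            if hits:
                matched.update(hits)
                meta = dict(entry.meta, documents=[d.filename for d in hits])
                entry = entry._replace(meta=meta)
        out.append(entry)
    # Only Document entries are looked up in the set; transactions carry a
    # dict and are not hashable.
    return [e for e in out if not (isinstance(e, Document) and e in matched)]
```

Documents that no transaction refers to, such as bank statements, stay in the
stream untouched, so both kinds can live in the same directory hierarchy.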

So far, metadata is intended to be used only by users and by plugins.

BTW, I'm happy to support something like this in Beancount itself, that
functionality could move out of Fava, it's not web-specific.


(At this stage this thread is longer than it would take time to program
it...)




>    I guess that if we inform Beancount of which metadata entries point
>    to documents, it will