Re: [PATCH v3] Re: [BUG] org-attach-id-ts-folder-format fails on customized IDs [9.6 (9.6-??-2e9999783)]

2022-08-13 Thread Ihor Radchenko
Max Nikulin  writes:

>> I do not think that it is a good idea. We currently promise to use the
>> first function in the list to generate the IDs. Other functions are
>> mostly a fallback to catch the existing attach dirs created in the past
>> using different ID scheme.
>
> My idea is to add an extra check that filters out nil values. I do not 
> mind to signal a user error if first entry from 
> `org-attach-id-to-path-function-list' returns nil and paths returned by 
> other function do not exist.

This would be possible, in addition to the patch above.
Or do you mean the default ts/uuid format functions to return nil to
short IDs as well?

>> I guess that one option could be demanding a non-nil return value when
>> generating a new attach dir (without TRY-ALL argument) and wrapping
>> things into (ignore-errors ...) otherwise.
>
> I am against `ignore-errors' because it makes harder bug hunting using 
> debug on errors.

Agreed.

> I would prefer to force users to justify strategy for non-standard IDs 
> as early as possible. The goal is to avoid manual actions to change 
> directory structure when users have hundreds of folders. My experience 
> that users may easily face performance issues with ~1000 files per 
> directory. The reason is either lack of optimization for such case in 
> particular application or a bug introduced in a new version.

I am not sure if we can solve this kind of issue reliably. The only idea
is enforcing [1-9][0-9]\{5\} for ts IDs. And we cannot do much about IDs
prefixed with the same constant string.

-- 
Ihor Radchenko,
Org mode contributor,
Learn more about Org mode at https://orgmode.org/.
Support Org development at https://liberapay.com/org-mode,
or support my work at https://liberapay.com/yantar92



Re: [PATCH v3] Re: [BUG] org-attach-id-ts-folder-format fails on customized IDs [9.6 (9.6-??-2e9999783)]

2022-08-13 Thread Max Nikulin

On 13/08/2022 12:29, Ihor Radchenko wrote:

Max Nikulin writes:


I believe that interaction between `org-attach-dir-from-id' and members
of `org-attach-id-to-path-function-list' could be improved. If a too
short ID is passed to `org-attach-id-uuid-folder-format',
`org-attach-id-ts-folder-format', or a user-defined function, they may
return nil and `org-attach-dir-from-id' should try next element from the
list. If nothing succeeds then a user error should be signaled demanding
to implement a strategy to properly deal with peculiar IDs.


I do not think that it is a good idea. We currently promise to use the
first function in the list to generate the IDs. Other functions are
mostly a fallback to catch the existing attach dirs created in the past
using different ID scheme.


My idea is to add an extra check that filters out nil values. I do not 
mind to signal a user error if first entry from 
`org-attach-id-to-path-function-list' returns nil and paths returned by 
other function do not exist.



I guess that one option could be demanding a non-nil return value when
generating a new attach dir (without TRY-ALL argument) and wrapping
things into (ignore-errors ...) otherwise.


I am against `ignore-errors' because it makes harder bug hunting using 
debug on errors.


I would prefer to force users to justify strategy for non-standard IDs 
as early as possible. The goal is to avoid manual actions to change 
directory structure when users have hundreds of folders. My experience 
that users may easily face performance issues with ~1000 files per 
directory. The reason is either lack of optimization for such case in 
particular application or a bug introduced in a new version.





Re: [PATCH v3] Re: [BUG] org-attach-id-ts-folder-format fails on customized IDs [9.6 (9.6-??-2e9999783)]

2022-08-12 Thread Ihor Radchenko
Max Nikulin  writes:

> was a really bad one. It leads to a separate directory for each short 
> ID. However I still believe that the purpose of 
> `org-attach-id-ts-folder-format' is avoid concentration of huge number 
> of file in a single directory. Distribution of attachments over 
> subdirectories is not perfectly even but it still works reasonably well 
> for long-lasting projects while IDs follow assumed format and month is 
> included into directory names.

This is only true for uuid-style. Also, this is likely not the problem
we need to worry about. To be a problem, the number of attach dirs
should be really large, which is only relevant when thousands of attach
dirs are cramped into a single parent directory. Thousands of attach
dirs would probably correspond to tens of thousands of headings, which
is not something we commonly have and, honestly, Org will have much more
serious issues when the number of headings is that large.

If someone ends up complaining about too many attach dirs, we can always
ask to adjust the ID naming scheme.

The previous paragraph appears to be consistent with

> Back to the original problem. First of all, I believe that if a user 
> decided to use non-standard IDs then it is their responsibility to 
> customize `org-attach-id-to-path-function-list' and to provide a 
> function that implements alternative layout of attachment files.
> Depending on expected amount, they may be put into single directory, 
> spread using some hash function, or accordingly to project structure. So 
> today I am against default fallback directories especially in 
> `org-attach-id-ts-folder-format'.

---

> I believe that interaction between `org-attach-dir-from-id' and members 
> of `org-attach-id-to-path-function-list' could be improved. If a too 
> short ID is passed to `org-attach-id-uuid-folder-format', 
> `org-attach-id-ts-folder-format', or a user-defined function, they may 
> return nil and `org-attach-dir-from-id' should try next element from the 
> list. If nothing succeeds then a user error should be signaled demanding 
> to implement a strategy to properly deal with peculiar IDs.

I do not think that it is a good idea. We currently promise to use the
first function in the list to generate the IDs. Other functions are
mostly a fallback to catch the existing attach dirs created in the past
using different ID scheme.

I guess that one option could be demanding a non-nil return value when
generating a new attach dir (without TRY-ALL argument) and wrapping
things into (ignore-errors ...) otherwise.

> There should be no problem to put single character IDs into a common 
> directory (UUID case), but there are enough variants for 5 characters 
> long names to create delayed performance problem. The worst case is when 
> a user decides to use some common prefix, e.g. "id-" before UUID and all 
> files go to the same directory. On the other hand I do not think that 
> code should detect such cases.

We should not worry about this. It will be the user responsibility to
create the relevant custom function.

-- 
Ihor Radchenko,
Org mode contributor,
Learn more about Org mode at https://orgmode.org/.
Support Org development at https://liberapay.com/org-mode,
or support my work at https://liberapay.com/yantar92



Re: [PATCH v3] Re: [BUG] org-attach-id-ts-folder-format fails on customized IDs [9.6 (9.6-??-2e9999783)]

2022-08-11 Thread Max Nikulin

On 11/08/2022 11:19, Ihor Radchenko wrote:

Max Nikulin writes:


I slightly dislike the "___xx" compared to "__" because it will
create a proliferation of top-level folders as opposed to cramping the
non-standard IDs into a single "__" folder.


I believed that proliferation of folders is for purpose. Intermediate
directories allows to avoid excessive number of files in single
directory. ext4 with directory tree index usually is not the case, but
other filesystems may have rather poor performance when too much files
are stuffed into single folder. Some applications become really slow for
huge directories.


I was referring to TS style timestamp resolver here. It is designed to
group directories by creation time, not to distribute them homogeneously.


My bad, I have realizes that my idea of mapping

"x" -> "__x/x"
"xy" -> "_xy/xy"

was a really bad one. It leads to a separate directory for each short 
ID. However I still believe that the purpose of 
`org-attach-id-ts-folder-format' is avoid concentration of huge number 
of file in a single directory. Distribution of attachments over 
subdirectories is not perfectly even but it still works reasonably well 
for long-lasting projects while IDs follow assumed format and month is 
included into directory names.


Back to the original problem. First of all, I believe that if a user 
decided to use non-standard IDs then it is their responsibility to 
customize `org-attach-id-to-path-function-list' and to provide a 
function that implements alternative layout of attachment files. 
Depending on expected amount, they may be put into single directory, 
spread using some hash function, or accordingly to project structure. So 
today I am against default fallback directories especially in 
`org-attach-id-ts-folder-format'.


I believe that interaction between `org-attach-dir-from-id' and members 
of `org-attach-id-to-path-function-list' could be improved. If a too 
short ID is passed to `org-attach-id-uuid-folder-format', 
`org-attach-id-ts-folder-format', or a user-defined function, they may 
return nil and `org-attach-dir-from-id' should try next element from the 
list. If nothing succeeds then a user error should be signaled demanding 
to implement a strategy to properly deal with peculiar IDs.


There should be no problem to put single character IDs into a common 
directory (UUID case), but there are enough variants for 5 characters 
long names to create delayed performance problem. The worst case is when 
a user decides to use some common prefix, e.g. "id-" before UUID and all 
files go to the same directory. On the other hand I do not think that 
code should detect such cases.




Re: [PATCH v3] Re: [BUG] org-attach-id-ts-folder-format fails on customized IDs [9.6 (9.6-??-2e9999783)]

2022-08-10 Thread Ihor Radchenko
Max Nikulin  writes:

>>> Please, do not use path components starting with dash, it is terrible
>>> for CLI tools. By the way, you promised underscores, not dashes.
>> 
>> Why?
>
> It is inconvenient in interactive sessions (in scripts appropriate 
> measures must be taken anyway), the following does not list content of 
> the "--" directory:
>
>  ls -l --

Thanks for the clarification.

>> I slightly dislike the "___xx" compared to "__" because it will
>> create a proliferation of top-level folders as opposed to cramping the
>> non-standard IDs into a single "__" folder.
>
> I believed that proliferation of folders is for purpose. Intermediate 
> directories allows to avoid excessive number of files in single 
> directory. ext4 with directory tree index usually is not the case, but 
> other filesystems may have rather poor performance when too much files 
> are stuffed into single folder. Some applications become really slow for 
> huge directories.

I was referring to TS style timestamp resolver here. It is designed to
group directories by creation time, not to distribute them homogeneously.

> On the other hand there are not so many distinct file names that may 
> fall into "__" directory, so perhaps it is OK.

Yes. Such short IDs are not supposed to be common and even if they are,
the maximum possible number of such IDs is relatively small. 


-- 
Ihor Radchenko,
Org mode contributor,
Learn more about Org mode at https://orgmode.org/.
Support Org development at https://liberapay.com/org-mode,
or support my work at https://liberapay.com/yantar92



Re: [PATCH v3] Re: [BUG] org-attach-id-ts-folder-format fails on customized IDs [9.6 (9.6-??-2e9999783)]

2022-08-10 Thread Max Nikulin

On 10/08/2022 20:23, Ihor Radchenko wrote:

Max Nikulin writes:


+the path."
+  (if (< (length id) 3)
+  (format "--/%s" id)


Please, do not use path components starting with dash, it is terrible
for CLI tools. By the way, you promised underscores, not dashes.


Why?


It is inconvenient in interactive sessions (in scripts appropriate 
measures must be taken anyway), the following does not list content of 
the "--" directory:


ls -l --


I slightly dislike the "___xx" compared to "__" because it will
create a proliferation of top-level folders as opposed to cramping the
non-standard IDs into a single "__" folder.


I believed that proliferation of folders is for purpose. Intermediate 
directories allows to avoid excessive number of files in single 
directory. ext4 with directory tree index usually is not the case, but 
other filesystems may have rather poor performance when too much files 
are stuffed into single folder. Some applications become really slow for 
huge directories.


On the other hand there are not so many distinct file names that may 
fall into "__" directory, so perhaps it is OK. Concerning "unknown" I am 
less sure.


I do not have strong opinion how to properly deal with short IDs.




[PATCH v3] Re: [BUG] org-attach-id-ts-folder-format fails on customized IDs [9.6 (9.6-??-2e9999783)]

2022-08-10 Thread Ihor Radchenko
Max Nikulin  writes:

> On 10/08/2022 18:43, Ihor Radchenko wrote:
>> Ihor Radchenko writes:
>> 
>> I have updated the patch to use "__/id" when id is too short.
>> Any objections?
>> 
>> +When ID is too short (less than 3 chars), use its md5 hash to create
>
> Misleading docs, you do not use md5.

Oops. Will fix.

>> +the path."
>> +  (if (< (length id) 3)
>> +  (format "--/%s" id)
>
> Please, do not use path components starting with dash, it is terrible 
> for CLI tools. By the way, you promised underscores, not dashes.

Why?

I have no opinion about the possible dummy folder name, except that it
should fit the general pattern we already have "xx/..." or "MM/...".

>> +(format "%s/%s"
>> +(substring id 0 2)
>> +(substring id 2
>
> Ihor, I have not look into the code around, so my suggestions may have 
> no sense.
>
> Is it possible to pass empty string as ID to these functions? Should it 
> be explicitly checked?

It should not be possible under normal operation.

> What if ID contains "/" that can not be used in file name? Windows has 
> more forbidden characters.

Then, creating the attachment dir will fail in front of the user. I am
not sure if we really need to do much about this. At least, not until we
get a bug report.

If we really need to solve the edge cases with attach dir generation, we
may need to change the overall design in org-attach/org-id to something
more restrictive. I am not sure if it is worth the effort compared to
other directions where to improve Org.

> Do you expect any problem if here (and for timestamp-based ids) 
> directory component is just padded with some character (unsure what can 
> better than underscore) to required length?
>
> "x" -> "_x/x" or "__x/x"
> "xy" -> "xy/xy" or "_xy/xy"
> etc.
>
>  From my point of view it might be a bit better than "__" and "unknown".

Does it make sense?
Timestamp-based IDs are like 20220810T210121.478800 by default.
Then, the top-level folders will be like 202208/...
If the actual format is different, we cannot possibly expect the
timestamp to follow the same pattern, which is why I chose "unknown"
referring to unknown date.

On the other hand, if an ID has more than 6 characters, it will generate
nonsense top-level folder with or without the patch. We could go further
and match the ID against [1-9][0-9]\{5\} and put the folder into
"unknown/ID" instead, but that would be a breaking change.

In summary, I am more of less neutral towards this fallback format,
except that it would be nice to be able to recover the ID from the
directory path. Padding should be recoverable.

I slightly dislike the "___xx" compared to "__" because it will
create a proliferation of top-level folders as opposed to cramping the
non-standard IDs into a single "__" folder.

Attaching the updated patch.

>From 50d0f9de0acdf5d67b797476816cbeb40b19f554 Mon Sep 17 00:00:00 2001
Message-Id: <50d0f9de0acdf5d67b797476816cbeb40b19f554.1660137585.git.yanta...@gmail.com>
From: Ihor Radchenko 
Date: Sat, 23 Jul 2022 13:13:24 +0800
Subject: [PATCH v3] org-attach-dir-from-id: Do not rely on ID being over 6
 chars long

* lisp/org-attach.el (org-attach-id-uuid-folder-format): Fall back to
"__/ID" when the ID contains 2 chars or less and cannot be split into
the "xy/z" path.
(org-attach-id-ts-folder-format): Fall back to "__/ID" path format
when the ID contains less than 7 chars and cannot be split into the
"MM/rest" path.

Fixes https://orgmode.org/list/KC8PcypJapBpJQtJxM0kX5N7Z0THL2Lq6EQjBMzpw1-vgQf72egZ2JOIlTbPYiqAVD4MdSBhrhBZr2Ykf5DN1mocm1ANvvuKKZShlkgzKYM=@pm.me
---
 lisp/org-attach.el | 24 
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/lisp/org-attach.el b/lisp/org-attach.el
index fe49af6f3..0f5d5af82 100644
--- a/lisp/org-attach.el
+++ b/lisp/org-attach.el
@@ -159,19 +159,27 @@ (defcustom org-attach-archive-delete nil
 (defun org-attach-id-uuid-folder-format (id)
   "Translate an UUID ID into a folder-path.
 Default format for how Org translates ID properties to a path for
-attachments.  Useful if ID is generated with UUID."
-  (format "%s/%s"
-	  (substring id 0 2)
-	  (substring id 2)))
+attachments.  Useful if ID is generated with UUID.
+
+When ID is too short (less than 3 chars), return \"__/ID\"."
+  (if (< (length id) 3)
+  (format "__/%s" id)
+(format "%s/%s"
+	(substring id 0 2)
+	(substring id 2
 
 (defun org-attach-id-ts-folder-format (id)
   "Translate an ID based on a timestamp to a folder-path.
 Useful way of translation if ID is generated based on ISO8601
 timestamp.  Splits the attachment folder hierarchy into
-year-month, the rest."
-  (format "%s/%s"
-	  (substring id 0 6)
-	  (substring id 6)))
+year-month, the rest.
+
+When ID is too short (less than 7 chars), return \"__/ID\"."
+  (if (< (length id) 7)
+  (format "__/%s" id)
+(format "%s/%s"
+	(substring id 0 6)
+	(substring id 6
 
 (defcustom