Re: Yet another browser extension for capturing notes - LinkRemark

2020-12-26 Thread Ihor Radchenko
Maxim Nikulin  writes:

> I just inspected pages on several sites using developer tools and added
> code that handles noticed elements.

I see. I basically did the same, except some minimal support for
OpenGraph (though I stopped when I saw that even YouTube is not
following the standard, except the most basic fields).

> The only force to add some formal data is "share" buttons. Maybe some
> guides for web developers from social networks or search engines could
> be more useful than formal references, but I have not had a closer
> look.

It is also consistent with what I saw. Such fields seem to be very common.

>> Also, org-capture-ref does not really force the user to put BiBTeX into
>> the capture. Individual metadata fields are available using
>> org-capture-ref-get-bibtex-field (which extracts data from internal
>> alist structure). It's just that I mostly had BiBTeX in mind (with
>> distant goal of supporting export to LaTeX) for my use-cases.
>
> I do not have a clear vision of how to use the collected data for queries. 
> Certainly I want to have a more human-friendly representation than BibTeX 
> entries (maybe in addition to machine-parsable data) adjacent to my notes.

So far, I found author, website name, publication year, title, and
resource type useful. My standard capture template for links is:

* <author> [<website>] (<year>) Title

Example:

* dash-docs-el [Github] Dash-Docs-El Helm-Dash: Browse Dash Docsets Inside Emacs

Such headlines can be easily searched later, especially when I also add
some #keywords manually.
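
For illustration, a minimal sketch of such a template (not my actual
setup; whether org-capture-ref exposes the fields under keys like
:author, :howpublished, :year, and :title is an assumption):

;; Heading layout: author [site] (year) title; the %(...) forms are
;; evaluated at capture time.
(setq org-capture-templates
      '(("l" "Link" entry (file "links.org")
         "* %(org-capture-ref-get-bibtex-field :author) [%(org-capture-ref-get-bibtex-field :howpublished)] (%(org-capture-ref-get-bibtex-field :year)) %(org-capture-ref-get-bibtex-field :title)")))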

> Personally, I would prefer to avoid http queries from Emacs. Sometimes 
> it is better to have current DOM state, not page source, that is why I 
> decided to gather data inside browser, despite security fences that are 
> placed quite strangely in some cases.

Completely agree here. That's why I directly reuse the current DOM state
from qutebrowser in my own setup. However, the qutebrowser extension was
easy for me to write, as it can simply be a bash script. I know nothing
about Firefox/Chrome extensions and I do not know JavaScript.

On the other hand, having the ability to fetch HTML is still useful in my
case (an Emacs package) when the capture is not done from the browser. For
example, I often capture links from elfeed, and an HTTP query from Emacs is
useful then.

>  From my point of view, you should be happy with any of the projects you 
> mentioned below. Do all of them have problems critical for you?

They are all JavaScript, except one (unicontent), which can be easily
replaced with built-in Elisp libraries (dom.el).
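
As a rough sketch of the dom.el route (og:title is just an example
property; error handling is omitted):

(require 'cl-lib)
(require 'dom)
(require 'url)

(defun my/fetch-og-title (url)
  "Fetch URL and return the content of its og:title meta element, if any."
  (let (dom)
    (with-current-buffer (url-retrieve-synchronously url t)
      (goto-char (point-min))
      (re-search-forward "\n\n" nil t)      ; skip the HTTP headers
      (setq dom (libxml-parse-html-region (point) (point-max)))
      (kill-buffer))
    (cl-some (lambda (meta)
               (when (equal (dom-attr meta 'property) "og:title")
                 (dom-attr meta 'content)))
             (dom-by-tag dom 'meta))))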

>> Finally, would you be interested to join efforts on metadata parsing?
>
> Could you please share a bit more detail on your ideas?

> Technically it should be possible to push e.g. raw 
> document.head.innerHTML to any external metadata parser using native 
> messaging (to deal with sites requiring authorization). However it could 
> cause an alarm during review before publication of the extension to the 
> browser catalogues.

That's unfortunate. Pushing raw html/dom is what I had in mind when
talking about joining efforts.

Another idea would be providing a callback from elisp to browser (I am
not sure if it is possible). org-capture-ref has a mechanism to check if
the link was captured in the past. If the link is already captured, the
information about the link location and todo-state can be messaged back
to the browser.

Example message (only qutebrowser is supported now):

Bookmark not saved!
Already captured into org-capture-ref: TODO maxnikulin [Github] linkremark:
LinkRemark - page or link notes with context
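
To make the idea more concrete, a rough sketch of the Emacs side of such a
check (this is not org-capture-ref's actual code, and org-agenda-files is
only a stand-in for wherever the captured notes live):

(require 'org)

(defun my/link-captured-p (url)
  "Return (TODO-STATE . HEADING) for the entry containing URL, or nil."
  (catch 'found
    (dolist (file (org-agenda-files))
      (with-current-buffer (find-file-noselect file)
        (org-with-wide-buffer
         (goto-char (point-min))
         (when (search-forward url nil t)
           (throw 'found (cons (org-get-todo-state)
                               (org-get-heading t t t t)))))))))

The returned pair could then be serialized and sent back over whatever
channel the browser side offers (native messaging, or the qutebrowser
userscript interface in my case).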

> There is some room for improvement, but I do not think that quality of
> metadata for ordinary sites could be dramatically better. The case
> that is not handled at all is scientific publications; unfortunately I
> currently have quite little interest in it. Definitely results
> should be stored in some structured format such as BibTeX. I have seen
> huge ld+json elements describing even all references. Certainly such
> lists are not for general-purpose notes (at least without explicit
> request from the user), they should be handled by some bibliography
> software to display citation graphs in the local library. On the other
> hand it is not a problem to feed such data to some tool using native
> messaging protocol. I have no idea whether various publishers provide
> such data in a uniform way, I just hope that pressure from citation
> indices and bibliography management software has positive influence on
> standardization.

I think https://github.com/microlinkhq/metascraper#core-rules can be
used for ideas. It has generic parsing apart from site-specific rules.

For the scientific publications, the key point is usually getting
DOI/ISBN. Then, most of the metadata can be obtained using standard API
of doi.org or various ISBN databases. In addition, reference data is
generally available in OpenCitations.net (they also have all kinds of
web APIs).
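
As an illustration, fetching BibTeX for a known DOI via doi.org content
negotiation can be as simple as the sketch below (no error handling, and
the exact Accept header support may vary between registration agencies):

(require 'url)

(defun my/doi-to-bibtex (doi)
  "Return a BibTeX entry for DOI using doi.org content negotiation."
  (let ((url-request-extra-headers '(("Accept" . "application/x-bibtex"))))
    (with-current-buffer (url-retrieve-synchronously
                          (concat "https://doi.org/" doi) t)
      (goto-char (point-min))
      (re-search-forward "\n\n" nil t)      ; skip the HTTP headers
      (prog1 (buffer-substring-no-properties (point) (point-max))
        (kill-buffer)))))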

Also, do you pass any of the parsed metadata 

Re: [9.4] Fixing logbook visibility during isearch

2020-12-26 Thread Ihor Radchenko
Kévin Le Gouguec  writes:

> Ihor Radchenko  writes:
>
>> Kévin Le Gouguec  writes:
>>
>>> 1.2. stumps me: is there an isearch API I can use while in the callback
>>> to know where the matches are located?
>>
>> I do not think that there is direct API for this, but the match should
>> be accessible through match-beginning/match-end, as I can see from the
>> isearch.el code.
>
> Right, I've seen this too; I wonder if it's a hard guarantee or an
> implementation detail.  I might page help-gnu-emacs about this.

Another way could be using isearch-filter-predicate. It is given the
search region directly.
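
Something along these lines, as a rough sketch (whether org-show-context
with the 'isearch key is the right way to reveal the fold is an
assumption; the default predicate already handles plain visibility):

(defun my/isearch-reveal-org-match (beg _end)
  "Accept the match starting at BEG, revealing Org folds around it."
  (save-excursion
    (goto-char beg)
    (when (and (derived-mode-p 'org-mode)
               (outline-invisible-p))
      (org-show-context 'isearch)))
  t)

(add-function :after-while isearch-filter-predicate
              #'my/isearch-reveal-org-match)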



Re: Yet another browser extension for capturing notes - LinkRemark

2020-12-26 Thread Maxim Nikulin

On 25/12/2020, Ihor Radchenko wrote:


> Reading through the code, I can see that you are familiar with metadata
> conventions. Do you know good references about what og: metadata is
> commonly used? I looked through the official OpenGraph specification,
> but popular websites appear to ignore most of the conventions.


I just inspected pages on several sites using developer tools and added
code that handles noticed elements.

I have not tried to find any resources on metadata (OK, once I searched 
for LD+JSON; essentially the outcome was a link to schema.org, which I had 
already seen in the data). Looking into page sources, I realized that 
almost nobody cares whether the site has metadata of appropriate quality. I 
think search engines are advanced enough to work without metadata and 
even decrease page rank if something suspicious was added by SEO. The 
only force to add some formal data is "share" buttons. Maybe some guides 
for web developers from social networks or search engines could be more 
useful than formal references, but I have not had a closer look.



> Also, org-capture-ref does not really force the user to put BiBTeX into
> the capture. Individual metadata fields are available using
> org-capture-ref-get-bibtex-field (which extracts data from internal
> alist structure). It's just that I mostly had BiBTeX in mind (with
> distant goal of supporting export to LaTeX) for my use-cases.


I do not have a clear vision of how to use the collected data for queries. 
Certainly I want to have a more human-friendly representation than BibTeX 
entries (maybe in addition to machine-parsable data) adjacent to my notes.


Personally, I would prefer to avoid http queries from Emacs. Sometimes 
it is better to have current DOM state, not page source, that is why I 
decided to gather data inside browser, despite security fences that are 
placed quite strangely in some cases.


From my point of view, you should be happy with any of the projects you 
mentioned below. Do all of them have problems critical for you?


Technically it should be possible to push e.g. raw 
document.head.innerHTML to any external metadata parser using native 
messaging (to deal with sites requiring authorization). However it could 
cause an alarm during review before publication of the extension to the 
browser catalogues.



> Finally, would you be interested to join efforts on metadata parsing?


Could you please share a bit more detail on your ideas? There is some 
room for improvement, but I do not think that quality of metadata for 
ordinary sites could be dramatically better. The case that is not 
handled at all is scientific publications; unfortunately I currently 
have quite little interest in it. Definitely results should be stored in 
some structured format such as BibTeX. I have seen huge ld+json elements 
describing even all references. Certainly such lists are not for 
general-purpose notes (at least without explicit request from the user), 
they should be handled by some bibliography software to display citation 
graphs in the local library. On the other hand it is not a problem to 
feed such data to some tool using native messaging protocol. I have no 
idea whether various publishers provide such data in a uniform way, I just 
hope that pressure from citation indices and bibliography management 
software has positive influence on standardization.


I am not going to blow up the code with recipes for particular sites. 
However, I realize that some special cases still should be handled. I am 
not ready to adopt the user script model used by 
Greasemonkey/Violentmonkey/Tampermonkey. I believe it is better to 
create dedicated extension(s) that either add and overwrite existing 
meta elements or allow querying the gathered data through the sendMessage 
WebExtensions interface. By the way, scripts for the above-mentioned 
extensions could be used as well. It should alleviate cases when some 
site with insane metadata is important for a particular user.



> P.S. Some links I collected myself when working on org-capture-ref. They
> might also be of interest for you:
>
> - https://github.com/ageitgey/node-unfluff
> - https://github.com/gabceb/node-metainspector
> - https://github.com/wikimedia/html-metadata
> - https://github.com/microlinkhq/metascraper
> - https://github.com/hboisgibault/unicontent


Thank you for the links. I should have a closer look at those projects. 
E.g. I considered itemprop="author" elements but postponed 
implementation of such features. For some reason I did not even try to 
find existing projects for metadata extraction. Maybe I still hope that a 
quite simple implementation could handle most of the cases.





Re: [9.4] Fixing logbook visibility during isearch

2020-12-26 Thread Kévin Le Gouguec
Ihor Radchenko  writes:

> Kévin Le Gouguec  writes:
>
>> 1.2. stumps me: is there an isearch API I can use while in the callback
>> to know where the matches are located?
>
> I do not think that there is direct API for this, but the match should
> be accessible through match-beginning/match-end, as I can see from the
> isearch.el code.

Right, I've seen this too; I wonder if it's a hard guarantee or an
implementation detail.  I might page help-gnu-emacs about this.



Re: Yet another browser extension for capturing notes - LinkRemark

2020-12-26 Thread Maxim Nikulin

On 26/12/2020, Samuel Wales wrote:


> [... i can imagine great things possible with such extensions. for
> example, you could have sets of tabs, selected by right click in
> firefox, to save to a bunch of org entries.  then you could load that
> particular set of entries into firefox whenever you want.  and you
> could keep notes on each page and move the entries wherever you want.
> this would be useful for such things as "i am researching rice
> cookers; these are my tabs, but i don't want them cluttering firefox
> and i want them with my org notes and to make notes on them and will
> re-load them into firefox when i want to revisit".]


It should be possible, since some tab management extensions were used by 
Mozilla to evaluate whether WebExtensions were mature enough and whether 
support for XUL add-ons could be dropped. On the other hand, do not expect 
such a feature soon. A kind of semi-blocker is the absence of automatic 
tests to run before every release, and adding them will require a lot of time.


In the meantime, have you looked at the following comment?
https://github.com/sprig/org-capture-extension/issues/12#issuecomment-323569334
alphapapa commented Aug 20, 2017


You can do this with the "Copy all URLs" extension (ID:
djdmadneanknadilpjiknlnanaolmbfk). Use this as the custom format (note
the linebreak):

[[$url][$title]]


I am almost sure that a similar extension exists for Firefox as well.

Some points should be clarified, in my opinion:

- Do you expect that metadata should be captured in addition to URLs and 
titles? Browsers can unload some tabs, making page content unavailable.
- Are you going to capture reviews of "rice cookers" that could be 
considered ordinary pages, or are you going to save items from online 
stores? I do not know the current state of affairs, but I have heard about 
some activity around special metadata that allows search engines to display 
products in a special way. Could you check, using page source or 
inspect-element tools, whether the head element of pages in your favorite 
stores contains the desired metadata?
- Should a tab group be captured as a single Org heading, or should it be 
a tree with a section per tab? I am not sure that capture will have no 
problems with a subtree. Certainly the Emacs interface for org-protocol + 
capture is not suitable for sending each tab as a separate link. 
Another option is to create nested lists; anyway, the Org formatter in my 
extension needs improvements. Are you expecting a headings subtree or 
nested lists? (See the example after this list.)
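
For the sake of discussion, the two shapes I have in mind look roughly
like this (made-up content):

* Rice cookers (tab group)
** [[https://example.com/review][Some review]]
** [[https://example.com/shop][Some item in a store]]

versus

* Rice cookers (tab group)
  - [[https://example.com/review][Some review]]
  - [[https://example.com/shop][Some item in a store]]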



> [now if i can only debug the extra-blank-lines-in-capture problem.]


I fully agree that it is really annoying. It is among the high-priority 
items in my TODO list.


Accidentally I pressed =C-x C-o= and discovered 
[[help:delete-blank-lines]].

innerText is not exactly the same as a selection range's toString(), but 
the rules could work in a similar way. Table rows and floating or 
absolutely positioned elements require newlines. Such elements are often 
abused by designers.

https://html.spec.whatwg.org/multipage/dom.html#dom-innertext