Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2017-01-05 Thread Lydia Pintscher
Hey folks :)

Andy and Pasleim just brought this topic to my attention again. Sorry
for having dropped the ball a bit.
I've created https://phabricator.wikimedia.org/T154660 with a strawman
proposal for the still open question of which length it should be.
Please add your arguments there.


Cheers
Lydia

-- 
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-10-08 Thread Egon Willighagen
Dear Thomas,

On Sat, Oct 8, 2016 at 12:07 PM, Thomas Douillard <
thomas.douill...@gmail.com> wrote:

> Probably a silly question but ... did you all consider creating a datatype
> for molecule representation? This seems to be a very similar use case to
> mathematical formulas. Essentially we're not dealing with a raw string but a
> representation of molecule formulas, with its own encoding ...
>

The InChI is actually not a structural representation, but a derived unique
identifier.

What you propose would, however, apply to the SMILES. That one is generally
of about the same size as the InChI, and there your solution sounds like a
great idea!

Egon




-- 
E.L. Willighagen
Department of Bioinformatics - BiGCaT
Maastricht University (http://www.bigcat.unimaas.nl/)
Homepage: http://egonw.github.com/
LinkedIn: http://se.linkedin.com/in/egonw
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers
ORCID: -0001-7542-0286
ImpactStory: https://impactstory.org/u/egonwillighagen


Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-10-08 Thread Daniel Kinzler
That was discussed and declined a while ago, see
. Though I think the proposed
realization was presentational rather than functional. I'll have to re-read the
discussion, though.

On 08.10.2016 at 12:07, Thomas Douillard wrote:
> Probably a silly question but ... did you all consider creating a datatype for
> molecule representation? This seems to be a very similar use case to
> mathematical formulas. Essentially we're not dealing with a raw string but a
> representation of molecule formulas, with its own encoding ...
>
> Changing the limit seems to be a poor workaround compared to a dedicated
> datatype - nobody seems to have found a relevant use case and it seems to me
> that we're essentially abusing strings for storing blobs ...


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.



Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-10-08 Thread Thomas Douillard
Probably a silly question but ... did you all consider creating a datatype
for molecule representation? This seems to be a very similar use case to
mathematical formulas. Essentially we're not dealing with a raw string but a
representation of molecule formulas, with its own encoding ...

Changing the limit seems to be a poor workaround compared to a dedicated
datatype - nobody seems to have found a relevant use case and it seems to me
that we're essentially abusing strings for storing blobs ...



Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-10-08 Thread Egon Willighagen
On Sat, Oct 8, 2016 at 11:28 AM, Lydia Pintscher <
lydia.pintsc...@wikimedia.de> wrote:

> On Sat, Oct 8, 2016 at 11:23 AM, Egon Willighagen
>  wrote:
> > Ah, those numbers are for https://www.wikidata.org/wiki/Property:P234 ...
>
> External identifier then. Cool. And for string like in
> https://www.wikidata.org/wiki/Property:P233? Sebastian's initial email
> says 1500 to 2000. Is this still a good number after this discussion?

Yes, that would cover more than 99.9% of all InChIs in PubChem. (See
Sebastian's reply earlier in this thread.)

Egon



Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-10-08 Thread Lydia Pintscher
On Sat, Oct 8, 2016 at 11:23 AM, Egon Willighagen
 wrote:
> Ah, those numbers are for https://www.wikidata.org/wiki/Property:P234 ...

External identifier then. Cool. And for string like in
https://www.wikidata.org/wiki/Property:P233? Sebastian's initial email
says 1500 to 2000. Is this still a good number after this discussion?


Cheers
Lydia



Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-10-08 Thread Egon Willighagen
On Sat, Oct 8, 2016 at 11:19 AM, Lydia Pintscher <
lydia.pintsc...@wikimedia.de> wrote:

> On Sat, Oct 8, 2016 at 11:14 AM, Egon Willighagen
>  wrote:
> > For small compounds this is answered by Sebastian's analysis... 5K would
> > cover all currently known small molecules. 1K would cover 99.9%.
>
> Ok. That is for strings, correct? Input for other use cases?


Ah, those numbers are for https://www.wikidata.org/wiki/Property:P234 ...

Egon



Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-10-08 Thread Egon Willighagen
On Sat, Oct 8, 2016 at 11:07 AM, Lydia Pintscher <
lydia.pintsc...@wikimedia.de> wrote:
>
> Based on this my proposal is to increase string and URL and
> potentially external identifier if you request it. One open question
> is still what the new limit should be.
>

For small compounds this is answered by Sebastian's analysis... 5K would
cover all currently known small molecules. 1K would cover 99.9%.
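
As a rough illustration of how such coverage figures can be derived (this is
a sketch, not Sebastian's actual script, which is not shown in this thread),
the limit needed for a given coverage can be read off the sorted string
lengths with a nearest-rank percentile. The function names and the sample
lengths below are made up for illustration.

```python
import math


def length_at_coverage(lengths, fraction):
    """Return the smallest limit that fits at least `fraction` of the strings.

    Uses the nearest-rank percentile on the sorted lengths.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(lengths)
    rank = math.ceil(fraction * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


def coverage(lengths, limit):
    """Return the share of strings whose length is at most `limit`."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n <= limit) / len(lengths)


if __name__ == "__main__":
    # Made-up identifier lengths, NOT real PubChem data.
    sample = [12, 35, 48, 60, 75, 90, 120, 250, 410, 1300]
    for fraction in (0.5, 0.9, 1.0):
        print(fraction, length_at_coverage(sample, fraction))
    print("share fitting in 400 chars:", coverage(sample, 400))
```

Run over the real list of InChI or SMILES lengths, the first function would
give the limit for a target coverage and the second the coverage for a
candidate limit.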

Lydia, do I understand that a formal request needs to be filed? Who will do
that?

Egon



Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-10-08 Thread Lydia Pintscher
Hi everyone,

I've been thinking more about this and we also discussed this within
the development team. Here's my thinking at this point:

* We do have data that you all want to see in Wikidata that is
currently prevented by the limit. That is not good.
* I agree that the general understanding of all of us is very good
when it comes to Wikidata not being the place to store long free
texts. However I still fear that especially new people initially do
not understand this. We could mitigate this by for example giving the
user a hint when their input is getting too long even if it is still
within the limit. Twitter does this in a nice way when you are getting
close to the 140 character limit. However that is not implemented
right now.
* I do worry about licensing and copyright issues, especially with the
following properties:
https://www.wikidata.org/wiki/Property:P2795
https://www.wikidata.org/wiki/Property:P1683
https://www.wikidata.org/wiki/Property:P1684
https://www.wikidata.org/wiki/Property:P2315
I took a rough survey of the properties that look potentially
troublesome to me, and it seems they are all monolingual text. I am not
worried about increasing the limit for external identifier and URL. It
looks like string is also OK-ish at this point in time.

Based on this my proposal is to increase string and URL and
potentially external identifier if you request it. One open question
is still what the new limit should be.
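
The hint idea from the second bullet above could look roughly like the
sketch below. It is only an illustration, not existing Wikibase code: the
function name, the three status strings and the 80% warning threshold are
assumptions; only the 400-character limit comes from this thread.

```python
def check_length(value, hard_limit=400, warn_percent=80):
    """Classify input length as 'ok', 'warn' (close to the limit) or 'too_long'.

    hard_limit is the 400-character limit discussed in this thread;
    warn_percent is a hypothetical threshold for showing a hint.
    Note: len() counts Unicode code points; how the real limit counts
    characters is not stated in this thread.
    """
    if hard_limit <= 0 or not 0 <= warn_percent <= 100:
        raise ValueError("invalid limit or warning threshold")
    # Integer arithmetic avoids floating-point rounding at the boundary.
    warn_from = hard_limit * warn_percent // 100
    length = len(value)
    if length > hard_limit:
        return "too_long"
    if length >= warn_from:
        return "warn"
    return "ok"
```

A user interface could show the hint for "warn" while still accepting the
value, and reject only "too_long".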


Cheers
Lydia



Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-09-23 Thread Egon Willighagen
On Fri, Sep 23, 2016 at 5:53 PM, Denny Vrandečić 
wrote:

> One stupid question: due to the length of these identifiers, and since
> they are not simple opaque identifiers but rather encode semantics -
> if I understand it correctly - could a single such identifier be encoding
> content or ideas which are potentially covered by copyright or patent law?
> Is there some background available on that?
>


Not the InChI. The standard itself is meant to be reused as much as
possible and the software is open source.

Some information here:
http://jcheminf.springeropen.com/articles/10.1186/1758-2946-5-7

Egon






Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-09-23 Thread Denny Vrandečić
Thank you! I am sure that this will help the Wikidata team to make the
right decision. Also, very interesting numbers.

One stupid question: due to the length of these identifiers, and since they
are not simple opaque identifiers but rather encode semantics - if I
understand it correctly - could a single such identifier be encoding
content or ideas which are potentially covered by copyright or patent law?
Is there some background available on that?



Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-09-23 Thread Egon Willighagen
Sebastian, great you found time for it! I didn't :/ (Stats are worth a
tweet, IMHO :)

Egon

On Fri, Sep 23, 2016 at 12:20 PM, Sebastian Burgstaller <
sebastian.burgstal...@gmail.com> wrote:

> Hi Denny,
> Sorry, I missed this email. I just did the calculation for InChI string
> lengths on the 92 million PubChem compounds:
>   99% 99.9%  100%
>   311   676  4502
>
> That said, there is no upper limit for the length, but 4502 is the
> longest string in the PubChem database. The other IDs, canonical and
> isomeric SMILES, have the same distribution shape, but are overall
> slightly shorter.
>
> Best,
> Sebastian





Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-09-19 Thread Markus Kroetzsch

On 19.09.2016 18:12, Lydia Pintscher wrote:
> On Mon, Sep 19, 2016 at 6:19 AM, Denny Vrandečić  wrote:
>> Can you figure out what a good limit would be for these two use cases? I.e.
>> what would support 99%, 99.9%, and 100%?
>
> Yes this would be extremely helpful. In general I agree that we can
> now be more relaxed about this than we were at the beginning because
> you all understand that Wikidata isn't a place to store long free
> text. However I still think we need to have some measures in place.
> One thing we could maybe do is a new datatype for longer text but I'm
> undecided about this yet. I still don't feel too good about making
> every string property several thousand characters long.


I am not excited about having another new datatype for this. The
proposed difference of 400 vs. 2000 chars does not seem so fundamental,
and the limits are rather arbitrary too, so it seems like too much detail
at the user level to name these things in special ways. Datatypes should be
used if they have a benefit to the user (easier input, better display)
and not to enforce constraints. There are very many relevant
constraints, and length is hardly the most important one, so we should
not give it the prominence of having its own type.


Best,

Markus




Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-09-19 Thread Sebastian Burgstaller
Thanks, guys! I am glad to hear that the technical hurdles for
implementation seem to be relatively low. Is there any realistic
timeline by when this could be done?

I agree with Lydia, that not all string properties should allow for
unlimited (or even very many) chars. It would be nice to determine at
property proposal how many chars a certain property should have.
Alternatively, implementing a new data type would also work for us.
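
A per-property limit, as suggested above, could be sketched as a simple
lookup with a default. Everything here is hypothetical: the table, the
function names and the 2000-character values (taken from the upper end of
the 1500 to 2000 range mentioned in this thread) are illustrative
assumptions, not how Wikidata is implemented.

```python
DEFAULT_MAX_LENGTH = 400  # the general limit discussed in this thread

# Hypothetical per-property overrides; the values are illustrative only.
PROPERTY_MAX_LENGTH = {
    "P233": 2000,   # canonical SMILES
    "P2017": 2000,  # isomeric SMILES
    "P234": 2000,   # InChI
}


def max_length_for(property_id):
    """Return the configured limit for a property, or the default."""
    return PROPERTY_MAX_LENGTH.get(property_id, DEFAULT_MAX_LENGTH)


def fits(property_id, value):
    """Return True if `value` is within the limit for `property_id`."""
    return len(value) <= max_length_for(property_id)
```

Properties without an entry keep the default, so raising the limit for a
few chemistry properties would not change any other string property.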

Best,
Sebastian

On Mon, Sep 19, 2016 at 9:12 AM, Lydia Pintscher
 wrote:
> On Mon, Sep 19, 2016 at 6:19 AM, Denny Vrandečić  wrote:
>> Can you figure out what a good limit would be for these two use cases? I.e.
>> what would support 99%, 99.9%, and 100%?
>
> Yes this would be extremely helpful. In general I agree that we can
> now be more relaxed about this than we were at the beginning because
> you all understand that Wikidata isn't a place to store long free
> text. However I still think we need to have some measures in place.
> One thing we could maybe do is a new datatype for longer text but I'm
> undecided about this yet. I still don't feel too good about making
> every string property several thousand characters long.


Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-09-19 Thread Lydia Pintscher
On Mon, Sep 19, 2016 at 6:19 AM, Denny Vrandečić  wrote:
> Can you figure out what a good limit would be for these two use cases? I.e.
> what would support 99%, 99.9%, and 100%?

Yes this would be extremely helpful. In general I agree that we can
now be more relaxed about this than we were at the beginning because
you all understand that Wikidata isn't a place to store long free
text. However I still think we need to have some measures in place.
One thing we could maybe do is a new datatype for longer text but I'm
undecided about this yet. I still don't feel too good about making
every string property several thousand characters long.


Cheers
Lydia



Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-09-18 Thread Egon Willighagen
Hi all,

sorry for joining the party late...

On Tue, Sep 13, 2016 at 11:39 AM, Sebastian Burgstaller
 wrote:
> I think this topic might have been discussed many months ago. For
> certain data types in the chemical compound space (P233, canonical
> SMILES; P2017, isomeric SMILES; and P234, InChI) a higher character
> limit than 400 would be really helpful (1500 to 2000 chars; I sense
> that this might cause problems with SPARQL). Are there any plans on
> implementing this? In general, for quality assurance, many string
> property types would profit from a fixed max string length.

400 characters is not a lot for chemicals... InChIs can indeed be a lot
larger. 2k would allow us to capture a lot more chemicals. BTW,
this also applies to the canonical SMILES, which doesn't have an
upper bound either. Tannic acid (Q427956) is an example (which, looking
at the InChIKey, came up when running the bot :). From working with
ChEMBL as RDF I know it has InChIs of length > 1024, which was the max
length in Virtuoso... I think increasing the limit is important for
biology and chemistry.

Egon

-- 
E.L. Willighagen
Department of Bioinformatics - BiGCaT
Maastricht University (http://www.bigcat.unimaas.nl/)
Homepage: http://egonw.github.com/
LinkedIn: http://se.linkedin.com/in/egonw
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers
ORCID: -0001-7542-0286
ImpactStory: https://impactstory.org/EgonWillighagen



Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-09-17 Thread Hay (Husky)
One other use case for this would be citation URLs. For example, to get
the number of inhabitants of all Dutch municipalities you need an
800-character (1) permalink from the central bureau of statistics.

So this change would be very welcome indeed!

-- Hay

(1): 
http://statline.cbs.nl/Statweb/publication/?VW=T=SLNL=37230NED=0=57-60,63-65,67-71,73,75-77,81-82,84-86,88-91,93-97,100,102-105,109-110,112-119,121,124-126,129-131,133-135,137-143,145-152,154-156,158-159,161-162,165-168,170-171,173-176,178-183,186-196,198,200-201,203-204,206-208,210-214,216,219,221,225-232,234-238,240-241,243-247,249-251,253-257,259-262,264-268,270-276,279,281-285,288-301,303-308,310-311,316-318,320-321,323-326,332-333,335-336,338-340,342,345-348,350-351,353,355-356,358-360,365,367,369,371,373,376-380,382-387,389-394,396-399,401,403-410,412-420,424-430,432,435-438,440-442,444-445,449,452-456,459,461-465,467-469,471,473-475,477-480,483-487,490-494,496-501,503-509,511-518,520-526,528,530-531,534-538,541,543-545,547,549,551-556,559-561,563-564,566-567,569-572,575-578,580-581,583,586-594=182=160518-1036=G2=G1,T

On Tue, Sep 13, 2016 at 11:39 AM, Sebastian Burgstaller
 wrote:
> Hi all,
>
> I think this topic might have been discussed many months ago. For
> certain data types in the chemical compound space (P233 canonical
> SMILES, P2017 isomeric SMILES and P234 InChI) a character limit higher
> than 400 would be really helpful: 1500 to 2000 chars (though I sense
> that this might cause problems with SPARQL). Are there any plans to
> implement this? In general, for quality assurance, many string
> property types would profit from a fixed max string length.
>
> Best,
> Sebastian
>
> Sebastian Burgstaller-Muehlbacher, PhD
> Research Associate
> Andrew Su Lab
> MEM-216, Department of Molecular and Experimental Medicine
> The Scripps Research Institute
> 10550 North Torrey Pines Road
> La Jolla, CA 92037
> @sebotic


Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-09-16 Thread Stas Malyshev
Hi!

> However, given that we now have such a well informed community with
> established practices and good quality checks, it seems unproblematic to
> lift the character limit. I don't think there are major technical
> reasons for having it. Surely, BlazeGraph (the WMF SPARQL engine) should
> not expect texts to be short, and I would be surprised if they did. So I
> would not expect problems on this side.

I don't think there should be much trouble in this department. Unless
one is literally trying to download megabytes of data or millions of
items from a query (which we are working on a solution for, but it's not
there yet), the size of the string doesn't matter much, and there would
probably be no noticeable difference between 400 and 2K strings for most
queries I can think of. Searching within such strings probably won't
work very well, but that's not the intent anyway, as I understand.

The only thing I can think of is that we now store the whole item
as a huge blob in the DB (and consequently load it in memory), so if we
had a lot of huge strings it may have a negative performance impact. But
I don't think changing a property that is usually one per item from 400
bytes to 2K would change anything.

-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-09-16 Thread Daniel Kinzler
Am 16.09.2016 um 19:38 schrieb Denny Vrandečić:
> Markus' description of the decision for the limit corresponds with mine. I
> also think that this decision can be revisited. I would still advise
> caution, due to technical issues, but I am sure that the development team
> will make a well-informed decision on this. It would be sad if valid
> use cases could not be supported due to that.

I agree, but re-considering this will have to wait until we have a better
solution for storing terms. The current mechanism, the wb_terms table, is a
massive performance bottleneck, and stuffing more data in there makes me very
uncomfortable.

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.



Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-09-16 Thread Denny Vrandečić
(in particular because I expect that character limit to have to change for
Wiktionary in Wikidata)

On Fri, Sep 16, 2016 at 10:38 AM Denny Vrandečić 
wrote:

> Markus' description of the decision for the limit corresponds with mine. I
> also think that this decision can be revisited. I would still advise
> caution, due to technical issues, but I am sure that the development team
> will make a well-informed decision on this. It would be sad if valid
> use cases could not be supported due to that.
>
> On Fri, Sep 16, 2016 at 6:51 AM Markus Kroetzsch <
> markus.kroetz...@tu-dresden.de> wrote:
>
>> On 13.09.2016 11:39, Sebastian Burgstaller wrote:
>> > Hi all,
>> >
>> > I think this topic might have been discussed many months ago. For
>> > certain data types in the chemical compound space (P233 canonical
>> > SMILES, P2017 isomeric SMILES and P234 InChI) a character limit higher
>> > than 400 would be really helpful: 1500 to 2000 chars (though I sense
>> > that this might cause problems with SPARQL). Are there any plans to
>> > implement this? In general, for quality assurance, many string
>> > property types would profit from a fixed max string length.
>>
>> FWIW, I recall that the main reason for the char limit originally was to
>> discourage the use of Wikidata for textual content. Simply put, we did
>> not want Wikipedia articles in the data. Long texts could also make
>> copyright/license issues more relevant (though, in theory, a copyrighted
>> poem could be rather short).
>>
>> However, given that we now have such a well informed community with
>> established practices and good quality checks, it seems unproblematic to
>> lift the character limit. I don't think there are major technical
>> reasons for having it. Surely, BlazeGraph (the WMF SPARQL engine) should
>> not expect texts to be short, and I would be surprised if they did. So I
>> would not expect problems on this side.
>>
>> Best,
>> Markus
>>
>>
>> >
>> > Best,
>> > Sebastian
>> >
>> > Sebastian Burgstaller-Muehlbacher, PhD
>> > Research Associate
>> > Andrew Su Lab
>> > MEM-216, Department of Molecular and Experimental Medicine
>> > The Scripps Research Institute
>> > 10550 North Torrey Pines Road
>> > La Jolla, CA 92037
>> > @sebotic


Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-09-16 Thread Denny Vrandečić
Markus' description of the decision for the limit corresponds with mine. I
also think that this decision can be revisited. I would still advise
caution, due to technical issues, but I am sure that the development team
will make a well-informed decision on this. It would be sad if valid
use cases could not be supported due to that.

On Fri, Sep 16, 2016 at 6:51 AM Markus Kroetzsch <
markus.kroetz...@tu-dresden.de> wrote:

> On 13.09.2016 11:39, Sebastian Burgstaller wrote:
> > Hi all,
> >
> > I think this topic might have been discussed many months ago. For
> > certain data types in the chemical compound space (P233 canonical
> > SMILES, P2017 isomeric SMILES and P234 InChI) a character limit higher
> > than 400 would be really helpful: 1500 to 2000 chars (though I sense
> > that this might cause problems with SPARQL). Are there any plans to
> > implement this? In general, for quality assurance, many string
> > property types would profit from a fixed max string length.
>
> FWIW, I recall that the main reason for the char limit originally was to
> discourage the use of Wikidata for textual content. Simply put, we did
> not want Wikipedia articles in the data. Long texts could also make
> copyright/license issues more relevant (though, in theory, a copyrighted
> poem could be rather short).
>
> However, given that we now have such a well informed community with
> established practices and good quality checks, it seems unproblematic to
> lift the character limit. I don't think there are major technical
> reasons for having it. Surely, BlazeGraph (the WMF SPARQL engine) should
> not expect texts to be short, and I would be surprised if they did. So I
> would not expect problems on this side.
>
> Best,
> Markus
>
>
> >
> > Best,
> > Sebastian
> >
> > Sebastian Burgstaller-Muehlbacher, PhD
> > Research Associate
> > Andrew Su Lab
> > MEM-216, Department of Molecular and Experimental Medicine
> > The Scripps Research Institute
> > 10550 North Torrey Pines Road
> > La Jolla, CA 92037
> > @sebotic


Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-09-16 Thread Markus Kroetzsch

On 13.09.2016 11:39, Sebastian Burgstaller wrote:
> Hi all,
>
> I think this topic might have been discussed many months ago. For
> certain data types in the chemical compound space (P233 canonical
> SMILES, P2017 isomeric SMILES and P234 InChI) a character limit higher
> than 400 would be really helpful: 1500 to 2000 chars (though I sense
> that this might cause problems with SPARQL). Are there any plans to
> implement this? In general, for quality assurance, many string
> property types would profit from a fixed max string length.

FWIW, I recall that the main reason for the char limit originally was to
discourage the use of Wikidata for textual content. Simply put, we did
not want Wikipedia articles in the data. Long texts could also make
copyright/license issues more relevant (though, in theory, a copyrighted
poem could be rather short).

However, given that we now have such a well informed community with
established practices and good quality checks, it seems unproblematic to
lift the character limit. I don't think there are major technical
reasons for having it. Surely, BlazeGraph (the WMF SPARQL engine) should
not expect texts to be short, and I would be surprised if they did. So I
would not expect problems on this side.

Best,
Markus

> Best,
> Sebastian
>
> Sebastian Burgstaller-Muehlbacher, PhD
> Research Associate
> Andrew Su Lab
> MEM-216, Department of Molecular and Experimental Medicine
> The Scripps Research Institute
> 10550 North Torrey Pines Road
> La Jolla, CA 92037
> @sebotic



[Wikidata] Greater than 400 char limit for Wikidata string data types

2016-09-13 Thread Sebastian Burgstaller
Hi all,

I think this topic might have been discussed many months ago. For
certain data types in the chemical compound space (P233 canonical
SMILES, P2017 isomeric SMILES and P234 InChI) a character limit higher
than 400 would be really helpful: 1500 to 2000 chars (though I sense
that this might cause problems with SPARQL). Are there any plans to
implement this? In general, for quality assurance, many string
property types would profit from a fixed max string length.

Best,
Sebastian

Sebastian Burgstaller-Muehlbacher, PhD
Research Associate
Andrew Su Lab
MEM-216, Department of Molecular and Experimental Medicine
The Scripps Research Institute
10550 North Torrey Pines Road
La Jolla, CA 92037
@sebotic
