Re: [Analytics] How to get the traces of requests to the Wikipedia site in each web server

2018-04-18 Thread Nuria Ruiz
> Is there any download link available for the *webrequest* datasets?
No, sorry, there is no download of the webrequest data, nor is it kept long
term.

As I mentioned before, the best dataset that might fit your needs is this
one: https://analytics.wikimedia.org/datasets/archive/public-datasets/analytics/caching/
which is a different dataset from webrequest and does not include the same
fields, just a subset.
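
In case it helps, here is a minimal sketch of how one could pull a file from
that directory and peek at the fields discussed in this thread. The file name
and the tab-separated layout are assumptions on my part; check the directory
listing and the README first.

import csv
import io
import urllib.request

BASE = "https://analytics.wikimedia.org/datasets/archive/public-datasets/analytics/caching/"
# Hypothetical file name -- see the directory listing / README for the real ones.
FILENAME = "sample.tsv"

# Fields mentioned in this thread (a subset of webrequest).
FIELDS = ["hashed_host_path", "uri_query", "content_type",
          "response_size", "time_firstbyte", "x_cache"]

with urllib.request.urlopen(BASE + FILENAME) as resp:
    text = io.TextIOWrapper(resp, encoding="utf-8")
    reader = csv.DictReader(text, fieldnames=FIELDS, delimiter="\t")
    for i, row in enumerate(reader):
        # Print a few columns just to confirm the layout.
        print(row["hashed_host_path"], row["response_size"], row["x_cache"])
        if i >= 9:  # peek at the first ten rows only
            break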



On Wed, Apr 18, 2018 at 8:25 AM, Ta-Yuan Hsu  wrote:

> Hi, Nuria:
>
>   I reviewed the closest data to what I am looking for, Phabricator
> T128132, from
> https://analytics.wikimedia.org/datasets/archive/public-datasets/analytics/caching/
> and the *webrequest* datasets:
> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest.
> I still have a few questions.
>
> 1. Is `hashed_host_path' (in the caching dataset) the `hostname' or
> `uri_host'? Phabricator T128132 shows both fields, but the available data
> only shows `hashed_host_path'.
>
> 2. There are 6 fields - hashed_host_path, uri_query, content_type,
> response_size, time_firstbyte, and x_cache - in the caching dataset, as
> shown in the attached screenshot. Does the caching dataset not include
> page_id? The *webrequest* dataset seems to contain page_id.
> 3. I did not find the sequence field in the caching dataset. I learned
> that `sequence' replaces the timestamp. Is `sequence' the file name of the
> downloads in the caching dataset?
> 4. Does `dt' (in the *webrequest* dataset) mean a timestamp in ISO 8601
> format? The *webrequest* dataset is probably what I am looking for, if it
> can provide per-second access traces.
>
> 5. According to the descriptions on the *webrequest* wiki page, the
> *webrequest* datasets should contain at least `hostname', `page_id', and
> `dt'. If true, the *webrequest* datasets seem to cover most of my
> requirements. Is there any download link available for the *webrequest*
> datasets?
>
> --
> Sincerely,
> TA-YUAN
>


Re: [Analytics] How to get the traces of requests to the Wikipedia site in each web server

2018-04-09 Thread Nuria Ruiz
Hello,

I do not think our downloads or API provide a dataset like the one you are
interested in. From your question I get the feeling that your assumptions
about how our system works do not match reality; Wikipedia might not be the
best fit for your study.

The closest data to what you are asking for might be this one:
https://analytics.wikimedia.org/datasets/archive/public-datasets/analytics/caching/README.
I would read this ticket to understand the inner workings of the dataset:
https://phabricator.wikimedia.org/T128132

Thanks,

Nuria



On Mon, Apr 9, 2018 at 10:48 AM, Ta-Yuan Hsu  wrote:

> Dear all,
>
>Since we are studying workloads including a sample of Wikipedia's
> traffic over a certain period of time, what we need is patterns of user
> access to web servers in a decentralized hosting environment. The access
> patterns need to include the real hits on the servers over time for one
> language. In other words, each trace record we require should contain at
> least four features - a timestamp (like MM:DD:SS), a web server id, the
> page size, and the operation (e.g., create, read, or update a page).
>
>We already reviewed some available downloaded datasets, such as
> https://dumps.wikimedia.org/other/pagecounts-raw/. However, they do not
> match our requirements. Does anyone know if it is possible to download a
> dataset with these four features from the Wikimedia website? Or should we
> use the REST API to acquire it? Thank you!
> --
> Sincerely,
> TA-YUAN
>


[Analytics] How to get the traces of requests to the Wikipedia site in each web server

2018-04-09 Thread Ta-Yuan Hsu
Dear all,

   Since we are studying workloads including a sample of Wikipedia's
traffic over a certain period of time, what we need is patterns of user
access to web servers in a decentralized hosting environment. The access
patterns need to include the real hits on the servers over time for one
language. In other words, each trace record we require should contain at
least four features - a timestamp (like MM:DD:SS), a web server id, the
page size, and the operation (e.g., create, read, or update a page).
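
   For concreteness, here is a sketch of the kind of record we have in mind;
the field names below are purely illustrative and not tied to any particular
Wikimedia dataset.

from dataclasses import dataclass
from datetime import datetime

# Illustrative layout of one trace record with the four features we need.
@dataclass
class TraceRecord:
    timestamp: datetime   # when the request hit the server
    server_id: str        # which web server handled the request
    page_size: int        # size of the requested page, in bytes
    operation: str        # "create", "read", or "update"

# Example: TraceRecord(datetime(2018, 4, 9, 10, 48, 0), "web-01", 20480, "read")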

   We already reviewed some available downloaded datasets, such as
https://dumps.wikimedia.org/other/pagecounts-raw/. However, they do not
match our requirements. Does anyone know if it is possible to download a
dataset with these four features from the Wikimedia website? Or should we
use the REST API to acquire it? Thank you!
-- 
Sincerely,
TA-YUAN