Re: [GSoC - 2016][Zeppelin Notebooks] Issues with Common Crawl Datasets

Alexander Bezzubov Mon, 04 Jul 2016 06:01:44 -0700

Hi Anish,

thanks for keeping us posted on your progress!

CommonCrawl is an important dataset, and it would be awesome if we
could find a way for you to build some notebooks for it through this
year's GSoC program.

How about running Zeppelin on a single big enough node in AWS for the
sake of this notebook?
If you use a spot instance, you can get even big instances for a really
affordable price of $2-4 a day; just make sure you persist your
notebooks on S3 [1] to avoid losing the data, and shut the instance
down for the night.

AFAIK we do not have any free AWS credits for now, even for GSoC
students. If somebody knows a way to provide/get some, please feel
free to chime in; I know there are some Amazonian people on the list
:)

But so far AWS spot instances are the most cost-effective solution I
can think of. Bonus: if you host your instance in region us-east-1,
transfer from/to S3 will be free, as that's where the CommonCrawl
dataset lives.

One more thing: please check out the awesome WarcBase library [2] built
by the internet preservation community. I find it really helpful when
working with web archives.

On the notebook design:
 - to understand the context of this dataset better, please do some
research on how other people use it, what for, etc.
   That would be great material for the blog post
 - try to provide examples of all available formats: WARC, WET, WAT
(maybe in the same or in different notebooks, it's up to you)
 - while using warcbase, mind that RDD persistence will not work
until [3] is resolved, so avoid using it for now
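
To get a feel for the formats before bringing in Spark or warcbase, here
is a minimal sketch of reading a WET file one record at a time in plain
Python, so the whole file never has to sit in RAM. It is not warcbase's
or Common Crawl's API: the function names are made up for illustration,
and it assumes the usual WARC record layout (a "WARC/..." version line,
"Name: value" headers, a blank line, then Content-Length bytes of
payload) and that the extracted text lives in records whose WARC-Type is
"conversion".

```python
import gzip
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record of an uncompressed WARC-style stream.

    Hypothetical helper for illustration; assumes every record carries a
    Content-Length header.
    """
    while True:
        line = stream.readline()
        if not line:              # end of stream
            return
        if not line.strip():      # blank separator lines between records
            continue
        if not line.startswith(b"WARC/"):
            raise ValueError("unexpected line: %r" % line)

        # Header block: "Name: value" lines up to the first blank line.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()

        # Payload: exactly Content-Length bytes.
        payload = stream.read(int(headers.get("Content-Length", "0")))
        yield headers, payload


def iter_wet_texts(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (url, text) pairs from a gzipped WET file, one record at a time."""
    with gzip.open(path, "rb") as stream:
        for headers, payload in iter_warc_records(stream):
            # Skip the leading warcinfo record; keep only extracted-text records.
            if headers.get("WARC-Type") == "conversion":
                url = headers.get("WARC-Target-URI", "")
                yield url, payload.decode("utf-8", "replace")
```

Something like `for url, text in iter_wet_texts("segment.warc.wet.gz")`
(the file name is a placeholder) would then let you look at a handful of
pages without opening the file in an editor.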

I understand that this can be a big task, so do not worry if it
takes time (learning AWS, etc.) - just keep us posted on your progress
weekly and I'll be glad to help!

 1. http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#notebook-storage-in-s3
 2. https://github.com/lintool/warcbase
 3. https://github.com/lintool/warcbase/issues/227

On Mon, Jul 4, 2016 at 7:00 PM, anish singh <anish18...@gmail.com> wrote:
> Hello,
>
> (everything outside Zeppelin)
> I had started work on the Common Crawl datasets, and tried to first have a
> look at only the data for May 2016. Out of the three formats available, I
> chose WET (the plain text format). The data for May alone is divided into
> 24492 segments. I downloaded only the first segment for May and got 432MB
> of data. Now the problem is that my laptop is a very modest machine with a
> Core 2 Duo processor and 3GB of RAM, such that even opening the downloaded
> data file in LibreWriter filled the RAM completely and hung the machine, so
> bringing the data directly into Zeppelin or analyzing it inside Zeppelin
> seems impossible. As far as I know, there are two ways in which I can
> proceed:
>
> 1) Buying a new laptop with more RAM and processor.   OR
> 2) Choosing another dataset
>
> I have no problem with either of the above ways or anything that you might
> suggest but please let me know which way to proceed so that I may be able
> to work in speed. Meanwhile, I will read more papers and publications on
> possibilities of analyzing common crawl data.
>
> Thanks,
> Anish.
