Hello,

Thanks Alex, I'm glad you could help. Here's an update: I have ordered a new machine with more RAM and a faster processor that should arrive by tomorrow. I will use it for the Common Crawl data and the AWS solution you described in your previous mail. At present I am reading papers and publications on analysis of Common Crawl data, and I will definitely use the Warcbase tool. I understand that the Common Crawl datasets are important, and I will do everything it takes to build notebooks on them; my only concern is that they may take more time than the previous notebooks did.
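Once the machine arrives, my first pass will probably look something like the rough sketch below: a plain Spark paragraph in Zeppelin (using the sc context Zeppelin provides) that streams one downloaded WET segment line by line instead of loading it whole. The file path is just a placeholder for wherever I keep the segment, and this is before bringing Warcbase in:

// Rough sketch, not final: stream one downloaded WET segment with plain Spark.
// The path below is a placeholder for wherever the segment is stored locally.
val wetPath = "/data/commoncrawl/segment-00000.warc.wet.gz"

// sc.textFile handles the .gz compression and reads the file in partitions,
// so the whole 432 MB segment never has to sit in memory at once.
val lines = sc.textFile(wetPath)

// Each WET record carries its source URL in a "WARC-Target-URI:" header line,
// so counting those lines gives a quick idea of how many pages the segment holds.
val pageUris = lines
  .filter(_.startsWith("WARC-Target-URI:"))
  .map(_.stripPrefix("WARC-Target-URI:").trim)

println(s"Pages in this segment: ${pageUris.count()}")
pageUris.take(10).foreach(println)

Once that works, I will switch to Warcbase for the proper WARC/WET/WAT examples.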
Anish.

On Mon, Jul 4, 2016 at 6:30 PM, Alexander Bezzubov <b...@apache.org> wrote:
> Hi Anish,
>
> thanks for keeping us posted about your progress!
>
> CommonCrawl is an important dataset and it would be awesome if we could
> find a way for you to build some notebooks for it through this year's
> GSoC program.
>
> How about running Zeppelin on a single big enough node in AWS for the
> sake of this notebook?
> If you use a spot instance you can get even big instances for a really
> affordable price of $2-4 a day; you just need to make sure you persist
> notebooks on S3 [1] to avoid losing the data, and shut the instance down
> for the night.
>
> AFAIK we do not have any free AWS credits for now, even for GSoC
> students. If somebody knows a way to provide/get some - please feel
> free to chime in, I know there are some Amazonian people on the list
> :)
>
> But so far AWS spot instances are the most cost-effective solution I
> can think of. Bonus: if you host your instance in region us-east-1,
> transfer from/to S3 will be free, as that's where the CommonCrawl
> dataset lives.
>
> One more thing - please check out the awesome WarcBase library [2] built
> by the internet preservation community. I find it really helpful when
> working with web archives.
>
> On the notebook design:
> - to understand the context of this dataset better, please do some
>   research into how other people use it, what for, etc. It would be
>   great material for a blog post
> - try to provide examples of all available formats: WARC, WET, WAT
>   (maybe in the same or in different notebooks, it's up to you)
> - while using warcbase, mind that RDD persistence will not work
>   until [3] is resolved, so avoid using it for now
>
> I understand that this can be a big task, so do not worry if it
> takes time (learning AWS, etc.) - just keep us posted on your progress
> weekly and I'll be glad to help!
>
>
> 1. http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#notebook-storage-in-s3
> 2. https://github.com/lintool/warcbase
> 3. https://github.com/lintool/warcbase/issues/227
>
> On Mon, Jul 4, 2016 at 7:00 PM, anish singh <anish18...@gmail.com> wrote:
> > Hello,
> >
> > (everything outside Zeppelin)
> > I had started work on the Common Crawl datasets and tried to first have
> > a look at only the data for May 2016. Of the three formats available, I
> > chose WET (the plain text format). The data for May alone is divided
> > into 24492 segments. I downloaded only the first segment and got 432 MB
> > of data. The problem is that my laptop is a very modest machine with a
> > Core 2 Duo processor and 3 GB of RAM, so even opening the downloaded
> > data file in LibreWriter filled the RAM completely and hung the machine,
> > and bringing the data directly into Zeppelin or analyzing it inside
> > Zeppelin seems impossible. As far as I can tell, there are two ways I
> > can proceed:
> >
> > 1) Buying a new laptop with more RAM and a faster processor, OR
> > 2) Choosing another dataset
> >
> > I have no problem with either of the above, or anything else you might
> > suggest, but please let me know which way to proceed so that I can keep
> > working at speed. Meanwhile, I will read more papers and publications on
> > possibilities for analyzing Common Crawl data.
> >
> > Thanks,
> > Anish.