Hello,

(All of the work below was done outside Zeppelin.)
I have started work on the Common Crawl datasets and tried to first have a
look at only the data for May 2016. Of the three formats available, I chose
WET (plain text). The data for May alone is divided into 24,492 segments;
I downloaded just the first segment and got 432 MB of data. The problem is
that my laptop is a very modest machine (a Core 2 Duo processor with 3 GB of
RAM): even opening the downloaded file in LibreOffice Writer filled the RAM
completely and hung the machine, so bringing the data directly into Zeppelin,
or analyzing it inside Zeppelin, seems impossible. As far as I can tell,
there are two ways I can proceed:

1) Buy a new laptop with more RAM and a faster processor,   OR
2) Choose another dataset.

I have no problem with either of the above options, or anything else you
might suggest, but please let me know which way to proceed so that I can
make progress quickly. Meanwhile, I will read more papers and publications
on approaches to analyzing Common Crawl data.
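
For reference, below is a rough sketch of what I understand a first
Zeppelin/Spark paragraph over one WET segment would look like (the file path
is a placeholder, and I have not been able to run this end-to-end on this
machine). My understanding is that Spark streams the lines through its
partitions rather than loading the whole file at once:

  // Sketch only; the path is a placeholder for wherever the downloaded
  // segment lives. In a Zeppelin Spark paragraph, `sc` is the SparkContext
  // provided by the interpreter.
  val wet = sc.textFile("/path/to/first-may-2016-segment.warc.wet.gz")

  // A .gz file is read as a single partition, but Spark still streams the
  // lines instead of materializing all 432 MB at once. Each WET record
  // carries a "WARC-Target-URI:" header, so counting those lines gives the
  // number of documents in the segment.
  val numDocs = wet.filter(_.startsWith("WARC-Target-URI:")).count()
  println(s"Documents in this segment: $numDocs")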

Thanks,
Anish.
