Alex, some good news! I just tried the first option you mentioned in the previous mail: increased the driver memory to 16g, reduced the caching (storage) fraction to 0.1 of total memory, and additionally trimmed the WARC content to include only three domains, and it's working (everything, including reduceByKey()). Although I had tried this a few days ago, it had not worked then.
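
For reference, a minimal sketch of what the "only three domains" trim could look like, reusing the warcbase helpers that already appear in the code further down this thread (keepValidPages, ExtractDomain, ExtractLinks); the exact filter used is not shown here, so the variable names are illustrative:

val domains = Set("www.fangraphs.com", "www.osnews.com", "www.dailytech.com")

// keep only pages from the three domains *before* extracting links,
// so the flatMap no longer explodes all ~2GB of records
val trimmedLinks = mayBegData.keepValidPages()
  .filter(r => domains.contains(ExtractDomain(r.getUrl)))
  .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
  .map(r => (ExtractDomain(r._1), ExtractDomain(r._2)))

val linkWtMap = trimmedLinks.map(r => (r, 1)).reduceByKey((x, y) => x + y)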
I even understood the core problem: the original RDD (~2GB) contained exactly 53307 elements, and when I ran flatMap(r => ExtractLinks(r.getUrl, r.getContentString)) on this RDD it resulted in an explosion of data extracted from that many elements (web pages), which the available memory was perhaps unable to handle. This also means that the rest of the analysis in the notebook must be done on domains extracted from the original WARC files, so as to reduce the size of the data to be processed. In case more RAM is needed, I will try an m4.2xlarge (32GB) instance. Thrilled to have it working after struggling for so many days; now I can proceed with the notebook.

Thanks again,
Anish.

On Wed, Jul 20, 2016 at 7:08 AM, Alexander Bezzubov <b...@apache.org> wrote:

> Hi Anish,

> thank you for sharing your progress, and I totally know what you mean - that's the expected pain of working with real BigData.

> I would advise conducting a series of experiments:

> *1 moderate machine*, Spark 1.6 in local mode, 1 WARC input file (1GB)
> - Spark in local mode is a single JVM process, so fine-tune it and make sure it uses ALL available memory (i.e. 16GB)
> - We are not going to use in-memory caching, so the storage part can be turned off [1] and [2]
> - AFAIK DataFrames use memory more efficiently than RDDs, but I'm not sure if we can benefit from that here
> - Start with something simple, like `val mayBegLinks = mayBegData.keepValidPages().count()`, and make sure it works
> - Proceed further until a few more complex queries work

> *Cluster of N machines*, Spark 1.6 in standalone cluster mode
> - process a fraction of the whole dataset, i.e. 1 segment

> I know that is not easy, but it's worth trying for 1 more week to see if the approach outlined above works.
> Last, but not least - do not hesitate to reach out to the CommonCrawl community [3] for advice; there are people using Apache Spark there as well.

> Please keep us posted!

> 1. http://spark.apache.org/docs/latest/tuning.html#memory-management-overview
> 2. http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
> 3. https://groups.google.com/forum/#!forum/common-crawl

> --
> Alex

> On Wed, Jul 20, 2016 at 2:27 AM, anish singh <anish18...@gmail.com> wrote:

> > Hello,

> > The last two weeks have been tough and full of learning. The code in the previous mail, which performed only a simple transformation and reduceByKey() to count similar domain links, did not work even on the first segment (1005 MB) of data. So I studied and read extensively on the web (blogs from Cloudera, Databricks and Stack Overflow, plus books on Spark) and tried all the options and configurations for memory and performance tuning, but the code did not run. My current SPARK_SUBMIT_OPTIONS are set to "--driver-memory 9g --driver-java-options -XX:+UseG1GC -XX:+UseCompressedOops --conf spark.storage.memoryFraction=0.1" and even this does not work. Even simple operations such as rdd.count() after the transformations in the previous mail do not work. All this is on an m4.xlarge machine.

> > Moreover, in trying to set up a standalone cluster on a single machine by following the instructions in the book 'Learning Spark', I messed with the '~/.ssh/authorized_keys' file, which locked me out of the instance, so I had to terminate it and start all over again after losing all the work done in one week.
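
As a concrete reading of experiment 1 in the plan quoted above (one moderate machine, Spark 1.6 in local mode, caching effectively turned off), a minimal first Zeppelin paragraph might look like the sketch below; the flag values simply mirror the ones already used in this thread, and the variable names are illustrative:

// assumes the interpreter was started with something like
//   SPARK_SUBMIT_OPTIONS="--driver-memory 16g --conf spark.storage.memoryFraction=0.1"
// so the single local JVM gets the whole heap and almost nothing is reserved for caching

// 1. the simplest possible query first
val validPageCount = mayBegData.keepValidPages().count()

// 2. only once that works, move on to the heavier link extraction
val rawLinkCount = mayBegData.keepValidPages()
  .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
  .count()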
> > Today I performed a comparison of memory and CPU load values, using the size of the data and the machine configurations, between two conditions (when I worked on my local machine vs. a single m4.xlarge instance), where

> > memory load = (data size) / (memory available for processing),
> > cpu load = (data size) / (cores available for processing).

> > The results of the comparison indicate that, for this amount of data, the AWS instance is 100 times more constrained than the analysis I previously did on my machine (for the calculations, please see sheet [0]). This has completely stalled work, as I'm unable to perform any further operations on the data sets. Further, choosing another instance (such as one with 32 GiB) may also not be sufficient (as per the calculations in [0]). Please let me know if I'm missing something, or how to proceed with this.

> > [0]. https://drive.google.com/open?id=0ByXTtaL2yHBuYnJSNGt6T2U2RjQ

> > Thanks,
> > Anish.

> > On Tue, Jul 12, 2016 at 12:35 PM, anish singh <anish18...@gmail.com> wrote:

> > > Hello,

> > > I was able to set up Zeppelin with Spark on an AWS EC2 m4.xlarge instance a few days ago. In designing the notebook, I was trying to visualize the link structure with the following code:

> > > val mayBegLinks = mayBegData.keepValidPages()
> > >   .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
> > >   .map(r => (ExtractDomain(r._1), ExtractDomain(r._2)))
> > >   .filter(r => (r._1.equals("www.fangraphs.com") || r._1.equals("www.osnews.com") || r._1.equals("www.dailytech.com")))

> > > val linkWtMap = mayBegLinks.map(r => (r, 1)).reduceByKey((x, y) => x + y)
> > > linkWtMap.toDF().registerTempTable("LnkWtTbl")

> > > where 'mayBegData' is roughly 2GB of WARC data for the first two segments of May.
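
Since Alex notes above that DataFrames tend to use memory more efficiently than RDDs, one possible variation on the toDF()/registerTempTable step is to do the aggregation through the Spark 1.6 DataFrame API; this is only a sketch, and the column names ("src", "dst", "Weight") are illustrative:

import sqlContext.implicits._   // Zeppelin's SQLContext, needed for toDF on an RDD of tuples

val linkDF = mayBegLinks.toDF("src", "dst")            // (source domain, target domain)
val linkWtDF = linkDF.groupBy("src", "dst").count()    // the "count" column holds the weight
val linkWtTbl = linkWtDF.withColumnRenamed("count", "Weight")
linkWtTbl.registerTempTable("LnkWtTbl")
// the %sql paragraph can then read: select src as Links, dst, Weight from LnkWtTbl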
> > > This paragraph runs smoothly, but in the next paragraph, using %sql and the following statement:

> > > select W._1 as Links, W._2 as Weight from LnkWtTbl W

> > > I get errors which are always java.lang.OutOfMemoryError (GC overhead limit exceeded or Java heap space); the most recent one is the following:

> > > org.apache.thrift.transport.TTransportException
> > >   at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
> > >   at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
> > >   at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
> > >   at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
> > >   at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
> > >   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
> > >   at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:261)
> > >   at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:245)
> > >   at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:312)
> > >   at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
> > >   at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:271)
> > >   at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
> > >   at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
> > >   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> > >   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > >   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> > >   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> > >   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > >   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > >   at java.lang.Thread.run(Thread.java:745)

> > > I just wanted to know: even with an m4.xlarge instance, is it not possible to process such a large amount (~2GB) of data? The above code is relatively simple, I guess. This is restricting the flexibility with which the notebook can be designed. Please provide some hints/suggestions, since I've been stuck on this since yesterday.

> > > Thanks,
> > > Anish.

> > > On Tue, Jul 5, 2016 at 12:28 PM, Alexander Bezzubov <b...@apache.org> wrote:

> > > > That sounds great, Anish!
> > > > Congratulations on getting a new machine.

> > > > No worries, please take your time and keep us posted on your exploration!
> > > > Quality is more important than quantity here.

> > > > --
> > > > Alex

> > > > On Mon, Jul 4, 2016 at 10:40 PM, anish singh <anish18...@gmail.com> wrote:

> > > > > Hello,

> > > > > Thanks, Alex, I'm so glad that you helped. Here's an update: I've ordered a new machine with more RAM and a better processor that should arrive by tomorrow.
> > > > > I will attempt to use it for the Common Crawl data and the AWS solution that you provided in the previous mail. I'm presently reading papers and publications regarding analysis of Common Crawl data. The warcbase tool will definitely be used. I understand that the Common Crawl datasets are important and I will do everything it takes to make notebooks on them; the only worry is that it may take more time than the previous notebooks.

> > > > > Anish.

> > > > > On Mon, Jul 4, 2016 at 6:30 PM, Alexander Bezzubov <b...@apache.org> wrote:

> > > > > > Hi Anish,

> > > > > > thanks for keeping us posted about your progress!

> > > > > > CommonCrawl is an important dataset and it would be awesome if we could find a way for you to build some notebooks for it through this year's GSoC program.

> > > > > > How about running Zeppelin on a single big-enough node in AWS for the sake of this notebook?
> > > > > > If you use spot instances you can get even big instances for a really affordable price of 2-4$ a day; just make sure you persist your notebooks on S3 [1] to avoid losing the data, and shut the instance down for the night.

> > > > > > AFAIK we do not have any free AWS credits for now, even for GSoC students. If somebody knows a way to provide/get some - please feel free to chime in; I know there are some Amazonian people on the list :)

> > > > > > But so far AWS spot instances are the most cost-effective solution I can think of. Bonus: if you host your instance in region us-east-1, transfer from/to S3 will be free, as that's where the CommonCrawl dataset lives.

> > > > > > One more thing - please check out the awesome WarcBase library [2] built by the internet preservation community. I find it really helpful when working with web archives.

> > > > > > On the notebook design:
> > > > > > - to understand the context of this dataset better, please do some research into how other people use it, and what for - that would be great material for a blog post
> > > > > > - try to provide examples of all available formats: WARC, WET, WAT (maybe in the same or in different notebooks, it's up to you)
> > > > > > - while using warcbase, mind that RDD persistence will not work until [3] is resolved, so avoid using it for now

> > > > > > I understand that this can be a big task, so do not worry if it takes time (learning AWS, etc.) - just keep us posted on your progress weekly and I'll be glad to help!

> > > > > > 1. http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#notebook-storage-in-s3
> > > > > > 2. https://github.com/lintool/warcbase
> > > > > > 3. https://github.com/lintool/warcbase/issues/227

> > > > > > On Mon, Jul 4, 2016 at 7:00 PM, anish singh <anish18...@gmail.com> wrote:

> > > > > > > Hello,

> > > > > > > (everything outside Zeppelin)
> > > > > > > I have started work on the Common Crawl datasets and tried to first have a look at only the data for May 2016. Out of the three formats available, I chose WET (the plain text format).
> > > > > > > The data for May alone is divided into segments, and there are 24492 such segments. I downloaded only the first segment for May and got 432MB of data. Now the problem is that my laptop is a very modest machine with a Core 2 Duo processor and 3GB of RAM, such that even opening the downloaded data file in LibreOffice Writer filled the RAM completely and hung the machine; bringing the data directly into Zeppelin or analyzing it inside Zeppelin seems impossible. As far as I know, there are two ways in which I can proceed:

> > > > > > > 1) Buying a new laptop with more RAM and a better processor, OR
> > > > > > > 2) Choosing another dataset

> > > > > > > I have no problem with either of the above, or anything else that you might suggest, but please let me know which way to proceed so that I can work at speed. Meanwhile, I will read more papers and publications on the possibilities of analyzing Common Crawl data.

> > > > > > > Thanks,
> > > > > > > Anish.