Re: [GSoC - 2016][Zeppelin Notebooks] Issues with Common Crawl Datasets

Alexander Bezzubov Sun, 24 Jul 2016 19:48:07 -0700

That sounds great Anish!

Please keep it up :)


--
Alex

On Wed, Jul 20, 2016, 18:07 anish singh <anish18...@gmail.com> wrote:

> Alex, some good news!
>
> I just tried the first option you mentioned in the previous mail: I increased
> the driver memory to 16g, reduced caching space to 0.1% of total memory, and
> additionally trimmed the WARC content to include only three domains, and it's
> working (everything, including reduceByKey()). I had tried this a few days
> ago, but it had not worked then.
>
> I even understood the core problem: the original RDD (~2GB) contained
> exactly 53307 elements, and when I ran
> 'flatMap(r => ExtractLinks(r.getUrl(), r.getContentString()))' on this RDD, it
> resulted in an explosion of data extracted from these many elements (web pages),
> which the available memory was perhaps unable to handle. This also means
> that the rest of the analysis in the notebook must be done on domains
> extracted from the original WARC files, so that the size of the data to be
> processed is reduced. In case more RAM is needed, I will try to use an
> m4.2xlarge (32GB) instance.
>
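
A minimal sketch of the ordering described above: restrict the pages to the domains of interest first, and only then run the one-to-many link extraction. It uses plain Scala collections in place of an RDD so that it runs anywhere; `Page`, `extractDomain` and `extractLinks` are hypothetical stand-ins and not the warcbase API.

```scala
import java.net.URI
import scala.util.Try

// Plain collections stand in for the RDD; the same filter-then-flatMap order
// applies to an RDD. All names below are illustrative assumptions.
object FilterEarlySketch {
  final case class Page(url: String, content: String)

  val wantedDomains: Set[String] =
    Set("www.fangraphs.com", "www.osnews.com", "www.dailytech.com")

  // Host part of a URL, or "" when the URL has no host or cannot be parsed.
  def extractDomain(url: String): String =
    Try(Option(new URI(url).getHost).getOrElse("")).getOrElse("")

  // Toy link extractor: every whitespace-separated token starting with "http".
  def extractLinks(page: Page): Seq[(String, String)] =
    page.content.split("\\s+").toSeq
      .filter(_.startsWith("http"))
      .map(target => (page.url, target))

  // Drop unwanted pages BEFORE the expensive one-to-many step, then count
  // (source domain, target domain) pairs, as reduceByKey(_ + _) would.
  def linkCounts(pages: Seq[Page]): Map[(String, String), Int] =
    pages
      .filter(p => wantedDomains.contains(extractDomain(p.url)))
      .flatMap(extractLinks)
      .map { case (src, dst) => (extractDomain(src), extractDomain(dst)) }
      .groupBy(identity)
      .map { case (pair, hits) => (pair, hits.size) }
}
```

Filtering on the source URL before `flatMap` means the link extractor only ever sees pages from the wanted domains, so the intermediate data stays proportional to those pages rather than to the whole input.
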
> Thrilled to have it working after struggling for so many days, so now I can
> proceed with the notebook.
>
> Thanks again,
> Anish.
>
> On Wed, Jul 20, 2016 at 7:08 AM, Alexander Bezzubov <b...@apache.org>
> wrote:
>
> > Hi Anish,
> >
> > thank you for sharing your progress. I totally know what you mean:
> > that's an expected pain of working with real BigData.
> >
> > I would advise conducting a series of experiments:
> >
> > *1 moderate machine*, Spark 1.6 in local mode, 1 WARC input file (1GB)
> >  - Spark in local mode is a single JVM process, so fine-tune it and make
> >    sure it uses ALL available memory (i.e. 16GB)
> >  - We are not going to use in-memory caching, so the storage part can be
> >    turned off [1] and [2]
> >  - AFAIK DataFrames use memory more efficiently than RDDs, but I am not
> >    sure if we can benefit from that here
> >  - Start with something simple, like `val mayBegLinks =
> >    mayBegData.keepValidPages().count()` and make sure it works
> >  - Proceed further until a few more complex queries work
> >
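
One way to check the first point is to ask the JVM itself how much heap it may use. The sketch below is plain JVM code with no Spark dependency; the object and method names are illustrative. Since local mode is a single JVM process, running the same expression inside a notebook paragraph should report roughly the configured driver memory.

```scala
object HeapCheck {
  // Maximum heap the current JVM may grow to, in GiB.
  def maxHeapGiB: Double =
    Runtime.getRuntime.maxMemory.toDouble / (1024L * 1024L * 1024L)

  def main(args: Array[String]): Unit =
    println(f"max heap: $maxHeapGiB%.1f GiB")
}
```

If the reported value is far below what was requested, the memory setting probably never reached the JVM that runs the job.
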
> > *Cluster of N machines*, Spark 1.6 in standalone cluster mode
> >  - process a fraction of the whole dataset, i.e. 1 segment
> >
> >
> > I know that is not easy, but it's worth trying for 1 more week to see if
> > the approach outlined above works.
> > Last but not least, do not hesitate to reach out to the CommonCrawl
> > community [3] for advice; there are people using Apache Spark there as well.
> >
> > Please keep us posted!
> >
> >  1. http://spark.apache.org/docs/latest/tuning.html#memory-management-overview
> >  2. http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
> >  3. https://groups.google.com/forum/#!forum/common-crawl
> >
> > --
> > Alex
> >
> >
> > On Wed, Jul 20, 2016 at 2:27 AM, anish singh <anish18...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > The last two weeks have been tough and full of learning. The code in the
> > > previous mail, which performed only a simple transformation and
> > > reduceByKey() to count similar domain links, did not work even on the
> > > first segment (1005 MB) of data. So I studied and read extensively on the
> > > web (blogs from Cloudera and Databricks, and Stack Overflow) and books on
> > > Spark, and tried all the options and configurations for memory and
> > > performance tuning, but the code did not run. My current
> > > SPARK_SUBMIT_OPTIONS are set to
> > > "--driver-memory 9g --driver-java-options -XX:+UseG1GC
> > > -XX:+UseCompressedOops --conf spark.storage.memoryFraction=0.1" and even
> > > this does not work. Even simple operations such as rdd.count() after the
> > > transformations in the previous mail do not work. All this is on an
> > > m4.xlarge machine.
> > >
> > > Moreover, while trying to set up a standalone cluster on a single machine
> > > by following instructions in the book 'Learning Spark', I messed up the
> > > '~/.ssh/authorized_keys' file, which locked me out of the instance, so I
> > > had to terminate it and start all over again after losing all the work
> > > done in one week.
> > >
> > > Today, I performed a comparison of memory and CPU load values using the
> > > size of data and the machine configurations between two conditions:
> > > (when I worked on my local machine) vs. (m4.xlarge single instance), where
> > >
> > > memory load = (data size) / (memory available for processing),
> > > cpu load = (data size) / (cores available for processing)
> > >
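
The two ratios defined above can be written out as follows. This is only a sketch: the names are illustrative, and the example inputs (2 GB of WARC, the 9g driver memory from SPARK_SUBMIT_OPTIONS, the 4 vCPUs of an m4.xlarge) are assumptions chosen for illustration, not the figures from sheet [0].

```scala
object LoadRatios {
  final case class Machine(memoryGb: Double, cores: Int)

  // memory load = (data size) / (memory available for processing)
  def memoryLoad(dataGb: Double, m: Machine): Double = dataGb / m.memoryGb

  // cpu load = (data size) / (cores available for processing)
  def cpuLoad(dataGb: Double, m: Machine): Double = dataGb / m.cores

  def main(args: Array[String]): Unit = {
    // Illustrative inputs only, not the values from the spreadsheet.
    val aws = Machine(memoryGb = 9.0, cores = 4)
    println(f"memory load = ${memoryLoad(2.0, aws)}%.3f, cpu load = ${cpuLoad(2.0, aws)}%.3f")
  }
}
```

Comparing the same two ratios for two machines gives the relative constraint between them.
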
> > > The results of the comparison indicate that, with this amount of data, the
> > > AWS instance is 100 times more constrained than the analysis that I
> > > previously did on my machine (for calculations, please see sheet [0]).
> > > This has completely stalled work, as I'm unable to perform any further
> > > operations on the data sets. Further, choosing another instance (such as
> > > 32 GiB) may also not be sufficient (as per the calculations in [0]).
> > > Please let me know if I'm missing something or how to proceed with this.
> > >
> > > [0]. https://drive.google.com/open?id=0ByXTtaL2yHBuYnJSNGt6T2U2RjQ
> > >
> > > Thanks,
> > > Anish.
> > >
> > >
> > >
> > > On Tue, Jul 12, 2016 at 12:35 PM, anish singh <anish18...@gmail.com>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I was able to set up Zeppelin with Spark on an AWS EC2 m4.xlarge
> > > > instance a few days ago. In designing the notebook, I was trying to
> > > > visualize the link structure with the following code:
> > > >
> > > > val mayBegLinks = mayBegData.keepValidPages()
> > > >   .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
> > > >   .map(r => (ExtractDomain(r._1), ExtractDomain(r._2)))
> > > >   .filter(r => (r._1.equals("www.fangraphs.com") ||
> > > >     r._1.equals("www.osnews.com") ||
> > > >     r._1.equals("www.dailytech.com")))
> > > >
> > > > val linkWtMap = mayBegLinks.map(r => (r, 1)).reduceByKey((x, y) => x + y)
> > > > linkWtMap.toDF().registerTempTable("LnkWtTbl")
> > > >
> > > > where 'mayBegData' is some 2GB of WARC for the first two segments of May.
> > > > This paragraph runs smoothly, but in the next paragraph, using %sql and
> > > > the following statement:
> > > >
> > > > select W._1 as Links, W._2 as Weight from LnkWtTbl W
> > > >
> > > > I get errors which are always java.lang.OutOfMemoryError, because either
> > > > the garbage collection limit or the heap space was exceeded, and the
> > > > most recent one is the following:
> > > >
> > > > org.apache.thrift.transport.TTransportException
> > > >   at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
> > > >   at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
> > > >   at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
> > > >   at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
> > > >   at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
> > > >   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
> > > >   at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:261)
> > > >   at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:245)
> > > >   at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:312)
> > > >   at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
> > > >   at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:271)
> > > >   at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
> > > >   at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
> > > >   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> > > >   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > > >   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> > > >   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> > > >   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > > >   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > > >   at java.lang.Thread.run(Thread.java:745)
> > > >
> > > > I just wanted to know whether, even with an m4.xlarge instance, it is not
> > > > possible to process such a large (~2GB) amount of data, because the above
> > > > code is relatively simple, I guess. This is restricting the flexibility
> > > > with which the notebook can be designed. Please provide some
> > > > hints/suggestions, since I've been stuck on this since yesterday.
> > > >
> > > > Thanks,
> > > > Anish.
> > > >
> > > >
> > > > On Tue, Jul 5, 2016 at 12:28 PM, Alexander Bezzubov <b...@apache.org>
> > > > wrote:
> > > >
> > > >> That sounds great, Anish!
> > > >> Congratulations on getting a new machine.
> > > >>
> > > >> No worries, please take your time and keep us posted on your exploration!
> > > >> Quality is more important than quantity here.
> > > >>
> > > >> --
> > > >> Alex
> > > >>
> > > >> On Mon, Jul 4, 2016 at 10:40 PM, anish singh <anish18...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > Hello,
> > > >> >
> > > >> > Thanks Alex, I'm so glad that you helped. Here's an update: I've ordered
> > > >> > a new machine with more RAM and processing power that should come by
> > > >> > tomorrow. I will attempt to use it for the common crawl data and the AWS
> > > >> > solution that you provided in the previous mail. I'm presently reading
> > > >> > papers and publications regarding analysis of common crawl data. The
> > > >> > Warcbase tool will definitely be used. I understand that common crawl
> > > >> > datasets are important and I will do everything it takes to make
> > > >> > notebooks on them; the only concern is that it may take more time than
> > > >> > the previous notebooks.
> > > >> >
> > > >> > Anish.
> > > >> >
> > > >> > On Mon, Jul 4, 2016 at 6:30 PM, Alexander Bezzubov <b...@apache.org> wrote:
> > > >> >
> > > >> > > Hi Anish,
> > > >> > >
> > > >> > > thanks for keeping us posted about your progress!
> > > >> > >
> > > >> > > CommonCrawl is an important dataset and it would be awesome if we could
> > > >> > > find a way for you to build some notebooks for it through this
> > > >> > > year's GSoC program.
> > > >> > >
> > > >> > > How about running Zeppelin on a single big enough node in AWS for the
> > > >> > > sake of this notebook?
> > > >> > > If you use a spot instance you could get even big instances for a really
> > > >> > > affordable price of 2-4$ a day; you just need to make sure you persist
> > > >> > > notebooks on S3 [1] to avoid losing the data, and shut the instance down
> > > >> > > for the night.
> > > >> > >
> > > >> > > AFAIK we do not have any free AWS credits for now, even for
> > > >> > > GSoC students. If somebody knows a way to provide/get some, please feel
> > > >> > > free to chime in; I know there are some Amazonian people on the list
> > > >> > > :)
> > > >> > >
> > > >> > > But so far AWS spot instances are the most cost-effective solution I
> > > >> > > can think of. Bonus: if you host your instance in region us-east-1,
> > > >> > > transfer from/to S3 will be free, as that's where the CommonCrawl
> > > >> > > dataset lives.
> > > >> > >
> > > >> > > One more thing: please check out the awesome WarcBase library [2] built
> > > >> > > by the internet preservation community. I find it really helpful when
> > > >> > > working with web archives.
> > > >> > >
> > > >> > > On the notebook design:
> > > >> > >  - to understand the context of this dataset better, please do some
> > > >> > >    research on how other people use it, what for, etc.
> > > >> > >    That would be great material for the blog post
> > > >> > >  - try to provide examples of all available formats: WARC, WET, WAT
> > > >> > >    (maybe in the same or different notebooks, it's up to you)
> > > >> > >  - while using warcbase, mind that RDD persistence will not work
> > > >> > >    until [3] is resolved, so avoid using it for now
> > > >> > >
> > > >> > > I understand that this can be a big task, so do not worry if that
> > > >> > > takes time (learning AWS, etc.); just keep us posted on your progress
> > > >> > > weekly and I'll be glad to help!
> > > >> > >
> > > >> > >
> > > >> > >  1. http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#notebook-storage-in-s3
> > > >> > >  2. https://github.com/lintool/warcbase
> > > >> > >  3. https://github.com/lintool/warcbase/issues/227
> > > >> > >
> > > >> > > On Mon, Jul 4, 2016 at 7:00 PM, anish singh <anish18...@gmail.com> wrote:
> > > >> > > > Hello,
> > > >> > > >
> > > >> > > > (everything outside Zeppelin)
> > > >> > > > I had started work on the common crawl datasets, and tried to first
> > > >> > > > have a look at only the data for May 2016. Out of the three formats
> > > >> > > > available, I chose WET (the plain text format). The data for May alone
> > > >> > > > is divided into segments and there are 24492 such segments. I
> > > >> > > > downloaded only the first segment for May and got 432MB of data. Now
> > > >> > > > the problem is that my laptop is a very modest machine with a Core 2
> > > >> > > > Duo processor and 3GB of RAM, such that even opening the downloaded
> > > >> > > > data file in LibreWriter filled the RAM completely and hung the
> > > >> > > > machine, and bringing the data directly into Zeppelin or analyzing it
> > > >> > > > inside Zeppelin seems impossible. As far as I know, there are two ways
> > > >> > > > in which I can proceed:
> > > >> > > >
> > > >> > > > 1) Buying a new laptop with more RAM and processor.   OR
> > > >> > > > 2) Choosing another dataset
> > > >> > > >
> > > >> > > > I have no problem with either of the above ways or anything that you
> > > >> > > > might suggest, but please let me know which way to proceed so that I
> > > >> > > > may be able to work at speed. Meanwhile, I will read more papers and
> > > >> > > > publications on possibilities of analyzing common crawl data.
> > > >> > > >
> > > >> > > > Thanks,
> > > >> > > > Anish.
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>
