Hello,

The last two weeks have been tough and full of learning. The code in the
previous mail, which performed only a simple transformation followed by
reduceByKey() to count domain-to-domain links, did not work even on the
first segment (1005 MB) of data. I studied and read extensively on the web
(the Cloudera and Databricks blogs, Stack Overflow) and in books on Spark,
and tried all the memory and performance tuning options and configurations,
but the code still did not run. My current SPARK_SUBMIT_OPTIONS are set to
"--driver-memory 9g --driver-java-options -XX:+UseG1GC
-XX:+UseCompressedOops --conf spark.storage.memoryFraction=0.1", and even
this does not work. Even simple operations such as rdd.count() after the
transformations in the previous mail do not complete. All of this is on an
m4.xlarge machine.
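
For reference, here is a minimal sketch of how those options are set,
assuming the usual conf/zeppelin-env.sh of the Zeppelin install (the inner
quoting of the two JVM flags is my own guess at how to pass both through):

export SPARK_SUBMIT_OPTIONS="--driver-memory 9g --driver-java-options '-XX:+UseG1GC -XX:+UseCompressedOops' --conf spark.storage.memoryFraction=0.1"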

Moreover, while trying to set up a standalone cluster on a single machine
by following the instructions in the book 'Learning Spark', I messed up the
'~/.ssh/authorized_keys' file, which locked me out of the instance, so I
had to terminate it and start all over again after losing a week's work.
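
For the record, the edit to '~/.ssh/authorized_keys' was for the
passwordless SSH to localhost that the standalone start scripts need; the
safer pattern, which I will follow next time, is roughly the following
(my own sketch, not taken from the book):

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa          # create a key pair with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   # append; never overwrite with '>'
chmod 600 ~/.ssh/authorized_keys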

Today, I compared the memory and CPU load, computed from the data size and
the machine configuration, between two settings: my earlier work on my
local machine vs. a single m4.xlarge instance, where

memory load = (data size) / (memory available for processing)
cpu load = (data size) / (cores available for processing)

The results of the comparison indicate that, given the amount of data, the
AWS instance is about 100 times more constrained than the analysis I
previously did on my own machine (for the calculations, please see the
sheet at [0]). This has completely stalled the work, as I'm unable to
perform any further operations on the data sets. Further, choosing another
instance type (such as one with 32 GiB of memory) may also not be
sufficient (as per the calculations in [0]). Please let me know if I'm
missing something, or how to proceed.
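
In code form, the two metrics are simply the following (a sketch only; the
actual input figures and the resulting ratios are in the sheet at [0]):

// the two load metrics used in the comparison above
def memoryLoad(dataSizeMB: Double, memAvailableMB: Double): Double =
  dataSizeMB / memAvailableMB

def cpuLoad(dataSizeMB: Double, coresAvailable: Int): Double =
  dataSizeMB / coresAvailable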

[0]. https://drive.google.com/open?id=0ByXTtaL2yHBuYnJSNGt6T2U2RjQ

Thanks,
Anish.



On Tue, Jul 12, 2016 at 12:35 PM, anish singh <anish18...@gmail.com> wrote:

> Hello,
>
> I had been able to set up Zeppelin with Spark on an AWS EC2 m4.xlarge
> instance a few days ago. In designing the notebook, I was trying to
> visualize the link structure with the following code:
>
> val mayBegLinks = mayBegData.keepValidPages()
>   .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
>   .map(r => (ExtractDomain(r._1), ExtractDomain(r._2)))
>   .filter(r => r._1.equals("www.fangraphs.com")
>             || r._1.equals("www.osnews.com")
>             || r._1.equals("www.dailytech.com"))
>
> val linkWtMap = mayBegLinks.map(r => (r, 1)).reduceByKey((x, y) => x + y)
> linkWtMap.toDF().registerTempTable("LnkWtTbl")
>
> where 'mayBegData' is some 2 GB of WARC data for the first two segments
> of May. This paragraph runs smoothly, but in the next paragraph, using
> %sql with the following statement:
>
> select W._1 as Links, W._2 as Weight from LnkWtTbl W
>
> I get errors, which are always java.lang.OutOfMemoryError because the GC
> overhead limit or the heap space was exceeded, and the most recent one is
> the following:
>
> org.apache.thrift.transport.TTransportException at
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) at
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
> at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69) at
> org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:261)
> at
> org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:245)
> at
> org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:312)
> at
> org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
> at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:271) at
> org.apache.zeppelin.scheduler.Job.run(Job.java:176) at
> org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> I just wanted to ask whether, even with an m4.xlarge instance, it is
> really not possible to process this much (~2 GB) data, given that the
> above code is relatively simple, I think. This is restricting the
> flexibility with which the notebook can be designed. Please provide some
> hints/suggestions, since I have been stuck on this since yesterday.
>
> Thanks,
> Anish.
>
>
> On Tue, Jul 5, 2016 at 12:28 PM, Alexander Bezzubov <b...@apache.org>
> wrote:
>
>> That sounds great, Anish!
>> Congratulations on getting a new machine.
>>
>> No worries, please take your time and keep us posted on your exploration!
>> Quality is more important than quantity here.
>>
>> --
>> Alex
>>
>> On Mon, Jul 4, 2016 at 10:40 PM, anish singh <anish18...@gmail.com>
>> wrote:
>>
>> > Hello,
>> >
>> > Thanks Alex, I'm so glad that you helped. Here's an update: I've
>> > ordered a new machine with more RAM and a better processor that should
>> > arrive by tomorrow. I will attempt to use it for the Common Crawl data
>> > and the AWS solution that you provided in the previous mail. I'm
>> > presently reading papers and publications on the analysis of Common
>> > Crawl data. The Warcbase tool will definitely be used. I understand
>> > that the Common Crawl datasets are important and I will do everything
>> > it takes to build notebooks on them; the only concern is that it may
>> > take more time than the previous notebooks.
>> >
>> > Anish.
>> >
>> > On Mon, Jul 4, 2016 at 6:30 PM, Alexander Bezzubov <b...@apache.org>
>> wrote:
>> >
>> > > Hi Anish,
>> > >
>> > > thanks for keeping us posted about your progress!
>> > >
>> > > CommonCrawl is an important dataset, and it would be awesome if we
>> > > could find a way for you to build some notebooks for it through this
>> > > year's GSoC program.
>> > >
>> > > How about running Zeppelin on a single big-enough node in AWS for the
>> > > sake of this notebook?
>> > > If you use a spot instance you can get even big instances for a really
>> > > affordable price of $2-4 a day; you just need to make sure you persist
>> > > your notebooks on S3 [1] to avoid losing the data, and shut the
>> > > instance down for the night.
>> > >
>> > > AFAIK we do not have any free AWS credits for now, even for GSoC
>> > > students. If somebody knows a way to provide/get some, please feel
>> > > free to chime in; I know there are some Amazonian people on the list
>> > > :)
>> > >
>> > > But so far AWS spot instances are the most cost-effective solution I
>> > > can think of. Bonus: if you host your instance in the us-east-1
>> > > region, transfer from/to S3 will be free, as that's where the
>> > > CommonCrawl dataset lives.
>> > >
>> > > One more thing: please check out the awesome Warcbase library [2],
>> > > built by the internet preservation community. I find it really
>> > > helpful when working with web archives.
>> > >
>> > > On the notebook design:
>> > >  - to understand the context of this dataset better, please do some
>> > >    research on how other people use it, what for, etc. It would be
>> > >    great material for a blog post
>> > >  - try to provide examples of all the available formats: WARC, WET,
>> > >    WAT (maybe in the same or in different notebooks, it's up to you)
>> > >  - while using warcbase, mind that RDD persistence will not work
>> > >    until [3] is resolved, so avoid using it for now
>> > >
>> > > I understand that this can be a big task, so do not worry if it
>> > > takes time (learning AWS, etc.); just keep us posted on your progress
>> > > weekly and I'll be glad to help!
>> > >
>> > >
>> > >  1. http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#notebook-storage-in-s3
>> > >  2. https://github.com/lintool/warcbase
>> > >  3. https://github.com/lintool/warcbase/issues/227
>> > >
>> > > On Mon, Jul 4, 2016 at 7:00 PM, anish singh <anish18...@gmail.com>
>> > wrote:
>> > > > Hello,
>> > > >
>> > > > (everything outside Zeppelin)
>> > > > I had started work on the Common Crawl datasets, and tried to first
>> > > > have a look at only the data for May 2016. Out of the three formats
>> > > > available, I chose WET (the plain-text format). The data for May
>> > > > alone is divided into segments, and there are 24492 such segments.
>> > > > I downloaded only the first segment for May and got 432 MB of data.
>> > > > Now the problem is that my laptop is a very modest machine with a
>> > > > Core 2 Duo processor and 3 GB of RAM, such that even opening the
>> > > > downloaded data file in LibreOffice Writer filled the RAM completely
>> > > > and hung the machine, so bringing the data directly into Zeppelin,
>> > > > or analyzing it inside Zeppelin, seems impossible. As far as I know,
>> > > > there are two ways in which I can proceed:
>> > > >
>> > > > 1) Buying a new laptop with more RAM and a better processor, OR
>> > > > 2) Choosing another dataset
>> > > >
>> > > > I have no problem with either of the above ways, or anything else
>> > > > that you might suggest, but please let me know which way to proceed
>> > > > so that I can work at speed. Meanwhile, I will read more papers and
>> > > > publications on the possibilities of analyzing Common Crawl data.
>> > > >
>> > > > Thanks,
>> > > > Anish.
>> > >
>> >
>>
>
>
