Re: [Dbpedia-gsoc] GSoC 2015 Introduction and Parallel processing in DBpedia extraction Framework

Navin Pai Sun, 08 Mar 2015 04:46:58 -0700

Yup, looking at the changelog of Apache Spark and having worked on
upgrading much smaller applications across Spark versions, I can attest
that this process shouldn't take too much time. The number of breaking
changes is minimal in recent versions.


An idea I had, which I would like feedback on, is having a configuration
picker rather than a list of preconfigured containers/images, kind of along
the lines of Fedora's Revisor project [1]. You could mix and match
depending on the configuration you want to use, and a customized
image/container is created for you. Of course, the feasibility of this is
an open question...
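
A very rough sketch of what such a picker could look like, purely as an
illustration: the image names, the EXTRACTION_FS variable, and the function
name below are all hypothetical placeholders, not anything that exists in
the project today. It only shows the "mix and match, then generate" idea by
rendering a Dockerfile from two chosen options.

```python
# Hypothetical option tables: every image name and setting is a placeholder.
SPARK_IMAGES = {
    "0.9.1": "example/spark:0.9.1",
    "1.2.1": "example/spark:1.2.1",
}
STORAGE_ENV = {
    "local": {"EXTRACTION_FS": "file://"},
    "hdfs": {"EXTRACTION_FS": "hdfs://"},
    "s3": {"EXTRACTION_FS": "s3a://"},
}


def render_dockerfile(spark_version, storage):
    """Build Dockerfile text for one combination of picked options."""
    if spark_version not in SPARK_IMAGES:
        raise ValueError("unknown Spark version: " + spark_version)
    if storage not in STORAGE_ENV:
        raise ValueError("unknown storage backend: " + storage)
    # The base image is chosen by Spark version, then storage settings
    # are layered on top as environment variables.
    lines = ["FROM " + SPARK_IMAGES[spark_version]]
    for key, value in sorted(STORAGE_ENV[storage].items()):
        lines.append("ENV " + key + "=" + value)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dockerfile("1.2.1", "s3"))
```

A real picker would need many more dimensions (cluster size, cloud provider,
and so on), which is part of why the feasibility is unclear to me.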

Honestly, if you ask me, this one project could probably be broken up into
multiple projects, each with a different end goal. Docker brings in a very
interesting set of things to play with, and it would be great if some of
the mentors could provide more feedback on what the end goal of this
specific GSoC project is. :)

Thanks

[1] http://revisor.fedoraunity.org/



> Hi Xiao, and welcome!
>
> > Some thoughts from my initial impression and I appreciate your feedback:
> > - The project uses Spark 0.9.1 while the latest version of Spark is
> > bumped to 1.2.1. I suppose there will be some work to upgrade it to the
> > new version.
> >
>
> It'll perhaps be good to port the code to Spark 1.2.1; I can't imagine
> it'll take too much work because the Spark API has been pretty stable since
> then.
>
>
> > - It looks like the process puts the data into HDFS, uses Spark to
> > extract the data, and writes the result back to HDFS. Is there any design
> > document for this project?
> >
>
> Yes, but it can also work without HDFS. On a single-node cluster you can
> write directly to the file system (I'm not sure if there is enough
> documentation on that, but there should be; it's mostly about substituting
> hdfs:///home/user/blah with file:///home/user/blah). On a multi-node
> cluster with NFS you can also work without HDFS.
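
To make the substitution described above concrete, here is a minimal sketch
of rewriting a path's URI scheme. The helper name swap_scheme and the bucket
name in the usage note are made up for illustration; the framework itself
may handle its paths quite differently.

```python
def swap_scheme(uri, scheme, authority=""):
    """Return uri with its scheme (and authority) replaced.

    Hypothetical helper: it keeps only the path part of the input,
    so hdfs:///home/user/blah can become file:///home/user/blah.
    """
    prefix, sep, rest = uri.partition("://")
    if not sep or not prefix:
        raise ValueError("expected a URI of the form scheme://...: " + uri)
    # Everything before the first "/" is the old authority (for example a
    # namenode host and port); everything from that "/" on is the path.
    slash = rest.find("/")
    path = rest[slash:] if slash != -1 else "/"
    return scheme + "://" + authority + path


if __name__ == "__main__":
    print(swap_scheme("hdfs:///home/user/blah", "file"))
```

The same helper could point a path at another store, for example
swap_scheme("hdfs://namenode:8020/data/in", "s3a", "my-bucket"), where
the host and bucket names are placeholders.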
>
> I have been meaning to write a proper paper on the project for a few
> months but never managed to get around to it.
>
> > - Spark can work with various distributed file systems (S3, GlusterFS,
> > etc.), not limited to HDFS. So I suppose this could be configurable.
>
>
> It'd be a good idea to make this configurable, and I suppose it fits in
> well with the docker containers idea too. Different kinds of configurations
> for EC2/S3, Google Cloud etc.
>
> Feel free to ask any other questions that you may have while running it.
>
> Cheers,
> Nilesh
>
> You can also email me at cont...@nileshc.com or visit my website
> <http://nileshc.com/>
>
>
> On Thu, Mar 5, 2015 at 8:27 PM, Xiao Meng <xiaom...@gmail.com> wrote:
>
> > Hi,
> >
> > My name is Xiao, currently a PhD student at Simon Fraser University,
> > Canada.
> >
> > A little background on myself:
> >
> > - My research is mainly on data management, especially NoSQL databases.
> > - I worked for GSoC 2008 on PostgreSQL [1] when I was an undergraduate
> > student :-)
> > - I have now been working on some open source projects for one year.
> > They include Apache Hive [2] and Apache Drill [3], both SQL-on-Hadoop
> > engines. I've also played with Apache Spark for a while and have some
> > hands-on experience. I am learning Scala and quite like it.
> > - While working on the Hadoop ecosystem, I gained experience deploying
> > clusters for dev and test. Docker is a great tool for this purpose and I
> > have been building several complex Docker containers [4].
> >
> > I heard of the great DBpedia project a long time ago and have always
> > wanted to play with it :-)
> >
> > Given my background, I am quite interested in the following project:
> > Parallel processing in DBpedia extraction Framework [5].
> >
> >
> > Some thoughts from my initial impression and I appreciate your feedback:
> >
> > - The project uses Spark 0.9.1 while the latest version of Spark is
> > bumped to 1.2.1. I suppose there will be some work to upgrade it to the
> > new version.
> > - It looks like the process puts the data into HDFS, uses Spark to
> > extract the data, and writes the result back to HDFS. Is there any design
> > document for this project?
> >     - Spark can work with various distributed file systems (S3,
> > GlusterFS, etc.), not limited to HDFS. So I suppose this could be
> > configurable.
> >
> > I will try it out in the following days. Any suggestions for evolving
> > this project?
> >
> > Look forward to contributing to DBpedia!
> >
> >
> > [1] https://wiki.postgresql.org/wiki/GSoC_2008
> > [2] https://github.com/apache/hive
> > [3] https://github.com/apache/drill
> > [4] https://github.com/xiaom/docker-drill
> > [5] https://github.com/dbpedia/distributed-extraction-framework
>
>


