Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-26 Thread qingyang li
Egor, I have run into the same problem that you asked about in this thread:

http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCAMrx5DwJVJS0g_FE7_2qwMu4Xf0y5VfV=tlyauv2kh5v4k6...@mail.gmail.com%3E

Have you fixed this problem?

I am using Shark to read a table that I created on HDFS.

I found two protobuf*.jar files in the Shark lib_managed directory:
[root@bigdata001 shark-0.9.0]# find . -name "proto*.jar"
./lib_managed/jars/org.spark-project.protobuf/protobuf-java/protobuf-java-2.4.1-shaded.jar
./lib_managed/bundles/com.google.protobuf/protobuf-java/protobuf-java-2.5.0.jar


My Hadoop is using protobuf-java-2.5.0.jar.
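
One way to narrow this down (a diagnostic sketch only; the jar paths are the
ones from the find output above) is to check whether the 2.4.1 "shaded" jar
still contains un-relocated com.google.protobuf classes. If it does, it can
shadow Hadoop's protobuf 2.5.0 classes on the classpath; if its classes were
relocated to another package, the two jars should not clash:

unzip -l ./lib_managed/jars/org.spark-project.protobuf/protobuf-java/protobuf-java-2.4.1-shaded.jar \
  | grep -c "com/google/protobuf"
unzip -l ./lib_managed/bundles/com.google.protobuf/protobuf-java/protobuf-java-2.5.0.jar \
  | grep -c "com/google/protobuf"

A non-zero count for the first jar means two versions of the same classes are
on Shark's classpath.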


Re: ALS memory limits

2014-03-26 Thread Sean Owen
Much of this sounds related to the memory issue mentioned earlier in this
thread. Are you using a build that has fixed that? That would be by far
most important here.

If the raw memory requirement is 8GB, the actual heap size necessary could
be a lot larger -- object overhead, all the other stuff in memory,
overheads within the heap allocation, etc. So I would expect total memory
requirement to be significantly more than 9GB.

Still, this is the *total* requirement across the cluster. Each worker is
just loading part of the matrix. If you have 10 workers I would imagine it
roughly chops the per-worker memory requirement by 10x.
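
As a rough illustration of those two points, using the numbers from this
thread (back-of-the-envelope only):

  20,000,000 users x 50 factors x 8 bytes  ~= 8.0 GB  (user factors)
   1,000,000 items x 50 factors x 8 bytes  ~= 0.4 GB  (item factors)

That is ~8.4 GB of raw factor data in total, or under 1 GB per worker across
10 workers -- before Java object headers, the cached rating partitions, and
shuffle buffers, which can multiply the actual heap needed several times over.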

This in turn depends on also letting workers use more than their default
amount of memory. May need to increase executor memory here.
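
For reference, a minimal sketch of where those knobs live in a 0.9-era
standalone deployment (property and variable names as in the Spark 0.9
standalone/configuration docs; the values below are only illustrative for
32 GB machines):

# conf/spark-env.sh on each worker: memory the worker is allowed to hand out
export SPARK_WORKER_MEMORY=28g
# on the driver side, one way to raise the per-application executor heap
export SPARK_JAVA_OPTS="-Dspark.executor.memory=24g"

Setting spark.executor.memory on the driver's SparkConf should work as well.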

Separately, I have observed issues with too many files open and lots of
/tmp files. You may have to use ulimit to increase the number of open files
allowed.
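
For the open-files limit specifically, the adjustment is plain Linux
administration rather than anything Spark-specific (65536 is just an example
value):

# check the limit for the user that runs the workers
ulimit -n
# raise it for the current shell before starting the worker
ulimit -n 65536
# or make it permanent in /etc/security/limits.conf and re-login:
#   sparkuser  soft  nofile  65536
#   sparkuser  hard  nofile  65536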

On Wed, Mar 26, 2014 at 6:06 AM, Debasish Das wrote:

> Hi,
>
> For our usecases we are looking into 20 x 1M matrices which comes in the
> similar ranges as outlined by the paper over here:
>
> http://sandeeptata.blogspot.com/2012/12/sparkler-large-scale-matrix.html
>
> Is the exponential runtime growth in spark ALS as outlined by the blog
> still exists in recommendation.ALS ?
>
> I am running a spark cluster of 10 nodes with total memory of around 1 TB
> with 80 cores
>
> With rank = 50, the memory requirements for ALS should be 20Mx50 doubles on
> every worker which is around 8 GB
>
> Even if both the factor matrices are cached in memory I should be bounded
> by ~ 9 GB but even with 32 GB per worker I see GC errors...
>
> I am debugging the scalability and memory requirements of the algorithm
> further but any insights will be very helpful...
>
> Also there are two other issues:
>
> 1. If GC errors are hit, that worker JVM goes down and I have to restart it
> manually. Is this expected ?
>
> 2. When I try to make use of all 80 cores on the cluster I get some issues
> related to java.io.File not found exception on /tmp/ ? Is there some OS
> limit that how many cores can simultaneously access /tmp from a process ?
>
> Thanks.
> Deb
>
>


Re: [VOTE] Release Apache Spark 0.9.1 (rc1)

2014-03-26 Thread Tathagata Das
This voting thread has been canceled to accommodate an important fix
regarding Spark's use of the ASM library. Refer to
https://spark-project.atlassian.net/browse/SPARK-782 for more details.

TD


On Tue, Mar 25, 2014 at 10:36 PM, Matei Zaharia wrote:

> Actually I found one minor issue, which is that the support for Tachyon in
> make-distribution.sh seems to rely on GNU sed flags and doesn't work on Mac
> OS X. But I'd be okay pushing that to a later release since this is a
> packaging operation that you do only once, and presumably you'd do it on a
> Linux cluster. I opened
> https://spark-project.atlassian.net/browse/SPARK-1326 to track it. We can
> put it in another RC if we find bigger issues.
>
> Matei
>
> On Mar 25, 2014, at 10:31 PM, Matei Zaharia 
> wrote:
>
> > +1 looks good to me. I tried both the source and CDH4 versions and
> looked at the new streaming docs.
> >
> > The release notes seem slightly incomplete, but I guess you're still
> working on them? Anyway those don't go into the release tarball so it's
> okay.
> >
> > Matei
> >
> > On Mar 24, 2014, at 2:01 PM, Tathagata Das 
> wrote:
> >
> >> Please vote on releasing the following candidate as Apache Spark
> version 0.9.1
> >> A draft of the release notes along with the CHANGES.txt file is
> attached to this e-mail.
> >>
> >> The tag to be voted on is v0.9.1 (commit 81c6a06c):
> >>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=81c6a06c796a87aaeb5f129f36e4c3396e27d652
> >>
> >>
> >>
> >> The release files, including signatures, digests, etc can be found at:
> >> http://people.apache.org/~tdas/spark-0.9.1-rc1/
> >>
> >> Release artifacts are signed with the following key:
> >> https://people.apache.org/keys/committer/tdas.asc
> >>
> >>
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1007/
> >>
> >>
> >>
> >> The documentation corresponding to this release can be found at:
> >> http://people.apache.org/~tdas/spark-0.9.1-rc1-docs/
> >>
> >> Please vote on releasing this package as Apache Spark 0.9.1!
> >>
> >> The vote is open until Thursday, March 27, at 22:00 UTC and passes if
> >> a majority of at least 3 +1 PMC votes are cast.
> >>
> >> [ ] +1 Release this package as Apache Spark 0.9.1
> >>
> >>
> >> [ ] -1 Do not release this package because ...
> >>
> >>
> >>
> >> To learn more about Apache Spark, please see
> >> http://spark.apache.org/
> >> 
> >
>
>
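
An aside on the GNU-sed flags mentioned above: the usual portability trap is
in-place editing, which takes different arguments on GNU sed and on the BSD
sed shipped with Mac OS X (illustration only, not the exact line from
make-distribution.sh):

sed -i 's/foo/bar/' somefile       # GNU sed: -i takes an optional backup suffix
sed -i '' 's/foo/bar/' somefile    # BSD/OS X sed: -i requires a (possibly empty) suffix argument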


Re: ALS memory limits

2014-03-26 Thread Debasish Das
Thanks Sean. Looking into the executor memory options now...

I am at incubator-spark head. Does that have all the fixes, or do I need
Spark head? I can deploy the Spark head as well...

I am not running implicit feedback yet...I remember the memory enhancements
were mainly for implicit, right?

For ulimit let me look into the CentOS settings...I am curious how map-reduce
resolves it...by using 1 core from 1 process? I am running 2 TB YARN jobs as
well for ETL, preprocessing, etc....and have not seen the too-many-open-files
error yet.

When there is a GC error, the worker dies...that's a mystery as well...any
insights from the Spark core team? A YARN container gets killed if GC
boundaries are about to be hit...could similar ideas be used here as well?
Also, which tool do we use for memory debugging in Spark?
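
On the memory-debugging question, one low-tech starting point is plain JVM GC
logging plus heap dumps on OOM (standard HotSpot flags; passing them through
SPARK_JAVA_OPTS is one 0.9-era way to get them onto the executors):

export SPARK_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/spark-heapdumps"

The resulting GC logs and .hprof files can then be inspected with jhat,
VisualVM, or Eclipse MAT.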
 On Mar 26, 2014 1:45 AM, "Sean Owen"  wrote:

> Much of this sounds related to the memory issue mentioned earlier in this
> thread. Are you using a build that has fixed that? That would be by far
> most important here.
>
> If the raw memory requirement is 8GB, the actual heap size necessary could
> be a lot larger -- object overhead, all the other stuff in memory,
> overheads within the heap allocation, etc. So I would expect total memory
> requirement to be significantly more than 9GB.
>
> Still, this is the *total* requirement across the cluster. Each worker is
> just loading part of the matrix. If you have 10 workers I would imagine it
> roughly chops the per-worker memory requirement by 10x.
>
> This in turn depends on also letting workers use more than their default
> amount of memory. May need to increase executor memory here.
>
> Separately, I have observed issues with too many files open and lots of
> /tmp files. You may have to use ulimit to increase the number of open files
> allowed.
>
> On Wed, Mar 26, 2014 at 6:06 AM, Debasish Das wrote:
>
> > Hi,
> >
> > For our usecases we are looking into 20 x 1M matrices which comes in the
> > similar ranges as outlined by the paper over here:
> >
> > http://sandeeptata.blogspot.com/2012/12/sparkler-large-scale-matrix.html
> >
> > Is the exponential runtime growth in spark ALS as outlined by the blog
> > still exists in recommendation.ALS ?
> >
> > I am running a spark cluster of 10 nodes with total memory of around 1 TB
> > with 80 cores
> >
> > With rank = 50, the memory requirements for ALS should be 20Mx50 doubles
> on
> > every worker which is around 8 GB
> >
> > Even if both the factor matrices are cached in memory I should be bounded
> > by ~ 9 GB but even with 32 GB per worker I see GC errors...
> >
> > I am debugging the scalability and memory requirements of the algorithm
> > further but any insights will be very helpful...
> >
> > Also there are two other issues:
> >
> > 1. If GC errors are hit, that worker JVM goes down and I have to restart
> it
> > manually. Is this expected ?
> >
> > 2. When I try to make use of all 80 cores on the cluster I get some
> issues
> > related to java.io.File not found exception on /tmp/ ? Is there some OS
> > limit that how many cores can simultaneously access /tmp from a process ?
> >
> > Thanks.
> > Deb
> >
> >
>


Re: ALS memory limits

2014-03-26 Thread Debasish Das
Just a clarification: I am using Spark ALS with explicit feedback on a
standalone cluster, without deploying the ZooKeeper master HA option yet...

When a worker in the standalone Spark cluster hits a GC error, the worker
dies as well and I have to restart it...Understanding this issue will be
useful as we deploy the solution...


On Wed, Mar 26, 2014 at 7:31 AM, Debasish Das wrote:

> Thanks Sean. Looking into executor memory options now...
>
> I am at incubator_spark head. Does that has all the fixes or I need spark
> head ? I can deploy the spark head as well...
>
> I am not running implicit feedback yet...I remember memory enhancements
> were mainly for implicit right ?
>
> For ulimit let me look into centos settings...I am curious how map-reduce
> resolves it...by using 1 core from 1 process ? I am running 2 tb yarn jobs
> as well for etl, pre processing etc....have not seen the too many files
> opened yet
>
> when there is gc error, the worker dies...that's mystery as well...any
> insights from spark core team ? Yarn container gets killed if gc boundaries
> are about to hit...similar ideas can be used here as well ? Also which
> tool do we use for memory debugging in spark ?
>  On Mar 26, 2014 1:45 AM, "Sean Owen"  wrote:
>
>> Much of this sounds related to the memory issue mentioned earlier in this
>> thread. Are you using a build that has fixed that? That would be by far
>> most important here.
>>
>> If the raw memory requirement is 8GB, the actual heap size necessary could
>> be a lot larger -- object overhead, all the other stuff in memory,
>> overheads within the heap allocation, etc. So I would expect total memory
>> requirement to be significantly more than 9GB.
>>
>> Still, this is the *total* requirement across the cluster. Each worker is
>> just loading part of the matrix. If you have 10 workers I would imagine it
>> roughly chops the per-worker memory requirement by 10x.
>>
>> This in turn depends on also letting workers use more than their default
>> amount of memory. May need to increase executor memory here.
>>
>> Separately, I have observed issues with too many files open and lots of
>> /tmp files. You may have to use ulimit to increase the number of open
>> files
>> allowed.
>>
>> On Wed, Mar 26, 2014 at 6:06 AM, Debasish Das wrote:
>>
>> > Hi,
>> >
>> > For our usecases we are looking into 20 x 1M matrices which comes in the
>> > similar ranges as outlined by the paper over here:
>> >
>> >
>> http://sandeeptata.blogspot.com/2012/12/sparkler-large-scale-matrix.html
>> >
>> > Is the exponential runtime growth in spark ALS as outlined by the blog
>> > still exists in recommendation.ALS ?
>> >
>> > I am running a spark cluster of 10 nodes with total memory of around 1
>> TB
>> > with 80 cores
>> >
>> > With rank = 50, the memory requirements for ALS should be 20Mx50
>> doubles on
>> > every worker which is around 8 GB
>> >
>> > Even if both the factor matrices are cached in memory I should be
>> bounded
>> > by ~ 9 GB but even with 32 GB per worker I see GC errors...
>> >
>> > I am debugging the scalability and memory requirements of the algorithm
>> > further but any insights will be very helpful...
>> >
>> > Also there are two other issues:
>> >
>> > 1. If GC errors are hit, that worker JVM goes down and I have to
>> restart it
>> > manually. Is this expected ?
>> >
>> > 2. When I try to make use of all 80 cores on the cluster I get some
>> issues
>> > related to java.io.File not found exception on /tmp/ ? Is there some OS
>> > limit that how many cores can simultaneously access /tmp from a process
>> ?
>> >
>> > Thanks.
>> > Deb
>> >
>> >
>>
>


Re: Spark 0.9.1 release

2014-03-26 Thread Patrick Wendell
Hey TD,

This one we just merged into master this morning:
https://spark-project.atlassian.net/browse/SPARK-1322

It should definitely go into the 0.9 branch because there was a bug in the
semantics of top() which at this point is unreleased in Python.
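
For context, the intended semantics of top() are "the n largest elements, in
descending order," matching the Scala RDD API -- a hypothetical session
against a build that contains the fix:

$ ./bin/pyspark
>>> sc.parallelize([3, 6, 1, 2, 5, 4]).top(3)
[6, 5, 4]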

I didn't backport it yet because I figured you might want to do this at a
specific time. So please go ahead and backport it. Not sure whether this
warrants another RC.

- Patrick


On Tue, Mar 25, 2014 at 10:47 PM, Mridul Muralidharan wrote:

> On Wed, Mar 26, 2014 at 10:53 AM, Tathagata Das
>  wrote:
> > PR 159 seems like a fairly big patch to me. And quite recent, so its
> impact
> > on the scheduling is not clear. It may also depend on other changes that
> > may have gotten into the DAGScheduler but not pulled into branch 0.9. I
> am
> > not sure it is a good idea to pull that in. We can pull those changes
> later
> > for 0.9.2 if required.
>
>
> There is no impact on scheduling : it only has an impact on error
> handling - it ensures that you can actually use spark on yarn in
> multi-tenant clusters more reliably.
> Currently, any reasonably long-running job (30 mins+) working on a
> non-trivial dataset will fail due to accumulated failures in spark.
>
>
> Regards,
> Mridul
>
>
> >
> > TD
> >
> >
> >
> >
> > On Tue, Mar 25, 2014 at 8:44 PM, Mridul Muralidharan wrote:
> >
> >> Forgot to mention this in the earlier request for PR's.
> >> If there is another RC being cut, please add
> >> https://github.com/apache/spark/pull/159 to it too (if not done
> >> already !).
> >>
> >> Thanks,
> >> Mridul
> >>
> >> On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
> >>  wrote:
> >> >  Hello everyone,
> >> >
> >> > Since the release of Spark 0.9, we have received a number of important
> >> bug
> >> > fixes and we would like to make a bug-fix release of Spark 0.9.1. We
> are
> >> > going to cut a release candidate soon and we would love it if people
> test
> >> > it out. We have backported several bug fixes into the 0.9 and updated
> >> JIRA
> >> > accordingly<
> >>
> https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
> >> >.
> >> > Please let me know if there are fixes that were not backported but you
> >> > would like to see them in 0.9.1.
> >> >
> >> > Thanks!
> >> >
> >> > TD
> >>
>


Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-26 Thread yao
@qingyang, Spark 0.9.0 works perfectly for me when accessing (read/write)
data on HDFS. BTW, if you look at pom.xml, you have to choose the yarn
profile to compile Spark so that it won't include protobuf 2.4.1 in your
final jars. Here is the command line we use to compile Spark with Hadoop 2.2:

mvn -U -Dyarn.version=2.2.0 -Dhadoop.version=2.2.0 -Pyarn  -DskipTests
package
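
A quick way to double-check that the resulting build really resolves protobuf
to 2.5.0 only (run from the Spark source root with the same profile and
version flags; dependency:tree is the stock Maven dependency plugin):

mvn -Pyarn -Dyarn.version=2.2.0 -Dhadoop.version=2.2.0 dependency:tree | grep protobuf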

Thanks
-Shengzhe


On Wed, Mar 26, 2014 at 12:04 AM, qingyang li wrote:

> Egor, i encounter the same problem which you have asked in this thread:
>
>
> http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCAMrx5DwJVJS0g_FE7_2qwMu4Xf0y5VfV=tlyauv2kh5v4k6...@mail.gmail.com%3E
>
> have you fixed this problem?
>
> i am using shark to read a table which i have created on hdfs.
>
> i found in shark lib_managed directory there are two protobuf*.jar:
> [root@bigdata001 shark-0.9.0]# find . -name "proto*.jar"
>
> ./lib_managed/jars/org.spark-project.protobuf/protobuf-java/protobuf-java-2.4.1-shaded.jar
>
> ./lib_managed/bundles/com.google.protobuf/protobuf-java/protobuf-java-2.5.0.jar
>
>
> my hadoop is using protobuf-java-2.5.0.jar .
>


Fwd: Apache projects based Business

2014-03-26 Thread Edward J. Yoon
Hi community,

If there are things that I need to be aware of, please give me a heads up.

-- Forwarded message --
From: Edward J. Yoon 
Date: Wed, Mar 26, 2014 at 11:13 AM
Subject: Apache projects based Business
To: legal-disc...@apache.org
Cc: Commons Developers List 


Hi,

I recently established my own company, DataSayer Co., Ltd. [1]. We'll
develop Big Data products using Apache projects (Apache Hadoop, Hama,
Spark, MRQL). The company is currently located in Seoul, and we also plan
to move to the USA.

Before announcing the official website and publishing some press on
the web, please review the web text content and let me know if there
are things that I need to be aware of.

Thanks!

1. http://datasayer.com/

--
Edward J. Yoon (@eddieyoon)
Chief Executive Officer
DataSayer, Inc.


-- 
Edward J. Yoon (@eddieyoon)
Chief Executive Officer
DataSayer, Inc.


Re: Spark 0.9.1 release

2014-03-26 Thread Tathagata Das
Updates:
1. The fix for the ASM problem that Kevin mentioned is already in Spark
0.9.1 RC2.
2. The fix for pyspark's RDD.top() that Patrick mentioned has been pulled
into branch 0.9. This will get into the next RC if there is one.

TD


On Wed, Mar 26, 2014 at 9:21 AM, Patrick Wendell  wrote:

> Hey TD,
>
> This one we just merged into master this morning:
> https://spark-project.atlassian.net/browse/SPARK-1322
>
> It should definitely go into the 0.9 branch because there was a bug in the
> semantics of top() which at this point is unreleased in Python.
>
> I didn't backport it yet because I figured you might want to do this at a
> specific time. So please go ahead and backport it. Not sure whether this
> warrants another RC.
>
> - Patrick
>
>
> On Tue, Mar 25, 2014 at 10:47 PM, Mridul Muralidharan wrote:
>
> > On Wed, Mar 26, 2014 at 10:53 AM, Tathagata Das
> >  wrote:
> > > PR 159 seems like a fairly big patch to me. And quite recent, so its
> > impact
> > > on the scheduling is not clear. It may also depend on other changes
> that
> > > may have gotten into the DAGScheduler but not pulled into branch 0.9. I
> > am
> > > not sure it is a good idea to pull that in. We can pull those changes
> > later
> > > for 0.9.2 if required.
> >
> >
> > There is no impact on scheduling : it only has an impact on error
> > handling - it ensures that you can actually use spark on yarn in
> > multi-tenant clusters more reliably.
> > Currently, any reasonably long-running job (30 mins+) working on a
> > non-trivial dataset will fail due to accumulated failures in spark.
> >
> >
> > Regards,
> > Mridul
> >
> >
> > >
> > > TD
> > >
> > >
> > >
> > >
> > > On Tue, Mar 25, 2014 at 8:44 PM, Mridul Muralidharan wrote:
> > >
> > >> Forgot to mention this in the earlier request for PR's.
> > >> If there is another RC being cut, please add
> > >> https://github.com/apache/spark/pull/159 to it too (if not done
> > >> already !).
> > >>
> > >> Thanks,
> > >> Mridul
> > >>
> > >> On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
> > >>  wrote:
> > >> >  Hello everyone,
> > >> >
> > >> > Since the release of Spark 0.9, we have received a number of
> important
> > >> bug
> > >> > fixes and we would like to make a bug-fix release of Spark 0.9.1. We
> > are
> > >> > going to cut a release candidate soon and we would love it if people
> > test
> > >> > it out. We have backported several bug fixes into the 0.9 and
> updated
> > >> JIRA
> > >> > accordingly<
> > >>
> >
> https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
> > >> >.
> > >> > Please let me know if there are fixes that were not backported but
> you
> > >> > would like to see them in 0.9.1.
> > >> >
> > >> > Thanks!
> > >> >
> > >> > TD
> > >>
> >
>


Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-03-26 Thread Josh Suereth
Cool. It sounds like focusing on sbt-pom-reader would be a good thing for
you guys then.

There are a few... fun... issues around Maven parent projects that are
still floating around with sbt-pom-reader, and they appear to be
fundamental ivy-maven hate-based issues.

In any case, while I'm generally swamped with things, let me know anything
I can do to improve the situation. The shading thing frustrates me
slightly, as I think a lot of the projects that shade (that I looked into)
have been abandoned. Which plugin are you using?


On Fri, Mar 14, 2014 at 10:21 PM, Matei Zaharia wrote:

> I like the pom-reader approach as well -- in particular, that it lets you
> add extra stuff in your SBT build after loading the dependencies from the
> POM. Profiles would be the one thing missing to be able to pass options
> through.
>
> Matei
>
> On Mar 14, 2014, at 10:03 AM, Patrick Wendell  wrote:
>
> > Hey Josh,
> >
> > I spent some time playing with the sbt-pom-reader and actually I think
> > it might suit our needs pretty well. Basically our situation is that
> > Maven has more mature support for packaging (shading, official
> > assembly plug-in) and is used by pretty much every downstream packager
> > of Spark. SBT is used by the developers heavily for incremental
> > compilation, the console, etc. So far we've been maintaining entirely
> > isolated builds in sbt and maven which is a huge development cost and
> > something that's led to divergence over time.
> >
> > The sbt-pom-reader I think could give us the best of both worlds. But
> > it doesn't support profiles in Maven which we use pretty heavily. I
> > played a bit this week and tried to get it to work without much luck:
> >
> https://github.com/pwendell/sbt-pom-reader/commit/ca1f1f2d6bf8891acb7212facf4807baaca8974d
> >
> > Any pointers or interest in helping us with that feature? I think with
> > that in place we might have a good solution where the dependencies are
> > declared in one place (poms) and we use Maven for packaging but we can
> > use sbt for day to day development.
> >
> >
> >
> > On Fri, Mar 14, 2014 at 9:41 AM, Evan Chan  wrote:
> >> Jsuereth -
> >>
> >> Thanks for jumping on this thread.  There are a couple things which
> >> would help SBT IMHO (w.r.t. issues mentioned here):
> >>
> >> - A way of generating a project-wide pom.xml  (I believe make-pom only
> >> generates it for each sub-project)
> >>   - and probably better or more feature-complete POMs, but the Maven
> >> folks here can speak to that
> >> - Make the sbt-pom-reader plugin officially part of SBT, and make it
> >> work well (again the Maven folks need to jump in here, though plugins
> >> can't be directly translated)
> >> - Have the sbt-assembly plugin officially maintained by Typesafe, or
> >> part of SBT  most Maven folks expect not to have to include a
> >> plugin to generate a fat jar, and it's a pretty essential plugin for
> >> just about every SBT project
> >> - Also there is no equivalent (AFAIK) to the maven shader plugin.
> >>
> >> I also wish that the dependency-graph plugin was included by default,
> >> but that's just me  :)
> >>
> >> -Evan
> >>
> >>
> >>> On Fri, Mar 14, 2014 at 6:47 AM, jsuereth wrote:
> >>> Hey guys -
> >>>
> >>> If there's anything we can do to improve the sbt experience, let me
> know.
> >>> I'd be extremely interested to see how/where there are issues
> integrating
> >>> sbt with the existing Hadoop ecosystem.
> >>>
> >>> Particularly the difficulties in using Sbt + Maven together (something
> which
> >>> tends block more than just spark from adopting sbt).
> >>>
> >>> I'm more than happy to listen and see what we can do on the sbt side
> to make
> >>> this as seamless as possible for all parties.
> >>>
> >>> Thanks!
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Necessity-of-Maven-and-SBT-Build-in-Spark-tp2315p5682.html
> >>> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> >>
> >>
> >>
> >> --
> >> --
> >> Evan Chan
> >> Staff Engineer
> >> e...@ooyala.com  |
>
>


Re: Error executing sql using shark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-26 Thread qingyang li
My Spark also works well with Hadoop 2.2.0, but my Shark does not work
with Hadoop 2.2.0 because of the protobuf version problem.

In the Shark directory I found two versions of protobuf, and both are
loaded into the classpath.

[root@bigdata001 shark-0.9.0]# find . -name "proto*.jar"

./lib_managed/jars/org.spark-project.protobuf/protobuf-java/protobuf-java-2.4.1-shaded.jar
./lib_managed/bundles/com.google.protobuf/protobuf-java/protobuf-java-2.5.0.jar
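
If the 2.4.1 jar really does carry un-relocated com.google.protobuf classes,
one blunt workaround is to move it out of lib_managed and restart Shark. This
is only a sketch -- keep a backup, since that jar may be needed by other
bundled dependencies (e.g. Akka), in which case removing it will break them:

cd /path/to/shark-0.9.0   # adjust to your installation
mkdir -p /tmp/protobuf-2.4.1-backup
mv ./lib_managed/jars/org.spark-project.protobuf/protobuf-java/protobuf-java-2.4.1-shaded.jar \
   /tmp/protobuf-2.4.1-backup/
find . -name "proto*.jar"   # only the 2.5.0 jar should remain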




2014-03-27 2:26 GMT+08:00 yao :

> @qingyang, spark 0.9.0 works for me perfectly when accessing (read/write)
> data on hdfs. BTW, if you look at pom.xml, you have to choose yarn profile
> to compile spark, so that it won't include protobuf 2.4.1 in your final
> jars. Here is the command line we use to compile spark with hadoop 2.2:
>
> mvn -U -Dyarn.version=2.2.0 -Dhadoop.version=2.2.0 -Pyarn  -DskipTests
> package
>
> Thanks
> -Shengzhe
>
>
> On Wed, Mar 26, 2014 at 12:04 AM, qingyang li wrote:
>
> > Egor, i encounter the same problem which you have asked in this thread:
> >
> >
> >
> http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCAMrx5DwJVJS0g_FE7_2qwMu4Xf0y5VfV=tlyauv2kh5v4k6...@mail.gmail.com%3E
> >
> > have you fixed this problem?
> >
> > i am using shark to read a table which i have created on hdfs.
> >
> > i found in shark lib_managed directory there are two protobuf*.jar:
> > [root@bigdata001 shark-0.9.0]# find . -name "proto*.jar"
> >
> >
> ./lib_managed/jars/org.spark-project.protobuf/protobuf-java/protobuf-java-2.4.1-shaded.jar
> >
> >
> ./lib_managed/bundles/com.google.protobuf/protobuf-java/protobuf-java-2.5.0.jar
> >
> >
> > my hadoop is using protobuf-java-2.5.0.jar .
> >
>

