Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-26 Thread qingyang li
Egor, I have run into the same problem you asked about in this thread:

http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCAMrx5DwJVJS0g_FE7_2qwMu4Xf0y5VfV=tlyauv2kh5v4k6...@mail.gmail.com%3E

Have you fixed this problem?

I am using Shark to read a table I created on HDFS.

I found that there are two protobuf*.jar files in the Shark lib_managed directory:
[root@bigdata001 shark-0.9.0]# find . -name proto*.jar
./lib_managed/jars/org.spark-project.protobuf/protobuf-java/protobuf-java-2.4.1-shaded.jar
./lib_managed/bundles/com.google.protobuf/protobuf-java/protobuf-java-2.5.0.jar


My Hadoop is using protobuf-java-2.5.0.jar.
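
One quick way to see which jar actually serves the protobuf classes at runtime
is to ask the classloader from the Shark/Spark shell. A minimal diagnostic
sketch (not a fix; Message is just a representative protobuf class):

// Prints the jar that the com.google.protobuf classes are loaded from,
// i.e. whether the 2.4.1 or the 2.5.0 jar wins on the classpath.
println(classOf[com.google.protobuf.Message]
  .getProtectionDomain.getCodeSource.getLocation)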


Re: ALS memory limits

2014-03-26 Thread Sean Owen
Much of this sounds related to the memory issue mentioned earlier in this
thread. Are you using a build that has fixed that? That would be by far
the most important thing here.

If the raw memory requirement is 8GB, the actual heap size necessary could
be a lot larger -- object overhead, all the other stuff in memory,
overheads within the heap allocation, etc. So I would expect total memory
requirement to be significantly more than 9GB.
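
As a back-of-envelope check of that raw figure (a sketch only, deliberately
ignoring the JVM overheads mentioned above):

// 20M x 50 user factors stored as doubles, before any JVM object overhead.
val users = 20000000L
val rank  = 50L
val rawGB = users * rank * 8L / (1024.0 * 1024 * 1024)  // 8 bytes per double
println(f"$rawGB%.1f GB")                               // ~7.5 GB, i.e. the ~8 GB figure above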

Still, this is the *total* requirement across the cluster. Each worker is
just loading part of the matrix. If you have 10 workers I would imagine it
roughly chops the per-worker memory requirement by 10x.

This in turn depends on also letting workers use more than their default
amount of memory. You may need to increase executor memory here.
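
A minimal sketch of doing that programmatically with the 0.9 SparkConf API
(the app name and the 24g value are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}

// Request a larger heap for each executor than the default.
val conf = new SparkConf()
  .setAppName("als-memory-test")
  .set("spark.executor.memory", "24g")
val sc = new SparkContext(conf)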

Separately, I have observed issues with too many files open and lots of
/tmp files. You may have to use ulimit to increase the number of open files
allowed.

On Wed, Mar 26, 2014 at 6:06 AM, Debasish Das debasish.da...@gmail.com wrote:

 Hi,

 For our use cases we are looking into 20M x 1M matrices, which are in a
 similar range to the ones outlined in the paper here:

 http://sandeeptata.blogspot.com/2012/12/sparkler-large-scale-matrix.html

 Does the exponential runtime growth in Spark ALS outlined in the blog
 still exist in recommendation.ALS?

 I am running a Spark cluster of 10 nodes with around 1 TB of total memory
 and 80 cores.

 With rank = 50, the memory requirement for ALS should be 20M x 50 doubles
 on every worker, which is around 8 GB.

 Even if both factor matrices are cached in memory I should be bounded by
 ~9 GB, but even with 32 GB per worker I see GC errors...

 I am debugging the scalability and memory requirements of the algorithm
 further but any insights will be very helpful...

 Also there are two other issues:

 1. If GC errors are hit, that worker JVM goes down and I have to restart it
 manually. Is this expected?

 2. When I try to make use of all 80 cores on the cluster I get some issues
 related to a java.io.FileNotFoundException on /tmp/. Is there some OS
 limit on how many cores can simultaneously access /tmp from a process?

 Thanks.
 Deb




Re: ALS memory limits

2014-03-26 Thread Debasish Das
Thanks Sean. Looking into executor memory options now...

I am at incubator-spark head. Does that have all the fixes, or do I need spark
head? I can deploy spark head as well...

I am not running implicit feedback yet... I remember the memory enhancements
were mainly for implicit, right?

For ulimit, let me look into the CentOS settings... I am curious how MapReduce
resolves it... by using 1 core per process? I am running 2 TB YARN jobs
as well for ETL, pre-processing, etc., and have not seen the too-many-open-files
issue yet.

When there is a GC error, the worker dies... that's a mystery as well. Any
insights from the Spark core team? A YARN container gets killed if GC boundaries
are about to be hit... could similar ideas be used here as well? Also, which
tool do we use for memory debugging in Spark?
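
One knob worth checking for per-task footprint is the number of user/product
blocks; the sketch below uses the five-argument ALS.train overload with
illustrative values, a hypothetical input path, and the spark-shell sc:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Hypothetical ratings file of user,product,rating lines.
val ratings = sc.textFile("hdfs:///path/to/ratings").map { line =>
  val Array(u, p, r) = line.split(',')
  Rating(u.toInt, p.toInt, r.toDouble)
}

// rank = 50, 10 iterations, lambda = 0.01, 80 blocks (one per core), all illustrative.
// More blocks means smaller factor-matrix partitions per task, at the cost of more shuffle.
val model = ALS.train(ratings, 50, 10, 0.01, 80)
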
 On Mar 26, 2014 1:45 AM, Sean Owen so...@cloudera.com wrote:

 Much of this sounds related to the memory issue mentioned earlier in this
 thread. Are you using a build that has fixed that? That would be by far
 most important here.

 If the raw memory requirement is 8GB, the actual heap size necessary could
 be a lot larger -- object overhead, all the other stuff in memory,
 overheads within the heap allocation, etc. So I would expect total memory
 requirement to be significantly more than 9GB.

 Still, this is the *total* requirement across the cluster. Each worker is
 just loading part of the matrix. If you have 10 workers I would imagine it
 roughly chops the per-worker memory requirement by 10x.

 This in turn depends on also letting workers use more than their default
 amount of memory. May need to increase executor memory here.

 Separately, I have observed issues with too many files open and lots of
 /tmp files. You may have to use ulimit to increase the number of open files
 allowed.

 On Wed, Mar 26, 2014 at 6:06 AM, Debasish Das debasish.da...@gmail.com
 wrote:

  Hi,
 
  For our usecases we are looking into 20 x 1M matrices which comes in the
  similar ranges as outlined by the paper over here:
 
  http://sandeeptata.blogspot.com/2012/12/sparkler-large-scale-matrix.html
 
  Is the exponential runtime growth in spark ALS as outlined by the blog
  still exists in recommendation.ALS ?
 
  I am running a spark cluster of 10 nodes with total memory of around 1 TB
  with 80 cores
 
  With rank = 50, the memory requirements for ALS should be 20Mx50 doubles
 on
  every worker which is around 8 GB
 
  Even if both the factor matrices are cached in memory I should be bounded
  by ~ 9 GB but even with 32 GB per worker I see GC errors...
 
  I am debugging the scalability and memory requirements of the algorithm
  further but any insights will be very helpful...
 
  Also there are two other issues:
 
  1. If GC errors are hit, that worker JVM goes down and I have to restart
 it
  manually. Is this expected ?
 
  2. When I try to make use of all 80 cores on the cluster I get some
 issues
  related to java.io.File not found exception on /tmp/ ? Is there some OS
  limit that how many cores can simultaneously access /tmp from a process ?
 
  Thanks.
  Deb
 
 



Re: Spark 0.9.1 release

2014-03-26 Thread Patrick Wendell
Hey TD,

This one we just merged into master this morning:
https://spark-project.atlassian.net/browse/SPARK-1322

It should definitely go into the 0.9 branch because there was a bug in the
semantics of top(), which at this point is still unreleased in Python.
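
For reference, Scala's top(n) returns the n largest elements in descending
order; a minimal spark-shell illustration:

val nums = sc.parallelize(Seq(10, 4, 2, 12, 3))
nums.top(2)   // Array(12, 10)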

I didn't backport it yet because I figured you might want to do this at a
specific time, so please go ahead and backport it. I'm not sure whether this
warrants another RC.

- Patrick


On Tue, Mar 25, 2014 at 10:47 PM, Mridul Muralidharan mri...@gmail.com wrote:

 On Wed, Mar 26, 2014 at 10:53 AM, Tathagata Das
 tathagata.das1...@gmail.com wrote:
  PR 159 seems like a fairly big patch to me. And quite recent, so its
 impact
  on the scheduling is not clear. It may also depend on other changes that
  may have gotten into the DAGScheduler but not pulled into branch 0.9. I
 am
  not sure it is a good idea to pull that in. We can pull those changes
 later
  for 0.9.2 if required.


 There is no impact on scheduling: it only has an impact on error
 handling - it ensures that you can actually use Spark on YARN in
 multi-tenant clusters more reliably.
 Currently, any reasonably long-running job (30 mins+) working on a
 non-trivial dataset will fail due to accumulated failures in Spark.


 Regards,
 Mridul


 
  TD
 
 
 
 
  On Tue, Mar 25, 2014 at 8:44 PM, Mridul Muralidharan mri...@gmail.com
 wrote:
 
  Forgot to mention this in the earlier request for PR's.
  If there is another RC being cut, please add
  https://github.com/apache/spark/pull/159 to it too (if not done
  already !).
 
  Thanks,
  Mridul
 
  On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
  tathagata.das1...@gmail.com wrote:
Hello everyone,
  
   Since the release of Spark 0.9, we have received a number of important
  bug
   fixes and we would like to make a bug-fix release of Spark 0.9.1. We
 are
   going to cut a release candidate soon and we would love it if people
 test
   it out. We have backported several bug fixes into the 0.9 and updated
  JIRA
   accordingly
 
 https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
  .
   Please let me know if there are fixes that were not backported but you
   would like to see them in 0.9.1.
  
   Thanks!
  
   TD
 



Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-26 Thread yao
@qingyang, Spark 0.9.0 works perfectly for me when accessing (read/write)
data on HDFS. BTW, if you look at pom.xml, you have to choose the yarn profile
to compile Spark so that it won't include protobuf 2.4.1 in your final
jars. Here is the command line we use to compile Spark with Hadoop 2.2:

mvn -U -Dyarn.version=2.2.0 -Dhadoop.version=2.2.0 -Pyarn  -DskipTests
package

Thanks
-Shengzhe


On Wed, Mar 26, 2014 at 12:04 AM, qingyang li liqingyang1...@gmail.com wrote:

 Egor, i encounter the same problem which you have asked in this thread:


 http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCAMrx5DwJVJS0g_FE7_2qwMu4Xf0y5VfV=tlyauv2kh5v4k6...@mail.gmail.com%3E

 have you fixed this problem?

 i am using shark to read a table which i have created on hdfs.

 i found in shark lib_managed directory there are two protobuf*.jar:
 [root@bigdata001 shark-0.9.0]# find . -name proto*.jar

 ./lib_managed/jars/org.spark-project.protobuf/protobuf-java/protobuf-java-2.4.1-shaded.jar

 ./lib_managed/bundles/com.google.protobuf/protobuf-java/protobuf-java-2.5.0.jar


 my hadoop is using protobuf-java-2.5.0.jar .



Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-03-26 Thread Josh Suereth
Cool. It sounds like focusing on sbt-pom-reader would be a good thing for
you guys then.

There are a few... fun... issues around Maven parent projects that are still
floating around with sbt-pom-reader, and they appear to be fundamental Ivy-Maven
hate-based issues.

In any case, while I'm generally swamped with things, let me know anything
I can do to improve the situation. The shading thing frustrates me slightly,
as I think a lot of the shading projects I looked into have been abandoned.
Which plugin are you using?
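
For anyone skimming the thread, the shape of the sbt-pom-reader approach being
discussed is roughly the following; treat the package/trait names and the empty
skeleton as assumptions to check against the plugin's documentation, not a
working Spark build:

// project/MyBuild.scala, skeleton only.
import com.typesafe.sbt.pom.PomBuild

// PomBuild reads the Maven pom.xml hierarchy and materialises a matching sbt
// sub-project per module, so dependencies are declared once (in the poms) while
// sbt-only conveniences (incremental compile, console) can be layered on top.
object MyBuild extends PomBuild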


On Fri, Mar 14, 2014 at 10:21 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 I like the pom-reader approach as well -- in particular, that it lets you
 add extra stuff in your SBT build after loading the dependencies from the
 POM. Profiles would be the one thing missing to be able to pass options
 through.

 Matei

 On Mar 14, 2014, at 10:03 AM, Patrick Wendell pwend...@gmail.com wrote:

  Hey Josh,
 
  I spent some time playing with the sbt-pom-reader and actually I think
  it might suit our needs pretty well. Basically our situation is that
  Maven has more mature support for packaging (shading, official
  assembly plug-in) and is used by pretty much every downstream packager
  of Spark. SBT is used by the developers heavily for incremental
  compilation, the console, etc. So far we've been maintaining entirely
  isolated builds in sbt and maven which is a huge development cost and
  something that's led to divergence over time.
 
  The sbt-pom-reader I think could give us the best of both worlds. But
  it doesn't support profiles in Maven which we use pretty heavily. I
  played a bit this week and tried to get it to work without much luck:
 
 https://github.com/pwendell/sbt-pom-reader/commit/ca1f1f2d6bf8891acb7212facf4807baaca8974d
 
  Any pointers or interest in helping us with that feature? I think with
  that in place we might have a good solution where the dependencies are
  declared in one place (poms) and we use Maven for packaging but we can
  use sbt for day to day development.
 
 
 
  On Fri, Mar 14, 2014 at 9:41 AM, Evan Chan e...@ooyala.com wrote:
  Jsuereth -
 
  Thanks for jumping on this thread.  There are a couple things which
  would help SBT IMHO (w.r.t. issues mentioned here):
 
  - A way of generating a project-wide pom.xml  (I believe make-pom only
  generates it for each sub-project)
- and probably better or more feature-complete POMs, but the Maven
  folks here can speak to that
  - Make the sbt-pom-reader plugin officially part of SBT, and make it
  work well (again the Maven folks need to jump in here, though plugins
  can't be directly translated)
  - Have the sbt-assembly plugin officially maintained by Typesafe, or
  part of SBT  most Maven folks expect not to have to include a
  plugin to generate a fat jar, and it's a pretty essential plugin for
  just about every SBT project
  - Also there is no equivalent (AFAIK) to the maven shader plugin.
 
  I also wish that the dependency-graph plugin was included by default,
  but that's just me  :)
 
  -Evan
 
 
  On Fri, Mar 14, 2014 at 6:47 AM, jsuereth joshua.suer...@gmail.com
 wrote:
  Hey guys -
 
  If there's anything we can do to improve the sbt experience, let me
 know.
  I'd be extremely interested to see how/where there are issues
 integrating
  sbt with the existing Hadoop ecosystem.
 
  Particularly the difficulties in using Sbt + Maven together (something
 which
  tends block more than just spark from adopting sbt).
 
  I'm more than happy to listen and see what we can do on the sbt side
 to make
  this as seamless as possible for all parties.
 
  Thanks!
 
 
 
  --
  View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Necessity-of-Maven-and-SBT-Build-in-Spark-tp2315p5682.html
  Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.
 
 
 
  --
  --
  Evan Chan
  Staff Engineer
  e...@ooyala.com  |




Re: Error executing sql using shark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-26 Thread qingyang li
My Spark also works well with Hadoop 2.2.0, but my Shark does not, because of
the protobuf version problem.

In the Shark directory I found two versions of protobuf, and both are
loaded into the classpath.

[root@bigdata001 shark-0.9.0]# find . -name proto*.jar

./lib_managed/jars/org.spark-project.protobuf/protobuf-java/protobuf-java-2.4.1-shaded.jar
./lib_managed/bundles/com.google.protobuf/protobuf-java/protobuf-java-2.5.0.jar
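
One way to confirm which version actually wins at runtime is to probe, from the
Shark shell, for a class that only exists in protobuf 2.5.x (the Parser
interface was introduced in 2.5.0). A diagnostic sketch:

// com.google.protobuf.Parser exists only in protobuf-java 2.5.x.
try {
  val cls = Class.forName("com.google.protobuf.Parser")
  println("2.5.x visible, served from: " + cls.getProtectionDomain.getCodeSource.getLocation)
} catch {
  case _: ClassNotFoundException => println("only pre-2.5 protobuf classes on the classpath")
}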




2014-03-27 2:26 GMT+08:00 yao yaosheng...@gmail.com:

 @qingyang, spark 0.9.0 works for me perfectly when accessing (read/write)
 data on hdfs. BTW, if you look at pom.xml, you have to choose yarn profile
 to compile spark, so that it won't include protobuf 2.4.1 in your final
 jars. Here is the command line we use to compile spark with hadoop 2.2:

 mvn -U -Dyarn.version=2.2.0 -Dhadoop.version=2.2.0 -Pyarn  -DskipTests
 package

 Thanks
 -Shengzhe


 On Wed, Mar 26, 2014 at 12:04 AM, qingyang li liqingyang1...@gmail.com
 wrote:

  Egor, i encounter the same problem which you have asked in this thread:
 
 
 
 http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCAMrx5DwJVJS0g_FE7_2qwMu4Xf0y5VfV=tlyauv2kh5v4k6...@mail.gmail.com%3E
 
  have you fixed this problem?
 
  i am using shark to read a table which i have created on hdfs.
 
  i found in shark lib_managed directory there are two protobuf*.jar:
  [root@bigdata001 shark-0.9.0]# find . -name proto*.jar
 
 
 ./lib_managed/jars/org.spark-project.protobuf/protobuf-java/protobuf-java-2.4.1-shaded.jar
 
 
 ./lib_managed/bundles/com.google.protobuf/protobuf-java/protobuf-java-2.5.0.jar
 
 
  my hadoop is using protobuf-java-2.5.0.jar .