Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1
Egor, I have run into the same problem you asked about in this thread: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCAMrx5DwJVJS0g_FE7_2qwMu4Xf0y5VfV=tlyauv2kh5v4k6...@mail.gmail.com%3E Have you fixed it? I am using Shark to read a table that I created on HDFS, and I found two protobuf*.jar files in the Shark lib_managed directory:

[root@bigdata001 shark-0.9.0]# find . -name proto*.jar
./lib_managed/jars/org.spark-project.protobuf/protobuf-java/protobuf-java-2.4.1-shaded.jar
./lib_managed/bundles/com.google.protobuf/protobuf-java/protobuf-java-2.5.0.jar

My Hadoop is using protobuf-java-2.5.0.jar.
Re: ALS memory limits
Much of this sounds related to the memory issue mentioned earlier in this thread. Are you using a build that has fixed that? That would be by far the most important thing here.

If the raw memory requirement is 8 GB, the actual heap size necessary could be a lot larger -- object overhead, all the other stuff in memory, overheads within the heap allocation, etc. So I would expect the total memory requirement to be significantly more than 9 GB. Still, this is the *total* requirement across the cluster. Each worker is only loading part of the matrix, so if you have 10 workers I would imagine it roughly cuts the per-worker memory requirement by 10x. This in turn depends on also letting workers use more than their default amount of memory; you may need to increase executor memory here.

Separately, I have observed issues with too many open files and lots of /tmp files. You may have to use ulimit to increase the number of open files allowed.

On Wed, Mar 26, 2014 at 6:06 AM, Debasish Das debasish.da...@gmail.com wrote:

Hi, For our use cases we are looking into 20M x 1M matrices, which is in a similar range to what is outlined in the paper here: http://sandeeptata.blogspot.com/2012/12/sparkler-large-scale-matrix.html Does the exponential runtime growth in Spark ALS outlined by that blog still exist in recommendation.ALS? I am running a Spark cluster of 10 nodes with around 1 TB of total memory and 80 cores. With rank = 50, the memory requirement for ALS should be 20M x 50 doubles on every worker, which is around 8 GB. Even if both factor matrices are cached in memory I should be bounded by ~9 GB, but even with 32 GB per worker I see GC errors... I am debugging the scalability and memory requirements of the algorithm further, but any insights will be very helpful. There are also two other issues:
1. If GC errors are hit, that worker JVM goes down and I have to restart it manually. Is this expected?
2. When I try to make use of all 80 cores on the cluster I get issues related to a java.io "file not found" exception on /tmp/. Is there some OS limit on how many cores can simultaneously access /tmp from a process?
Thanks, Deb
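As a back-of-envelope check on the numbers above -- purely a sketch, with the memory value and file limit below being placeholder assumptions rather than recommendations:

    # Raw size of one 20M x 50 factor matrix of doubles, in bytes
    echo $((20000000 * 50 * 8))      # 8000000000, i.e. ~8 GB before any JVM object overhead

    # Give executors more than the default heap (Spark 0.9 property; 24g is just an example value)
    export SPARK_JAVA_OPTS="-Dspark.executor.memory=24g"

    # Raise the per-process open-file limit before starting the workers
    ulimit -n 65536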
Re: ALS memory limits
Thanks Sean. Looking into executor memory options now...

I am at incubator-spark HEAD. Does that have all the fixes, or do I need spark HEAD? I can deploy the spark HEAD as well... I am not running implicit feedback yet... I remember the memory enhancements were mainly for implicit feedback, right?

For ulimit let me look into the CentOS settings... I am curious how MapReduce resolves it... by using 1 core from 1 process? I am running 2 TB YARN jobs as well for ETL, pre-processing, etc., and have not seen the "too many files open" issue there yet.

When there is a GC error, the worker dies... that's a mystery as well... any insights from the Spark core team? A YARN container gets killed if GC boundaries are about to be hit... could similar ideas be used here as well?

Also, which tool do we use for memory debugging in Spark?

On Mar 26, 2014 1:45 AM, Sean Owen so...@cloudera.com wrote:

Much of this sounds related to the memory issue mentioned earlier in this thread. Are you using a build that has fixed that? That would be by far the most important thing here.

If the raw memory requirement is 8 GB, the actual heap size necessary could be a lot larger -- object overhead, all the other stuff in memory, overheads within the heap allocation, etc. So I would expect the total memory requirement to be significantly more than 9 GB. Still, this is the *total* requirement across the cluster. Each worker is only loading part of the matrix, so if you have 10 workers I would imagine it roughly cuts the per-worker memory requirement by 10x. This in turn depends on also letting workers use more than their default amount of memory; you may need to increase executor memory here.

Separately, I have observed issues with too many open files and lots of /tmp files. You may have to use ulimit to increase the number of open files allowed.

On Wed, Mar 26, 2014 at 6:06 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, For our use cases we are looking into 20M x 1M matrices, which is in a similar range to what is outlined in the paper here: http://sandeeptata.blogspot.com/2012/12/sparkler-large-scale-matrix.html Does the exponential runtime growth in Spark ALS outlined by that blog still exist in recommendation.ALS? I am running a Spark cluster of 10 nodes with around 1 TB of total memory and 80 cores. With rank = 50, the memory requirement for ALS should be 20M x 50 doubles on every worker, which is around 8 GB. Even if both factor matrices are cached in memory I should be bounded by ~9 GB, but even with 32 GB per worker I see GC errors... I am debugging the scalability and memory requirements of the algorithm further, but any insights will be very helpful. There are also two other issues: 1. If GC errors are hit, that worker JVM goes down and I have to restart it manually. Is this expected? 2. When I try to make use of all 80 cores on the cluster I get issues related to a java.io "file not found" exception on /tmp/. Is there some OS limit on how many cores can simultaneously access /tmp from a process? Thanks, Deb
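On the memory-debugging question, the usual JVM tools are the first stop; nothing Spark-specific here, and the PID and sampling interval below are placeholders:

    # Live-object histogram of a worker/executor JVM
    jmap -histo:live <pid> | head -n 30

    # GC behaviour over time, sampled every 5 seconds
    jstat -gcutil <pid> 5000

    # Keep a heap dump around when an executor does die with an OOM
    export SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError"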
Re: Spark 0.9.1 release
Hey TD,

This one we just merged into master this morning: https://spark-project.atlassian.net/browse/SPARK-1322 It should definitely go into the 0.9 branch because there was a bug in the semantics of top(), which at this point is unreleased in Python. I didn't backport it yet because I figured you might want to do this at a specific time, so please go ahead and backport it. I'm not sure whether this warrants another RC.

- Patrick

On Tue, Mar 25, 2014 at 10:47 PM, Mridul Muralidharan mri...@gmail.com wrote:

On Wed, Mar 26, 2014 at 10:53 AM, Tathagata Das tathagata.das1...@gmail.com wrote: PR 159 seems like a fairly big patch to me, and quite recent, so its impact on scheduling is not clear. It may also depend on other changes that have gone into the DAGScheduler but not been pulled into branch 0.9. I am not sure it is a good idea to pull that in; we can pull those changes later for 0.9.2 if required.

There is no impact on scheduling: it only has an impact on error handling. It ensures that you can actually use Spark on YARN in multi-tenant clusters more reliably. Currently, any reasonably long-running job (30 min+) working on a non-trivial dataset will fail due to accumulated failures in Spark.

Regards, Mridul

TD

On Tue, Mar 25, 2014 at 8:44 PM, Mridul Muralidharan mri...@gmail.com wrote: Forgot to mention this in the earlier request for PRs: if there is another RC being cut, please add https://github.com/apache/spark/pull/159 to it too (if not done already!). Thanks, Mridul

On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Hello everyone, Since the release of Spark 0.9 we have received a number of important bug fixes, and we would like to make a bug-fix release, Spark 0.9.1. We are going to cut a release candidate soon and we would love it if people test it out. We have backported several bug fixes into the 0.9 branch and updated JIRA accordingly: https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed) Please let me know if there are fixes that were not backported but that you would like to see in 0.9.1. Thanks! TD
Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1
@qingyang, Spark 0.9.0 works perfectly for me when accessing (read/write) data on HDFS. BTW, if you look at pom.xml, you have to choose the yarn profile to compile Spark so that it won't include protobuf 2.4.1 in your final jars. Here is the command line we use to compile Spark with Hadoop 2.2:

mvn -U -Dyarn.version=2.2.0 -Dhadoop.version=2.2.0 -Pyarn -DskipTests package

Thanks,
-Shengzhe

On Wed, Mar 26, 2014 at 12:04 AM, qingyang li liqingyang1...@gmail.com wrote:

Egor, I have run into the same problem you asked about in this thread: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCAMrx5DwJVJS0g_FE7_2qwMu4Xf0y5VfV=tlyauv2kh5v4k6...@mail.gmail.com%3E Have you fixed it? I am using Shark to read a table that I created on HDFS, and I found two protobuf*.jar files in the Shark lib_managed directory:

[root@bigdata001 shark-0.9.0]# find . -name proto*.jar
./lib_managed/jars/org.spark-project.protobuf/protobuf-java/protobuf-java-2.4.1-shaded.jar
./lib_managed/bundles/com.google.protobuf/protobuf-java/protobuf-java-2.5.0.jar

My Hadoop is using protobuf-java-2.5.0.jar.
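A quick way to confirm what actually ended up on the classpath after such a build -- the assembly path below is only an example; substitute whatever your build produced:

    # List the protobuf classes bundled into the Spark assembly
    jar tf assembly/target/scala-2.10/spark-assembly-*.jar | grep -i protobuf | head

    # And check which protobuf jar the Hadoop installation itself ships
    find "$HADOOP_HOME" -name 'protobuf-java-*.jar'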
Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark
Cool. It sounds like focusing on sbt-pom-reader would be a good thing for you guys then. There are a few... fun... issues around Maven parent projects that are still running around with sbt-pom-reader, which appear to be fundamental Ivy-Maven hate-based issues. In any case, while I'm generally swamped with things, let me know anything I can do to improve the situation. The shading thing frustrates me slightly, as I think a lot of the projects that do shading have been abandoned (the ones I looked into). Which plugin are you using?

On Fri, Mar 14, 2014 at 10:21 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

I like the pom-reader approach as well -- in particular, that it lets you add extra stuff in your SBT build after loading the dependencies from the POM. Profiles would be the one thing missing to be able to pass options through. Matei

On Mar 14, 2014, at 10:03 AM, Patrick Wendell pwend...@gmail.com wrote:

Hey Josh, I spent some time playing with the sbt-pom-reader and actually I think it might suit our needs pretty well. Basically our situation is that Maven has more mature support for packaging (shading, the official assembly plug-in) and is used by pretty much every downstream packager of Spark, while SBT is used heavily by the developers for incremental compilation, the console, etc. So far we've been maintaining entirely isolated builds in sbt and Maven, which is a huge development cost and something that has led to divergence over time. The sbt-pom-reader, I think, could give us the best of both worlds, but it doesn't support profiles in Maven, which we use pretty heavily. I played a bit this week and tried to get it to work without much luck: https://github.com/pwendell/sbt-pom-reader/commit/ca1f1f2d6bf8891acb7212facf4807baaca8974d Any pointers or interest in helping us with that feature? I think with that in place we might have a good solution where the dependencies are declared in one place (the POMs) and we use Maven for packaging, but we can use sbt for day-to-day development.

On Fri, Mar 14, 2014 at 9:41 AM, Evan Chan e...@ooyala.com wrote:

Jsuereth - Thanks for jumping on this thread. There are a couple of things which would help SBT IMHO (w.r.t. the issues mentioned here):
- A way of generating a project-wide pom.xml (I believe make-pom only generates one for each sub-project) -- and probably better or more feature-complete POMs, but the Maven folks here can speak to that.
- Make the sbt-pom-reader plugin officially part of SBT, and make it work well (again, the Maven folks need to jump in here, though plugins can't be directly translated).
- Have the sbt-assembly plugin officially maintained by Typesafe, or made part of SBT -- most Maven folks expect not to have to include a plugin to generate a fat jar, and it's a pretty essential plugin for just about every SBT project.
- Also, there is no equivalent (AFAIK) to the Maven shade plugin.

I also wish that the dependency-graph plugin was included by default, but that's just me :)
-Evan

On Fri, Mar 14, 2014 at 6:47 AM, jsuereth joshua.suer...@gmail.com wrote:

Hey guys - If there's anything we can do to improve the sbt experience, let me know. I'd be extremely interested to see how/where there are issues integrating sbt with the existing Hadoop ecosystem, particularly the difficulties in using sbt + Maven together (something which tends to block more than just Spark from adopting sbt). I'm more than happy to listen and see what we can do on the sbt side to make this as seamless as possible for all parties. Thanks!
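To make the split described above concrete, this is roughly what the two workflows look like side by side (a sketch only; the exact profiles and tasks depend on the build in question):

    # Maven: packaging/assembly, as used by downstream packagers
    mvn -Pyarn -Dhadoop.version=2.2.0 -DskipTests package

    # SBT: day-to-day development -- incremental compilation and a REPL
    sbt/sbt ~compile     # recompile continuously on source changes
    sbt/sbt console      # Scala REPL with the project on the classpath
    sbt/sbt make-pom     # emit a POM per sub-project, as noted above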
Re: Error executing sql using shark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1
My Spark works well with Hadoop 2.2.0, but my Shark does not, because of the protobuf version problem. In the Shark directory I found two versions of protobuf, and both are loaded onto the classpath:

[root@bigdata001 shark-0.9.0]# find . -name proto*.jar
./lib_managed/jars/org.spark-project.protobuf/protobuf-java/protobuf-java-2.4.1-shaded.jar
./lib_managed/bundles/com.google.protobuf/protobuf-java/protobuf-java-2.5.0.jar

2014-03-27 2:26 GMT+08:00 yao yaosheng...@gmail.com:

@qingyang, Spark 0.9.0 works perfectly for me when accessing (read/write) data on HDFS. BTW, if you look at pom.xml, you have to choose the yarn profile to compile Spark so that it won't include protobuf 2.4.1 in your final jars. Here is the command line we use to compile Spark with Hadoop 2.2:

mvn -U -Dyarn.version=2.2.0 -Dhadoop.version=2.2.0 -Pyarn -DskipTests package

Thanks, -Shengzhe

On Wed, Mar 26, 2014 at 12:04 AM, qingyang li liqingyang1...@gmail.com wrote: Egor, I have run into the same problem you asked about in this thread: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCAMrx5DwJVJS0g_FE7_2qwMu4Xf0y5VfV=tlyauv2kh5v4k6...@mail.gmail.com%3E Have you fixed it? I am using Shark to read a table that I created on HDFS, and I found two protobuf*.jar files in the Shark lib_managed directory:

[root@bigdata001 shark-0.9.0]# find . -name proto*.jar
./lib_managed/jars/org.spark-project.protobuf/protobuf-java/protobuf-java-2.4.1-shaded.jar
./lib_managed/bundles/com.google.protobuf/protobuf-java/protobuf-java-2.5.0.jar

My Hadoop is using protobuf-java-2.5.0.jar.
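It is worth checking whether those two jars actually collide: the 2.4.1 jar is a shaded build, and if its classes were relocated out of the com.google.protobuf package it cannot shadow the 2.5.0 classes. A quick diagnostic, nothing more than a sketch:

    # If nothing is listed under com/google/protobuf, the shaded 2.4.1 jar
    # does not conflict with protobuf-java-2.5.0.jar on the classpath.
    unzip -l ./lib_managed/jars/org.spark-project.protobuf/protobuf-java/protobuf-java-2.4.1-shaded.jar \
        | grep com/google/protobuf | head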