Memory config issues
All,

I'm getting out-of-memory exceptions in Spark SQL GROUP BY queries. I have plenty of RAM, so I should be able to brute-force my way through, but I can't quite figure out which memory option affects which process. My current memory configuration is the following:

export SPARK_WORKER_MEMORY=83971m
export SPARK_DAEMON_MEMORY=15744m

What does each of these config options do, exactly? Also, why does the executors page of the web UI show no memory usage (0.0 B / 42.4 GB), and where does 42.4 GB come from?

Alex
GraphX doc: triangleCount() requirement overstatement?
According to https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#triangle-counting:

"Note that TriangleCount requires the edges to be in canonical orientation (srcId < dstId)"

But isn't this overstating the requirement? Isn't the requirement really that IF there are duplicate edges between two vertices, THEN those edges must all be in the same direction (in order for the groupEdges() at the beginning of triangleCount() to produce the intermediate results that triangleCount() expects)?

If so, should I enter a JIRA ticket to clarify the documentation? Or is it the case that https://issues.apache.org/jira/browse/SPARK-3650 will make it into Spark 1.3 anyway?
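For readers hitting this, a minimal sketch of what satisfying the documented "canonical orientation" requirement can look like with the GraphX 1.x API (the RDD name rawEdges and the default vertex value are placeholders, not from the thread):

    import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

    // Reorient every edge so that srcId < dstId before counting triangles.
    val canonicalEdges = rawEdges.map { e =>
      if (e.srcId < e.dstId) e else Edge(e.dstId, e.srcId, e.attr)
    }
    val graph = Graph.fromEdges(canonicalEdges, defaultValue = 1)
      .partitionBy(PartitionStrategy.RandomVertexCut)
    val triangleCounts = graph.triangleCount().vertices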
Re: Memory config issues
Akhil,

Ah, very good point. I guess SET spark.sql.shuffle.partitions=1024 should do it.

Alex

On Sun, Jan 18, 2015 at 10:29 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

It's the executor memory (spark.executor.memory), which you can set while creating the Spark context. By default it uses 0.6 of the executor memory (spark.storage.memoryFraction) for storage. Now, to show some memory usage, you need to cache (persist) the RDD. Regarding the OOM exception, you can increase the level of parallelism (also you can increase the number of partitions depending on your data size) and it should be fine.

Thanks
Best Regards

On Mon, Jan 19, 2015 at 11:36 AM, Alessandro Baretta <alexbare...@gmail.com> wrote:

All, I'm getting out-of-memory exceptions in Spark SQL GROUP BY queries. I have plenty of RAM, so I should be able to brute-force my way through, but I can't quite figure out which memory option affects which process. My current memory configuration is the following:

export SPARK_WORKER_MEMORY=83971m
export SPARK_DAEMON_MEMORY=15744m

What does each of these config options do, exactly? Also, why does the executors page of the web UI show no memory usage (0.0 B / 42.4 GB), and where does 42.4 GB come from?

Alex
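As a rough sketch of the two usual ways to apply the setting Alex mentions (assuming an existing SparkContext named sc; the value 1024 is just the number from his message):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Either set it through SQL, as in Alex's message...
    sqlContext.sql("SET spark.sql.shuffle.partitions=1024")

    // ...or programmatically before running the GROUP BY query.
    sqlContext.setConf("spark.sql.shuffle.partitions", "1024")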
Re: RDD order guarantees
Hi Ewan,

Not sure if there is a JIRA ticket (there are too many and I lose track). I chatted briefly with Aaron on this. The way we can solve it is to create a new FileSystem implementation that overrides the listStatus method, and then in the Hadoop conf set fs.file.impl to that. Shouldn't be too hard. Would you be interested in working on it?

On Fri, Jan 16, 2015 at 3:36 PM, Ewan Higgs <ewan.hi...@ugent.be> wrote:

Yes, I am running on a local file system. Is there a bug open for this? Mingyu Kim reported the problem last April:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html

-Ewan

On 01/16/2015 07:41 PM, Reynold Xin wrote:

You are running on a local file system, right? HDFS orders the files based on names, but local file systems often don't. I think that's why there's a difference. We might be able to do a sort and order the partitions when we create an RDD to make this universal, though.

On Fri, Jan 16, 2015 at 8:26 AM, Ewan Higgs <ewan.hi...@ugent.be> wrote:

Hi all,

Quick one: when reading files, are the orders of partitions guaranteed to be preserved? I am finding some weird behaviour where I run sortByKeys() on an RDD (which has 16-byte keys) and write it to disk. If I open a Python shell and run the following:

for part in range(29):
    print map(ord, open('/home/ehiggs/data/terasort_out/part-r-000{0:02}'.format(part), 'r').read(16))

then each partition is in order based on the first value of each partition. I can also call TeraValidate.validate from TeraSort and it is happy with the results. It seems to be on loading the file that the reordering happens.

If this is expected, is there a way to ask Spark nicely to give me the RDD in the order it was saved? This is based on trying to fix my TeraValidate code on this branch:
https://github.com/ehiggs/spark/tree/terasort

Thanks,
Ewan
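The workaround Reynold describes might look roughly like the sketch below. The class name SortedLocalFileSystem is made up for illustration; listStatus and fs.file.impl are the standard Hadoop FileSystem hooks, but this has not been tested against the thread's TeraSort case:

    import org.apache.hadoop.fs.{FileStatus, LocalFileSystem, Path}

    // Hypothetical sketch: a local FileSystem whose directory listings come back
    // sorted by name, so partitions are read in the order they were written.
    class SortedLocalFileSystem extends LocalFileSystem {
      override def listStatus(path: Path): Array[FileStatus] =
        super.listStatus(path).sortBy(_.getPath.getName)
    }

    // Then point the file:// scheme at it, e.g.:
    // sc.hadoopConfiguration.set("fs.file.impl", classOf[SortedLocalFileSystem].getName)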
Re: GraphX doc: triangleCount() requirement overstatement?
We will merge https://issues.apache.org/jira/browse/SPARK-3650 for 1.3. Thanks for the reminder!

On Sun, Jan 18, 2015 at 8:34 PM, Michael Malak <michaelma...@yahoo.com.invalid> wrote:

According to https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#triangle-counting:

"Note that TriangleCount requires the edges to be in canonical orientation (srcId < dstId)"

But isn't this overstating the requirement? Isn't the requirement really that IF there are duplicate edges between two vertices, THEN those edges must all be in the same direction (in order for the groupEdges() at the beginning of triangleCount() to produce the intermediate results that triangleCount() expects)?

If so, should I enter a JIRA ticket to clarify the documentation? Or is it the case that https://issues.apache.org/jira/browse/SPARK-3650 will make it into Spark 1.3 anyway?
Re: Memory config issues
It's the executor memory (spark.executor.memory), which you can set while creating the Spark context. By default it uses 0.6 of the executor memory (spark.storage.memoryFraction) for storage. Now, to show some memory usage, you need to cache (persist) the RDD. Regarding the OOM exception, you can increase the level of parallelism (also you can increase the number of partitions depending on your data size) and it should be fine.

Thanks
Best Regards

On Mon, Jan 19, 2015 at 11:36 AM, Alessandro Baretta <alexbare...@gmail.com> wrote:

All, I'm getting out-of-memory exceptions in Spark SQL GROUP BY queries. I have plenty of RAM, so I should be able to brute-force my way through, but I can't quite figure out which memory option affects which process. My current memory configuration is the following:

export SPARK_WORKER_MEMORY=83971m
export SPARK_DAEMON_MEMORY=15744m

What does each of these config options do, exactly? Also, why does the executors page of the web UI show no memory usage (0.0 B / 42.4 GB), and where does 42.4 GB come from?

Alex
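As a rough illustration of Akhil's points (the memory values and input path below are placeholders, not recommendations), something along these lines makes the Executors page report non-zero storage memory once an RDD is actually cached:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: executor heap and storage fraction set when the context is created.
    val conf = new SparkConf()
      .setAppName("groupby-example")
      .set("spark.executor.memory", "40g")           // per-executor JVM heap
      .set("spark.storage.memoryFraction", "0.6")    // 1.x default: 60% of the heap for cached blocks

    val sc = new SparkContext(conf)

    // The "Memory Used" column stays at 0.0 B until something is cached.
    val data = sc.textFile("hdfs:///some/input/path").cache()
    data.count()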
Re: run time exceptions in Spark 1.2.0 manual build together with OpenStack hadoop driver
Please take a look at SPARK-4048 and SPARK-5108.

Cheers

On Sat, Jan 17, 2015 at 10:26 PM, Gil Vernik <g...@il.ibm.com> wrote:

Hi,

I took the source code of Spark 1.2.0 and tried to build it together with hadoop-openstack.jar (to allow Spark access to OpenStack Swift). I used Hadoop 2.6.0. The build was fine, without problems; however at run time, while trying to access the swift:// namespace, I got an exception:

java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass
    at org.codehaus.jackson.map.introspect.JacksonAnnotationIntrospector.findDeserializationType(JacksonAnnotationIntrospector.java:524)
    at org.codehaus.jackson.map.deser.BasicDeserializerFactory.modifyTypeByAnnotation(BasicDeserializerFactory.java:732)
    ...and the long stack trace goes here

Digging into the problem, I saw the following: Jackson versions 1.9.x are not backward compatible; in particular, they removed the JsonClass annotation. Hadoop 2.6.0 uses jackson-asl version 1.9.13, while Spark references an older version of Jackson. This is the main pom.xml of Spark 1.2.0:

    <dependency>
      <!-- Matches the version of jackson-core-asl pulled in by avro -->
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
      <version>1.8.8</version>
    </dependency>

It references version 1.8.8, which is not compatible with Hadoop 2.6.0. If we change the version to 1.9.13, then everything works fine and there are no run time exceptions while accessing Swift. The following change solves the problem:

    <dependency>
      <!-- Matches the version of jackson-core-asl pulled in by avro -->
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
      <version>1.9.13</version>
    </dependency>

I am trying to resolve this somehow so people will not run into this issue. Is there any particular need in Spark for Jackson 1.8.8 rather than 1.9.13? Can we remove 1.8.8 and put 1.9.13 for Avro? It looks to me that everything works fine when Spark is built with Jackson 1.9.13, but I am not an expert and not sure what should be tested.

Thanks,
Gil Vernik.
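A quick way to confirm which jackson-mapper-asl jar actually ended up on the driver classpath after a change like the one above (just a sketch; it simply prints wherever the ObjectMapper class was loaded from):

    // Sketch: print the jar that org.codehaus.jackson classes are loaded from.
    val source = classOf[org.codehaus.jackson.map.ObjectMapper]
      .getProtectionDomain.getCodeSource

    // After bumping the pom entry, this should point at jackson-mapper-asl-1.9.13.jar.
    println(if (source != null) source.getLocation else "loaded from bootstrap classpath")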
Re: Semantics of LGTM
Maybe we should just avoid LGTM as a single token when it is not actually meant according to Patrick's definition, but anybody can still leave comments like:

"The direction of the PR looks good to me." or "+1 on the direction"
"The build part looks good to me"
...

On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout <k...@eecs.berkeley.edu> wrote:

+1 to Patrick's proposal of strong LGTM semantics. On past projects, I've heard the semantics of LGTM expressed as "I've looked at this thoroughly and take as much ownership as if I wrote the patch myself." My understanding is that this is the level of review we expect for all patches that ultimately go into Spark, so it's important to have a way to concisely describe when this has been done.

Aaron / Sandy, when have you found the weaker LGTM to be useful? In the cases I've seen, if someone else says "I looked at this very quickly and didn't see any glaring problems," it doesn't add any value for subsequent reviewers (someone still needs to take a thorough look).

-Kay

On Sat, Jan 17, 2015 at 8:04 PM, sandy.r...@cloudera.com wrote:

Yeah, the ASF +1 has become partly overloaded to mean both "I would like to see this feature" and "this patch should be committed", although, at least in Hadoop, using +1 on JIRA (as opposed to, say, in a release vote) should unambiguously mean the latter unless qualified in some other way.

I don't have any opinion on the specific characters, but I agree with Aaron that it would be nice to have some sort of abbreviation for both the strong and weak forms of approval.

-Sandy

On Jan 17, 2015, at 7:25 PM, Patrick Wendell <pwend...@gmail.com> wrote:

I think the ASF +1 is *slightly* different than Google's LGTM, because it might convey wanting the patch/feature to be merged but not necessarily saying you did a thorough review and stand behind its technical contents. For instance, I've seen people pile on +1's to try and indicate support for a feature or patch in some projects, even though they didn't do a thorough technical review. This +1 is definitely a useful mechanism.

There is definitely much overlap in the meaning, though, and it's largely because Spark had its own culture around reviews before it was donated to the ASF, so there is a mix of two styles. Nonetheless, I'd prefer to stick with the stronger LGTM semantics I proposed originally (unlike the one Sandy proposed, e.g.). This is what I've seen every project using the LGTM convention do (Google, and some open source projects such as Impala) to indicate technical sign-off.

- Patrick

On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson <ilike...@gmail.com> wrote:

I think I've seen something like "+2 = strong LGTM" and "+1 = weak LGTM; someone else should review before merging." It's nice to have a shortcut which isn't a sentence when talking about weaker forms of LGTM.

On Sat, Jan 17, 2015 at 6:59 PM, sandy.r...@cloudera.com wrote:

I think clarifying these semantics is definitely worthwhile. Maybe this complicates the process with additional terminology, but the way I've used these has been:

+1 - I think this is safe to merge and, barring objections from others, would merge it immediately.

LGTM - I have no concerns about this patch, but I don't necessarily feel qualified to make a final call about it. The "TM" part acknowledges the judgment as a little more subjective.

I think having some concise way to express both of these is useful.

-Sandy

On Jan 17, 2015, at 5:40 PM, Patrick Wendell <pwend...@gmail.com> wrote:

Hey All,

Just wanted to ping about a minor issue - but one that ends up having consequence given Spark's volume of reviews and commits. As much as possible, I think that we should try and gear towards "Google style" LGTM on reviews. What I mean by this is that LGTM has the following semantics: "I know this code well, or I've looked at it close enough to feel confident it should be merged. If there are issues/bugs with this code later on, I feel confident I can help with them."

Here is an alternative semantic: "Based on what I know about this part of the code, I don't see any show-stopper problems with this patch."

The issue with the latter is that it ultimately erodes the significance of LGTM, since subsequent reviewers need to reason about what the person meant by saying LGTM. In contrast, having strong semantics around LGTM can help streamline reviews a lot, especially as reviewers get more experienced and gain trust from the committership.

There are several easy ways to give a more limited endorsement of a patch:
- "I'm not familiar with this code, but style, etc. look good" (general endorsement)
- "The build changes in this code LGTM, but I haven't reviewed the rest" (limited LGTM)

If people are okay with this, I might add a short
Re: run time exceptions in Spark 1.2.0 manual build together with OpenStack hadoop driver
Agree. I think this can / should be fixed with a slightly more conservative version of https://github.com/apache/spark/pull/3938, related to SPARK-5108.

On Sun, Jan 18, 2015 at 3:41 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Please take a look at SPARK-4048 and SPARK-5108.

Cheers

On Sat, Jan 17, 2015 at 10:26 PM, Gil Vernik <g...@il.ibm.com> wrote:

Hi,

I took the source code of Spark 1.2.0 and tried to build it together with hadoop-openstack.jar (to allow Spark access to OpenStack Swift). I used Hadoop 2.6.0. The build was fine, without problems; however at run time, while trying to access the swift:// namespace, I got an exception:

java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass
    at org.codehaus.jackson.map.introspect.JacksonAnnotationIntrospector.findDeserializationType(JacksonAnnotationIntrospector.java:524)
    at org.codehaus.jackson.map.deser.BasicDeserializerFactory.modifyTypeByAnnotation(BasicDeserializerFactory.java:732)
    ...and the long stack trace goes here

Digging into the problem, I saw the following: Jackson versions 1.9.x are not backward compatible; in particular, they removed the JsonClass annotation. Hadoop 2.6.0 uses jackson-asl version 1.9.13, while Spark references an older version of Jackson. This is the main pom.xml of Spark 1.2.0:

    <dependency>
      <!-- Matches the version of jackson-core-asl pulled in by avro -->
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
      <version>1.8.8</version>
    </dependency>

It references version 1.8.8, which is not compatible with Hadoop 2.6.0. If we change the version to 1.9.13, then everything works fine and there are no run time exceptions while accessing Swift. The following change solves the problem:

    <dependency>
      <!-- Matches the version of jackson-core-asl pulled in by avro -->
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
      <version>1.9.13</version>
    </dependency>

I am trying to resolve this somehow so people will not run into this issue. Is there any particular need in Spark for Jackson 1.8.8 rather than 1.9.13? Can we remove 1.8.8 and put 1.9.13 for Avro? It looks to me that everything works fine when Spark is built with Jackson 1.9.13, but I am not an expert and not sure what should be tested.

Thanks,
Gil Vernik.