Will Spark-SQL support vectorized query engine someday?
Hi, Correct me if I'm wrong. It looks like the current version of Spark-SQL is a *tuple-at-a-time* engine: each physical operator produces one tuple at a time by recursively calling execute on its child. There are papers that illustrate the benefits of a vectorized query engine, and Hive's Stinger initiative also embraces this style. So the question is: will Spark-SQL support vectorized query execution someday? Thanks
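The distinction above can be sketched generically. This is not Spark code — just a toy illustration of the two execution models: a tuple-at-a-time operator pulls one row per call from its child, while a vectorized operator pulls whole batches and loops tightly over them, amortizing per-call overhead.

```python
def rows():
    for i in range(10):
        yield i

# Tuple-at-a-time: one virtual call ("next") per row per operator.
def filter_tuple_at_a_time(child):
    for row in child:
        if row % 2 == 0:
            yield row

# Vectorized: group rows into batches first...
def batches(source, batch_size=4):
    batch = []
    for row in source:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# ...then one call per batch; the inner loop is tight and branch-friendly.
def filter_vectorized(child_batches):
    for batch in child_batches:
        yield [row for row in batch if row % 2 == 0]

print(list(filter_tuple_at_a_time(rows())))                        # [0, 2, 4, 6, 8]
print([r for b in filter_vectorized(batches(rows())) for r in b])  # same rows
```

Both produce the same rows; the difference is purely in how many operator calls are made per row.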
GraphX ShortestPaths backwards?
GraphX ShortestPaths seems to be following edges backwards instead of forwards:

import org.apache.spark.graphx._

val g = Graph(sc.makeRDD(Array((1L, ""), (2L, ""), (3L, ""))),
              sc.makeRDD(Array(Edge(1L, 2L, ""), Edge(2L, 3L, ""))))

lib.ShortestPaths.run(g, Array(3)).vertices.collect
res1: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3 -> 0)), (2,Map()))

lib.ShortestPaths.run(g, Array(1)).vertices.collect
res2: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), (3,Map(1 -> 2)), (2,Map(1 -> 1)))

If I am not mistaken in my assessment, then I believe the following changes will make it run forwards:

Change one occurrence of src to dst in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L64
Change three occurrences of dst to src in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L65

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
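The directionality being described can be checked with a plain BFS sketch (not GraphX code; the graph 1 -> 2 -> 3 matches the example above):

```python
from collections import deque

edges = [(1, 2), (2, 3)]

def hops_from(landmark, edge_list, forward=True):
    # Build an adjacency list in the chosen direction.
    adj = {}
    for s, d in edge_list:
        if forward:
            adj.setdefault(s, []).append(d)
        else:
            adj.setdefault(d, []).append(s)
    # Standard BFS from the landmark.
    dist, queue = {landmark: 0}, deque([landmark])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, []):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

# BFS from landmark 3 along forward edges reaches nothing (3 has no
# out-edges), matching the observed run(g, Array(3)) result:
print(hops_from(3, edges, forward=True))   # {3: 0}
# BFS from 3 along *reversed* edges gives the expected distances to 3:
print(hops_from(3, edges, forward=False))  # {3: 0, 2: 1, 1: 2}
# BFS from 1 along forward edges matches the observed run(g, Array(1)):
print(hops_from(1, edges, forward=True))   # {1: 0, 2: 1, 3: 2}
```

This reproduces the reported behaviour: the observed results correspond to distances *from* the landmark along forward edges, i.e. distances to the landmark along reversed edges.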
Re: Join the developer community of spark
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Enjoy! Alex On Mon, Jan 19, 2015 at 6:44 PM, Jeff Wang jingjingwang...@gmail.com wrote: Hi: I would like to contribute to the code of spark. Can I join the community? Thanks, Jeff
Is there any way to support multiple users executing SQL on thrift server?
Is there any way to support multiple users executing SQL on one thrift server? I think there are some problems in Spark 1.2.0, for example:

1. Start the thrift server as user A
2. Connect to the thrift server via beeline as user B
3. Execute "insert into table dest select … from table src"

Then we found these items on HDFS:

drwxr-xr-x - B supergroup 0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1
drwxr-xr-x - B supergroup 0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary
drwxr-xr-x - B supergroup 0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0
drwxr-xr-x - A supergroup 0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/_temporary
drwxr-xr-x - A supergroup 0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/task_201501161642_0022_m_00
-rw-r--r-- 3 A supergroup 2671 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/task_201501161642_0022_m_00/part-0

You can see that all the temporary paths created on the driver side (the thrift server side) are owned by user B (which is what we expected), but all the output data created on the executor side is owned by user A (which is NOT what we expected). The wrong owner on the output data causes an org.apache.hadoop.security.AccessControlException when the driver side moves the output data into the dest table. Does anyone know how to resolve this problem?
Re: Optimize encoding/decoding strings when using Parquet
Added a JIRA to track https://issues.apache.org/jira/browse/SPARK-5309 -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-tp10141p10189.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: Will Spark-SQL support vectorized query engine someday?
It will probably eventually make its way into part of the query engine, one way or another. Note that in general there are a lot of other lower-hanging fruits before you have to do vectorization. As far as I know, Hive doesn't really have vectorization, because the vectorization in Hive is simply processing everything in small batches, in order to avoid the virtual function call overhead, and hoping the JVM can unroll some of the loops. There is no SIMD involved. Something that is pretty useful, which isn't exactly vectorization but comes from similar lines of research, is being able to push predicates down into the columnar compression encoding. For example, one can turn string comparisons into integer comparisons. These will probably give much larger performance improvements in common queries. On Mon, Jan 19, 2015 at 6:27 PM, Xuelin Cao xuelincao2...@gmail.com wrote: Hi, Correct me if I were wrong. It looks like, the current version of Spark-SQL is *tuple-at-a-time* module. Basically, each time the physical operator produces a tuple by recursively call child-execute . There are papers that illustrate the benefits of vectorized query engine. And Hive-Stinger also embrace this style. So, the question is, will Spark-SQL give a support to vectorized query execution someday? Thanks
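The predicate pushdown mentioned above can be sketched generically (this is not Spark or Parquet code; the column data is illustrative). With a dictionary-encoded column, a string predicate is evaluated once against the dictionary, so the per-row scan needs only integer comparisons:

```python
# Hypothetical dictionary-encoded column: distinct values + per-row codes.
dictionary = ["apple", "banana", "cherry"]
codes = [0, 2, 1, 1, 0, 2, 0]          # each row's value as a dictionary index

# Evaluate the predicate value == "banana" once over the dictionary...
matching_codes = {i for i, v in enumerate(dictionary) if v == "banana"}

# ...then scan the rows with integer comparisons only: no string
# decoding or string comparison in the per-row loop.
matches = [row for row, code in enumerate(codes) if code in matching_codes]
print(matches)  # [2, 3] -- rows 2 and 3 hold "banana"
```

The string comparison cost is paid once per distinct value instead of once per row, which is where the large wins on common queries come from.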
Re: Memory config issues
On Mon, Jan 19, 2015 at 6:29 AM, Akhil Das ak...@sigmoidanalytics.com wrote: It's the executor memory (spark.executor.memory), which you can set while creating the Spark context. By default it uses 0.6% of the executor memory. (Correction: it uses 0.6, i.e. 60%, not 0.6%.)
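To make the fraction concrete, here is the arithmetic for a hypothetical 4 GiB executor (the 4g value is illustrative, not from the thread):

```python
# With spark.executor.memory=4g and the default fraction of 0.6,
# roughly this much memory is available for caching:
executor_memory_mib = 4 * 1024
storage_fraction = 0.6          # the 0.6 (60%) discussed above
print(int(executor_memory_mib * storage_fraction))  # 2457 MiB
```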
Re: Spark client reconnect to driver in yarn-cluster deployment mode
In yarn-client mode it only controls the environment of the executor launcher. So you either use yarn-client mode, and then your app keeps running and controlling the process; or you use yarn-cluster mode, and then you send a jar to YARN, and that jar should have code to report the result back to you. *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Thu, Jan 15, 2015 at 1:52 PM, preeze etan...@gmail.com wrote: From the official spark documentation (http://spark.apache.org/docs/1.2.0/running-on-yarn.html): In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. Is there any designed way that the client connects back to the driver (still running in YARN) for collecting results at a later stage? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-client-reconnect-to-driver-in-yarn-cluster-deployment-mode-tp10122.html
Re: Optimize encoding/decoding strings when using Parquet
Here are some timings showing the effect of caching the last Binary->String conversion. Query times are reduced significantly, and the reduction in timing variation due to less garbage is very significant. The sample queries select various columns, apply some filtering and then aggregate.

Spark 1.2.0:
Query 1 mean time 8353.3 millis, std deviation 480.91511147441025 millis
Query 2 mean time 8677.6 millis, std deviation 3193.345518417949 millis
Query 3 mean time 11302.5 millis, std deviation 2989.9406998950476 millis
Query 4 mean time 10537.0 millis, std deviation 5166.024024549462 millis
Query 5 mean time 9559.9 millis, std deviation 4141.487667493409 millis
Query 6 mean time 12638.1 millis, std deviation 3639.4505522430477 millis

Spark 1.2.0 with cached last Binary->String conversion:
Query 1 mean time 5118.9 millis, std deviation 549.6670608448152 millis
Query 2 mean time 3761.3 millis, std deviation 202.57785883183013 millis
Query 3 mean time 7358.8 millis, std deviation 242.58918176850162 millis
Query 4 mean time 4173.5 millis, std deviation 179.802515122688 millis
Query 5 mean time 3857.0 millis, std deviation 140.71957930579526 millis
Query 6 mean time 7512.0 millis, std deviation 198.32633040858022 millis

-- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-tp10141p10193.html
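A minimal sketch of the optimization being benchmarked: dictionary-encoded Parquet columns tend to repeat the same binary value across adjacent rows, so caching the last bytes-to-string conversion avoids re-decoding (and the garbage it creates). The class and data here are illustrative, not the actual Spark or Parquet classes.

```python
class CachingStringConverter:
    def __init__(self):
        self._last_bytes = None
        self._last_str = None

    def convert(self, raw: bytes) -> str:
        # Reuse the previous result when the same binary value repeats,
        # skipping both the UTF-8 decode and the new String allocation.
        if self._last_bytes is not None and raw == self._last_bytes:
            return self._last_str
        self._last_bytes = raw
        self._last_str = raw.decode("utf-8")
        return self._last_str

conv = CachingStringConverter()
values = [b"GBP", b"GBP", b"USD", b"USD", b"USD"]
decoded = [conv.convert(v) for v in values]
print(decoded)  # ['GBP', 'GBP', 'USD', 'USD', 'USD']
```

Only two decodes happen for the five rows above, which is the mechanism behind the reduced garbage and tighter timings in the numbers reported.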
Re: RDD order guarantees
Hi Reynold. I'll take a look. SPARK-5300 is open for this issue. -Ewan On 19/01/15 08:39, Reynold Xin wrote: Hi Ewan, Not sure if there is a JIRA ticket (there are too many that I lose track). I chatted briefly with Aaron on this. The way we can solve it is to create a new FileSystem implementation that overrides the listStatus method, and then in Hadoop Conf set the fs.file.impl to that. Shouldn't be too hard. Would you be interested in working on it? On Fri, Jan 16, 2015 at 3:36 PM, Ewan Higgs ewan.hi...@ugent.be wrote: Yes, I am running on a local file system. Is there a bug open for this? Mingyu Kim reported the problem last April: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html -Ewan On 01/16/2015 07:41 PM, Reynold Xin wrote: You are running on a local file system right? HDFS orders the files based on names, but local file systems often don't. I think that's why the difference. We might be able to do a sort and order the partitions when we create a RDD to make this universal though. On Fri, Jan 16, 2015 at 8:26 AM, Ewan Higgs ewan.hi...@ugent.be wrote: Hi all, Quick one: when reading files, are the orders of partitions guaranteed to be preserved? I am finding some weird behaviour where I run sortByKeys() on an RDD (which has 16 byte keys) and write it to disk. If I open a python shell and run the following:

for part in range(29):
    print map(ord, open('/home/ehiggs/data/terasort_out/part-r-000{0:02}'.format(part), 'r').read(16))

Then each partition is in order based on the first value of each partition. I can also call TeraValidate.validate from TeraSort and it is happy with the results. It seems to be on loading the file that the reordering happens. If this is expected, is there a way to ask Spark nicely to give me the RDD in the order it was saved?
This is based on trying to fix my TeraValidate code on this branch: https://github.com/ehiggs/spark/tree/terasort Thanks, Ewan
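In the spirit of the discussion above, a workaround until partitions are ordered universally is to list the part files and sort them by name before reading, which is effectively the ordering HDFS gives. The paths here are illustrative:

```python
# An unordered directory listing, as a local file system might return it.
files = [
    "terasort_out/part-r-00002",
    "terasort_out/part-r-00000",
    "terasort_out/part-r-00001",
]

# Lexicographic order of the zero-padded names matches partition order,
# so a plain sort restores the order the RDD was saved in.
ordered = sorted(files)
print(ordered)  # part-r-00000, part-r-00001, part-r-00002
```

This relies on the part file names being zero-padded, which Hadoop's output committers produce; arbitrary file names would need a numeric sort key instead.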
Re: Semantics of LGTM
Patrick's original proposal LGTM :). However, until now I have been under the impression that LGTM puts special emphasis on the TM part. That said, I will be okay/happy (or responsible) for the patch, if it goes in. Prashant Sharma On Sun, Jan 18, 2015 at 2:33 PM, Reynold Xin r...@databricks.com wrote: Maybe just to avoid LGTM as a single token when it is not actually according to Patrick's definition, but anybody can still leave comments like: The direction of the PR looks good to me. or +1 on the direction The build part looks good to me ... On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout k...@eecs.berkeley.edu wrote: +1 to Patrick's proposal of strong LGTM semantics. On past projects, I've heard the semantics of LGTM expressed as I've looked at this thoroughly and take as much ownership as if I wrote the patch myself. My understanding is that this is the level of review we expect for all patches that ultimately go into Spark, so it's important to have a way to concisely describe when this has been done. Aaron / Sandy, when have you found the weaker LGTM to be useful? In the cases I've seen, if someone else says I looked at this very quickly and didn't see any glaring problems, it doesn't add any value for subsequent reviewers (someone still needs to take a thorough look). -Kay On Sat, Jan 17, 2015 at 8:04 PM, sandy.r...@cloudera.com wrote: Yeah, the ASF +1 has become partly overloaded to mean both I would like to see this feature and this patch should be committed, although, at least in Hadoop, using +1 on JIRA (as opposed to, say, in a release vote) should unambiguously mean the latter unless qualified in some other way. I don't have any opinion on the specific characters, but I agree with Aaron that it would be nice to have some sort of abbreviation for both the strong and weak forms of approval. 
-Sandy On Jan 17, 2015, at 7:25 PM, Patrick Wendell pwend...@gmail.com wrote: I think the ASF +1 is *slightly* different than Google's LGTM, because it might convey wanting the patch/feature to be merged but not necessarily saying you did a thorough review and stand behind its technical contents. For instance, I've seen people pile on +1's to try and indicate support for a feature or patch in some projects, even though they didn't do a thorough technical review. This +1 is definitely a useful mechanism. There is definitely much overlap in the meaning, though, and it's largely because Spark had its own culture around reviews before it was donated to the ASF, so there is a mix of two styles. Nonetheless, I'd prefer to stick with the stronger LGTM semantics I proposed originally (unlike the one Sandy proposed, e.g.). This is what I've seen every project using the LGTM convention do (Google, and some open source projects such as Impala) to indicate technical sign-off. - Patrick On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson ilike...@gmail.com wrote: I think I've seen something like +2 = strong LGTM and +1 = weak LGTM; someone else should review before. It's nice to have a shortcut which isn't a sentence when talking about weaker forms of LGTM. On Sat, Jan 17, 2015 at 6:59 PM, sandy.r...@cloudera.com wrote: I think clarifying these semantics is definitely worthwhile. Maybe this complicates the process with additional terminology, but the way I've used these has been: +1 - I think this is safe to merge and, barring objections from others, would merge it immediately. LGTM - I have no concerns about this patch, but I don't necessarily feel qualified to make a final call about it. The TM part acknowledges the judgment as a little more subjective. I think having some concise way to express both of these is useful. 
-Sandy On Jan 17, 2015, at 5:40 PM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Just wanted to ping about a minor issue - but one that ends up having consequence given Spark's volume of reviews and commits. As much as possible, I think that we should try and gear towards Google Style LGTM on reviews. What I mean by this is that LGTM has the following semantics: I know this code well, or I've looked at it close enough to feel confident it should be merged. If there are issues/bugs with this code later on, I feel confident I can help with them. Here is an alternative semantic: Based on what I know about this part of the code, I don't see any show-stopper problems with this patch. The issue with the latter is that it ultimately erodes the significance of LGTM, since subsequent reviewers need to reason about what the person meant by saying LGTM. In contrast, having strong semantics around LGTM can help
Re: Optimize encoding/decoding strings when using Parquet
Looking at the Parquet code, it looks like hooks are already in place to support this. In particular, PrimitiveConverter has the methods hasDictionarySupport and addValueFromDictionary for this purpose. These are not used by CatalystPrimitiveConverter. I think it would be pretty straightforward to add this. Has anyone considered this? Shall I put a pull request together for it? Mick -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-tp10141p10195.html
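An illustrative sketch (not the actual Parquet API) of what a dictionary-aware converter buys: when a page dictionary is set, each distinct binary value is decoded once, and rows are then served by integer id with no further decoding. The class and method names below only mirror the roles of hasDictionarySupport / addValueFromDictionary described above.

```python
class DictionaryAwareConverter:
    def __init__(self):
        self._decoded = None

    def has_dictionary_support(self) -> bool:
        # Mirrors the role of PrimitiveConverter.hasDictionarySupport:
        # signals that this converter can consume dictionary ids directly.
        return True

    def set_dictionary(self, entries):
        # Decode every dictionary entry exactly once per page.
        self._decoded = [e.decode("utf-8") for e in entries]

    def add_value_from_dictionary(self, dictionary_id: int) -> str:
        # Mirrors the role of addValueFromDictionary: a plain array lookup,
        # no per-row byte decoding.
        return self._decoded[dictionary_id]

conv = DictionaryAwareConverter()
conv.set_dictionary([b"red", b"green", b"blue"])
print([conv.add_value_from_dictionary(i) for i in [2, 0, 0, 1]])
# ['blue', 'red', 'red', 'green']
```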
Re: Optimize encoding/decoding strings when using Parquet
Definitely go for a pull request! On Mon, Jan 19, 2015 at 10:10 AM, Mick Davies michael.belldav...@gmail.com wrote: Looking at Parquet code - it looks like hooks are already in place to support this. In particular PrimitiveConverter has methods hasDictionarySupport and addValueFromDictionary for this purpose. These are not used by CatalystPrimitiveConverter. I think that it would be pretty straightforward to add this. Has anyone considered this? Shall I get a pull request together for it. Mick
Re: GraphX vertex partition/location strategy
No - the vertices are hash-partitioned onto workers independently of the edges. It would be nice for each vertex to be on the worker with the most adjacent edges, but we haven't done this yet since it would add a lot of complexity to avoid load imbalance while reducing the overall communication by a small factor. We refer to the number of partitions containing adjacent edges for a particular vertex as the vertex's replication factor. I think the typical replication factor for power-law graphs with 100-200 partitions is 10-15, and placing the vertex at the ideal location would only reduce the replication factor by 1. Ankur http://www.ankurdave.com/ On Mon, Jan 19, 2015 at 12:20 PM, Michael Malak michaelma...@yahoo.com.invalid wrote: Does GraphX make an effort to co-locate vertices onto the same workers as the majority (or even some) of its edges?
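The replication factor defined above can be computed with a small sketch: for each vertex, count the distinct edge partitions that contain one of its adjacent edges. The partitioning function here is a hypothetical stand-in for a GraphX PartitionStrategy, and the graph is illustrative.

```python
from collections import defaultdict

edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
num_partitions = 2

def partition_of(edge):
    # Hypothetical stand-in for a GraphX PartitionStrategy:
    # hash the edge onto one of the edge partitions.
    return hash(edge) % num_partitions

# Collect, for each vertex, the set of partitions holding its edges.
parts_per_vertex = defaultdict(set)
for edge in edges:
    p = partition_of(edge)
    for v in edge:
        parts_per_vertex[v].add(p)

# A vertex's replication factor is the number of such partitions,
# since its attribute must be shipped to each of them.
replication = {v: len(ps) for v, ps in parts_per_vertex.items()}
print(replication)
```

Placing each vertex on the worker holding most of its edges would shave only one off each of these counts, which is why the small expected gain was judged not worth the load-balancing complexity.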
Re: GraphX vertex partition/location strategy
But wouldn't the gain be greater under something similar to EdgePartition1D (but perhaps better load-balanced based on number of edges for each vertex) and an algorithm that primarily follows edges in the forward direction? From: Ankur Dave ankurd...@gmail.com To: Michael Malak michaelma...@yahoo.com Cc: dev@spark.apache.org dev@spark.apache.org Sent: Monday, January 19, 2015 2:08 PM Subject: Re: GraphX vertex partition/location strategy No - the vertices are hash-partitioned onto workers independently of the edges. It would be nice for each vertex to be on the worker with the most adjacent edges, but we haven't done this yet since it would add a lot of complexity to avoid load imbalance while reducing the overall communication by a small factor. We refer to the number of partitions containing adjacent edges for a particular vertex as the vertex's replication factor. I think the typical replication factor for power-law graphs with 100-200 partitions is 10-15, and placing the vertex at the ideal location would only reduce the replication factor by 1. Ankur On Mon, Jan 19, 2015 at 12:20 PM, Michael Malak michaelma...@yahoo.com.invalid wrote: Does GraphX make an effort to co-locate vertices onto the same workers as the majority (or even some) of its edges?
Re: Semantics of LGTM
The wiki does not seem to be operational ATM, but I will do this when it is back up. On Mon, Jan 19, 2015 at 12:00 PM, Patrick Wendell pwend...@gmail.com wrote: Okay - so given all this I was going to put the following on the wiki tentatively: ## Reviewing Code Community code review is Spark's fundamental quality assurance process. When reviewing a patch, your goal should be to help streamline the committing process by giving committers confidence this patch has been verified by an additional party. It's encouraged to (politely) submit technical feedback to the author to identify areas for improvement or potential bugs. If you feel a patch is ready for inclusion in Spark, indicate this to committers with a comment: I think this patch looks good. Spark uses the LGTM convention for indicating the highest level of technical sign-off on a patch: simply comment with the word LGTM. An LGTM is a strong statement, it should be interpreted as the following: I've looked at this thoroughly and take as much ownership as if I wrote the patch myself. If you comment LGTM you will be expected to help with bugs or follow-up issues on the patch. Judicious use of LGTM's is a great way to gain credibility as a reviewer with the broader community. It's also welcome for reviewers to argue against the inclusion of a feature or patch. Simply indicate this in the comments. - Patrick On Mon, Jan 19, 2015 at 2:40 AM, Prashant Sharma scrapco...@gmail.com wrote: Patrick's original proposal LGTM :). However until now, I have been in the impression of LGTM with special emphasis on TM part. That said, I will be okay/happy(or Responsible ) for the patch, if it goes in. Prashant Sharma On Sun, Jan 18, 2015 at 2:33 PM, Reynold Xin r...@databricks.com wrote: Maybe just to avoid LGTM as a single token when it is not actually according to Patrick's definition, but anybody can still leave comments like: The direction of the PR looks good to me. or +1 on the direction The build part looks good to me ... 
On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout k...@eecs.berkeley.edu wrote: +1 to Patrick's proposal of strong LGTM semantics. On past projects, I've heard the semantics of LGTM expressed as I've looked at this thoroughly and take as much ownership as if I wrote the patch myself. My understanding is that this is the level of review we expect for all patches that ultimately go into Spark, so it's important to have a way to concisely describe when this has been done. Aaron / Sandy, when have you found the weaker LGTM to be useful? In the cases I've seen, if someone else says I looked at this very quickly and didn't see any glaring problems, it doesn't add any value for subsequent reviewers (someone still needs to take a thorough look). -Kay On Sat, Jan 17, 2015 at 8:04 PM, sandy.r...@cloudera.com wrote: Yeah, the ASF +1 has become partly overloaded to mean both I would like to see this feature and this patch should be committed, although, at least in Hadoop, using +1 on JIRA (as opposed to, say, in a release vote) should unambiguously mean the latter unless qualified in some other way. I don't have any opinion on the specific characters, but I agree with Aaron that it would be nice to have some sort of abbreviation for both the strong and weak forms of approval. -Sandy On Jan 17, 2015, at 7:25 PM, Patrick Wendell pwend...@gmail.com wrote: I think the ASF +1 is *slightly* different than Google's LGTM, because it might convey wanting the patch/feature to be merged but not necessarily saying you did a thorough review and stand behind it's technical contents. For instance, I've seen people pile on +1's to try and indicate support for a feature or patch in some projects, even though they didn't do a thorough technical review. This +1 is definitely a useful mechanism. There is definitely much overlap though in the meaning, though, and it's largely because Spark had it's own culture around reviews before it was donated to the ASF, so there is a mix of two styles. 
Nonetheless, I'd prefer to stick with the stronger LGTM semantics I proposed originally (unlike the one Sandy proposed, e.g.). This is what I've seen every project using the LGTM convention do (Google, and some open source projects such as Impala) to indicate technical sign-off. - Patrick On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson ilike...@gmail.com wrote: I think I've seen something like +2 = strong LGTM and +1 = weak LGTM; someone else should review before. It's nice to have a shortcut which isn't a sentence when talking about weaker forms of LGTM. On Sat, Jan 17, 2015 at 6:59 PM, sandy.r...@cloudera.com wrote: I think clarifying these semantics is definitely worthwhile. Maybe this
Re: Semantics of LGTM
Okay - so given all this I was going to put the following on the wiki tentatively: ## Reviewing Code Community code review is Spark's fundamental quality assurance process. When reviewing a patch, your goal should be to help streamline the committing process by giving committers confidence this patch has been verified by an additional party. It's encouraged to (politely) submit technical feedback to the author to identify areas for improvement or potential bugs. If you feel a patch is ready for inclusion in Spark, indicate this to committers with a comment: I think this patch looks good. Spark uses the LGTM convention for indicating the highest level of technical sign-off on a patch: simply comment with the word LGTM. An LGTM is a strong statement, it should be interpreted as the following: I've looked at this thoroughly and take as much ownership as if I wrote the patch myself. If you comment LGTM you will be expected to help with bugs or follow-up issues on the patch. Judicious use of LGTM's is a great way to gain credibility as a reviewer with the broader community. It's also welcome for reviewers to argue against the inclusion of a feature or patch. Simply indicate this in the comments. - Patrick On Mon, Jan 19, 2015 at 2:40 AM, Prashant Sharma scrapco...@gmail.com wrote: Patrick's original proposal LGTM :). However until now, I have been in the impression of LGTM with special emphasis on TM part. That said, I will be okay/happy(or Responsible ) for the patch, if it goes in. Prashant Sharma On Sun, Jan 18, 2015 at 2:33 PM, Reynold Xin r...@databricks.com wrote: Maybe just to avoid LGTM as a single token when it is not actually according to Patrick's definition, but anybody can still leave comments like: The direction of the PR looks good to me. or +1 on the direction The build part looks good to me ... On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout k...@eecs.berkeley.edu wrote: +1 to Patrick's proposal of strong LGTM semantics. 
On past projects, I've heard the semantics of LGTM expressed as I've looked at this thoroughly and take as much ownership as if I wrote the patch myself. My understanding is that this is the level of review we expect for all patches that ultimately go into Spark, so it's important to have a way to concisely describe when this has been done. Aaron / Sandy, when have you found the weaker LGTM to be useful? In the cases I've seen, if someone else says I looked at this very quickly and didn't see any glaring problems, it doesn't add any value for subsequent reviewers (someone still needs to take a thorough look). -Kay On Sat, Jan 17, 2015 at 8:04 PM, sandy.r...@cloudera.com wrote: Yeah, the ASF +1 has become partly overloaded to mean both I would like to see this feature and this patch should be committed, although, at least in Hadoop, using +1 on JIRA (as opposed to, say, in a release vote) should unambiguously mean the latter unless qualified in some other way. I don't have any opinion on the specific characters, but I agree with Aaron that it would be nice to have some sort of abbreviation for both the strong and weak forms of approval. -Sandy On Jan 17, 2015, at 7:25 PM, Patrick Wendell pwend...@gmail.com wrote: I think the ASF +1 is *slightly* different than Google's LGTM, because it might convey wanting the patch/feature to be merged but not necessarily saying you did a thorough review and stand behind it's technical contents. For instance, I've seen people pile on +1's to try and indicate support for a feature or patch in some projects, even though they didn't do a thorough technical review. This +1 is definitely a useful mechanism. There is definitely much overlap though in the meaning, though, and it's largely because Spark had it's own culture around reviews before it was donated to the ASF, so there is a mix of two styles. Nonetheless, I'd prefer to stick with the stronger LGTM semantics I proposed originally (unlike the one Sandy proposed, e.g.). 
This is what I've seen every project using the LGTM convention do (Google, and some open source projects such as Impala) to indicate technical sign-off. - Patrick On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson ilike...@gmail.com wrote: I think I've seen something like +2 = strong LGTM and +1 = weak LGTM; someone else should review before. It's nice to have a shortcut which isn't a sentence when talking about weaker forms of LGTM. On Sat, Jan 17, 2015 at 6:59 PM, sandy.r...@cloudera.com wrote: I think clarifying these semantics is definitely worthwhile. Maybe this complicates the process with additional terminology, but the way I've used these has been: +1 - I think this is safe to merge and, barring objections from others,
GraphX vertex partition/location strategy
Does GraphX make an effort to co-locate vertices onto the same workers as the majority (or even some) of its edges?