Re: Handling stale PRs
Hey Nicholas, Thanks for bringing this up. There are a few dimensions to this... one is that it's actually procedurally difficult for us to close pull requests. I've proposed several different solutions to ASF infra to streamline the process, but thus far they haven't been open to any of my ideas: https://issues.apache.org/jira/browse/INFRA-7918 https://issues.apache.org/jira/browse/INFRA-8241 The more important thing, maybe, is how we want to deal with this culturally. And I think we need to do a better job of making sure no pull requests go unattended (i.e. waiting for committer feedback). If patches go stale, it should be because the user hasn't responded, not us. Another thing is that we should, IMO, err on the side of explicitly saying "no" or "not yet" to patches, rather than letting them linger without attention. We do get patches where the user is well intentioned, but the feature doesn't make sense to add, or isn't well thought out or explained, or the review effort would be so large that it's not within our capacity to look at just yet. Most other ASF projects I know just ignore these patches. I'd prefer if we took the approach of politely explaining why, in its current form, the patch isn't acceptable, and closing it (potentially w/ tips on how to improve it or narrow the scope). - Patrick On Mon, Aug 25, 2014 at 9:57 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Nicholas, In general we've been looking at these periodically (at least I have) and asking people to close out-of-date ones, but it's true that the list has gotten fairly large. We should probably have an expiry time of a few months and close them automatically. I agree that it's daunting to see so many open PRs. Matei On August 25, 2014 at 9:03:09 PM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote: Check this out: https://github.com/apache/spark/pulls?q=is%3Aopen+is%3Apr+sort%3Aupdated-asc We're hitting close to 300 open PRs. Those are the least recently updated ones. I think having a low number of stale (i.e. not recently updated) PRs is a good thing to shoot for. It doesn't leave contributors hanging (which feels bad for contributors), and it reduces project clutter (which feels bad for maintainers/committers). What is our approach to tackling this problem? I think communicating and enforcing a clear policy on how stale PRs are handled might be a good way to reduce the number of stale PRs we have without making contributors feel rejected. I don't know what such a policy would look like, but it should be enforceable and lightweight--i.e. it shouldn't feel like a hammer used to reject people's work, but rather a necessary tool to keep the project's contributions relevant and manageable. Nick
Re: [Spark SQL] off-heap columnar store
What would be the timeline for the parquet caching work? The reason I'm asking about the columnar compressed format is that there are some problems for which Parquet is not practical. On Mon, Aug 25, 2014 at 1:13 PM, Michael Armbrust mich...@databricks.com wrote: What is the plan for getting Tachyon/off-heap support for the columnar compressed store? It's not in 1.1, is it? It is not in 1.1 and there are no concrete plans for adding it at this point. Currently, there is more engineering investment going into caching parquet data in Tachyon instead. This approach is going to have much better support for nested data, leverages other work being done on parquet, and alleviates your concerns about wire format compatibility. That said, if someone really wants to try and implement it, I don't think it would be very hard. The primary issue is going to be designing a clean interface that is not too tied to this one implementation. Also, how likely is the wire format for the columnar compressed data to change? That would be a problem for write-through or persistence. We aren't making any guarantees at the moment that it won't change. It's currently only intended for temporary caching of data.
Re: [Spark SQL] off-heap columnar store
Is there any initial proposal or design for the Tachyon caching that you can share so far? Caching parquet files in Tachyon with saveAsParquetFile and then reading them with parquetFile should already work. You can use SQL on these tables by using registerTempTable. Some of the general parquet work that we have been doing includes: #1935 https://github.com/apache/spark/pull/1935, SPARK-2721 https://issues.apache.org/jira/browse/SPARK-2721, SPARK-3036 https://issues.apache.org/jira/browse/SPARK-3036, SPARK-3037 https://issues.apache.org/jira/browse/SPARK-3037 and #1819 https://github.com/apache/spark/pull/1819 The reason I'm asking about the columnar compressed format is that there are some problems for which Parquet is not practical. Can you elaborate?
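[Editor's note: for reference, a minimal sketch of the flow Michael describes, assuming a Tachyon master is reachable at a tachyon:// URI; the path and table names below are hypothetical.]

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val orders = hiveContext.sql("SELECT * FROM some_table")

// Write the data out as parquet on Tachyon's in-memory filesystem.
orders.saveAsParquetFile("tachyon://master:19998/cached/orders")

// Later (possibly from another application): read it back and
// register it so SQL can be run against it.
val cached = hiveContext.parquetFile("tachyon://master:19998/cached/orders")
cached.registerTempTable("orders_cached")
hiveContext.sql("SELECT count(*) FROM orders_cached").collect()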
CoHadoop Papers
One of my colleagues has been questioning me as to why Spark/HDFS makes no attempt to co-locate related data blocks. He pointed to this paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the CoHadoop research and the performance improvements it yielded for Map/Reduce jobs. Would leveraging these ideas for writing data from Spark make sense/be worthwhile?
Re: Handling stale PRs
On 08/26/2014 04:57 AM, Sean Owen wrote: On Tue, Aug 26, 2014 at 7:02 AM, Patrick Wendell pwend...@gmail.com wrote: Most other ASF projects I know just ignore these patches. I'd prefer if we Agree, this drives me crazy. It kills part of JIRA's usefulness. Spark is blessed/cursed with incredible inbound load, but would love to still see the project get this right-er than, say, Hadoop. totally agree, this applies to patches as well as jiras. i'll add that projects that let things simply linger are missing an opportunity to engage their community. spark should capitalize on its momentum to build a smoothly running community (vs not, and accept an unbounded backlog as inevitable). The more important thing, maybe, is how we want to deal with this culturally. And I think we need to do a better job of making sure no pull requests go unattended (i.e. waiting for committer feedback). If patches go stale, it should be because the user hasn't responded, not us. Stale JIRAs are a symptom, not a problem per se. I also want to see the backlog cleared, but automatically closing doesn't help if the problem is too many JIRAs and not enough committer-hours to look at them. Some noise gets closed, but some easy or important fixes may disappear as well. engagement in the community really needs to go both ways. it's reasonable for PRs that stop merging or have open comments that need resolution by the PRer to be loudly timed out. a similar thing goes for jiras: if there's a request for more information to resolve a bug and that information does not appear, half of the communication is gone and a loud time out is reasonable. "easy" and "important" are in the eyes of the beholder. timeouts can go both ways. a jira or pr that has been around for a period of time (say 1/3 of the to-close timeout) should bump up for evaluation, hopefully resulting in few easy or important issues falling through the cracks. fyi, i'm periodically going through the pyspark jiras, trying to reproduce issues, coalesce duplicates and ask for details. i've not been given any sort of permission to do this, and i don't have any special position in the community to do this - in a well functioning community everyone should feel free to jump in and help. Another thing is that we should, IMO, err on the side of explicitly saying no or not yet to patches, rather than letting them linger without attention. We do get patches where the user is well intentioned, but it is Completely agree. The solution is partly more supply of committer time on JIRAs. But that is detracting from the work the committers themselves want to do. More of the solution is reducing demand by helping people create useful, actionable, non-duplicate JIRAs from the start. Or encouraging people to resolve existing JIRAs and shepherding those in. saying no/not-yet is a vitally important piece of information. Elsewhere, I've found people reluctant to close JIRAs for fear of offending or turning off contributors. I think the opposite is true. There is nothing wrong with "no" or "not now", especially accompanied with constructive feedback. Better to state for the record what is not being looked at and why, than let people work on and open the same JIRAs repeatedly. well stated! I have also found in the past that a culture of tolerating eternal JIRAs led people to file JIRAs in order to forget about a problem -- it's in JIRA, and it's in progress, so it feels like someone else is going to fix it later and so it can be forgotten now.
there's some value in these now-i-can-forget jiras, though i'm not personally a fan. it can be good to keep them around and reachable by search, but they should be clearly marked as no/not-yet or something similar. For what it's worth, I think these project and culture mechanics are so important and it's my #1 concern for Spark at this stage. This challenge exists so much more here, exactly because there is so much potential. I'd love to help by trying to identify and close stale JIRAs but am afraid that tagging them is just adding to the heap of work. +1 concern and potential! best, matt
Re: CoHadoop Papers
It appears support for this type of control over block placement is going out in the next version of HDFS: https://issues.apache.org/jira/browse/HDFS-2576 On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com wrote: One of my colleagues has been questioning me as to why Spark/HDFS makes no attempts to try to co-locate related data blocks. He pointed to this paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the CoHadoop research and the performance improvements it yielded for Map/Reduce jobs. Would leveraging these ideas for writing data from Spark make sense/be worthwhile?
Re: too many CancelledKeyExceptions thrown from ConnectionManager
Hi Shengzhe, I have faced the same situation. I think Connection and ConnectionManager have some race condition issues, and the error you mentioned may be caused by them. I'm now trying to resolve the issue in https://github.com/apache/spark/pull/2019. Please check it out. - Kousuke (2014/08/26 8:53), yao wrote: Hi Folks, We are testing our home-made KMeans algorithm using Spark on Yarn. Recently, we've found that the application failed frequently when doing clustering over 300,000,000 users (each user is represented by a feature vector and the whole data set is around 600,000,000). After digging into the job log, we've found that there are many CancelledKeyExceptions thrown by ConnectionManager, but we have not observed other exceptions. We suspect the frequent CancelledKeyExceptions bring the whole application down, since the application often failed on the third or fourth iteration for large datasets. Any directional suggestions are welcome. *Errors in job log*: java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363) at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116) 14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,43199) 14/08/25 19:04:32 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found 14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@2570cd62 14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@2570cd62 java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363) at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116) 14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727) 14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727) 14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727) 14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@37c8b85a 14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@37c8b85a java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:287) at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116) 14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913) 14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913) 14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@fcea3a4 14/08/25 19:04:32 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found 14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@fcea3a4 Best Shengzhe
Re: Handling stale PRs
Sean Owen wrote: Stale JIRAs are a symptom, not a problem per se. I also want to see the backlog cleared, but automatically closing doesn't help, if the problem is too many JIRAs and not enough committer-hours to look at them. Some noise gets closed, but some easy or important fixes may disappear as well. Agreed. All of the problems mentioned in this thread are symptoms. There's no shortage of talent and enthusiasm within the Spark community. The people and the product are wonderful. The process: not so much. Spark has been wildly successful, so some growing pains are to be expected. With 100+ contributors, Spark is a big project. As with big data, big projects can run into scaling issues. There's no magic to running a successful big project, but it does require greater planning and discipline. JIRA is great for issue tracking, but it's not a replacement for a project plan. Quarterly releases are a great idea; everyone knows the schedule. What we need is a concise plan for each release with a clear scope statement. Without knowing what is in scope and out of scope for a release, we end up with a laundry list of things to do, but no clear goal. Laundry lists don't scale well. I don't mind helping with planning and documenting releases. This is especially helpful for new contributors who don't know where to start. I have done this successfully on many projects using Jira and Confluence, so I know it can be done. To address the immediate concerns of open PRs and excessive, overlapping Jira issues, we probably have to create a meta issue and assign resources to fix it. I don't mind helping with that also. -- Madhu https://www.linkedin.com/in/msiddalingaiah
Re: Handling stale PRs
- Original Message - Another thing is that we should, IMO, err on the side of explicitly saying no or not yet to patches, rather than letting them linger without attention. We do get patches where the user is well intentioned, but it is Completely agree. The solution is partly more supply of committer time on JIRAs. But that is detracting from the work the committers themselves want to do. More of the solution is reducing demand by helping people create useful, actionable, non-duplicate JIRAs from the start. Or encouraging people to resolve existing JIRAs and shepherding those in. saying no/not-yet is a vitally important piece of information. +1, when I propose a contribution to a project, I consider an articulate (and hopefully polite) no thanks, or not-yet, or needs-work, to be far more useful and pleasing than just radio silence. It allows me to either address feedback, or just move on. Although it takes effort to keep abreast of community contributions, I don't think it needs to be an overbearing or heavy-weight process. I've seen other communities where they talked themselves out of better management because they conceived the ticket workflow as being more effort than it needed to be. Much better to keep ticket triage and workflow fast/simple, but actually do it. Elsewhere, I've found people reluctant to close JIRA for fear of offending or turning off contributors. I think the opposite is true. There is nothing wrong with no or not now especially accompanied with constructive feedback. Better to state for the record what is not being looked at and why, than let people work on and open the same JIRAs repeatedly. well stated! I have also found in the past that a culture of tolerating eternal JIRAs led people to file JIRAs in order to forget about a problem -- it's in JIRA, and it's in progress, so it feels like someone else is going to fix it later and so it can be forgotten now. there's some value in these now-i-can-forget jira, though i'm not personally a fan. it can be good to keep them around and reachable by search, but they should be clearly marked as no/not-yet or something similar. For what it's worth, I think these project and culture mechanics are so important and it's my #1 concern for Spark at this stage. This challenge exists so much more here, exactly because there is so much potential. I'd love to help by trying to identify and close stale JIRAs but am afraid that tagging them is just adding to the heap of work. +1 concern and potential! best, matt
Re: CoHadoop Papers
Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS? If the former, Spark does support copartitioning. If the latter, it's an HDFS scope that's outside of Spark. On that note, Hadoop does also make attempts to collocate data, e.g., rack awareness. I'm sure the paper makes useful contributions for its set of use cases. Sent while mobile. Pls excuse typos etc. On Aug 26, 2014 5:21 AM, Gary Malouf malouf.g...@gmail.com wrote: It appears support for this type of control over block placement is going out in the next version of HDFS: https://issues.apache.org/jira/browse/HDFS-2576 On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com wrote: One of my colleagues has been questioning me as to why Spark/HDFS makes no attempts to try to co-locate related data blocks. He pointed to this paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the CoHadoop research and the performance improvements it yielded for Map/Reduce jobs. Would leveraging these ideas for writing data from Spark make sense/be worthwhile?
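[Editor's note: to illustrate the copartitioning Christopher mentions, here is a minimal sketch with the core RDD API; the keys and values are placeholders. When two pair RDDs share the same partitioner, Spark treats the join as narrow and does not shuffle either side again.]

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(64)

// Hash-partition both datasets by the common key once, up front.
val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(partitioner).cache()
val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(partitioner).cache()

// Matching keys are already in the same partition, so this join
// requires no additional shuffle.
val joined = left.join(right)

[Note that this only aligns partitions; co-location of the blocks on the same server, as in the CoHadoop paper, is a separate HDFS-level concern.]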
HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...
Is there any automatic dataType conversion or detection in HiveContext? All columns of a table are defined as string in the hive metastore. One column is total_price, with values like 123.45, and this column gets recognized as dataType Float in HiveContext... Is this a feature or a bug? It really surprised me... How is it implemented? If it is a feature, can I turn it off? I want to get a schemaRDD with exactly the same datatypes as defined in the hive metadata. I know the column total_price should contain float values, but they might not, and what happens if there is some broken line in my huge CSV file? Or maybe some total_price is 9,123.45 or $123.45 or something. == Some example of this in our env: MapR v3 cluster, newest spark github master clone from yesterday, built with sbt/sbt -Dhadoop.version=1.0.3-mapr-3.0.3 -Phive assembly, hive-site.xml configured == spark-shell scripts: val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) hiveContext.sql("use our_live_db") hiveContext.sql("desc formatted et_fullorders").collect.foreach(println) ... ... 14/08/26 15:47:09 INFO SparkContext: Job finished: collect at SparkPlan.scala:85, took 0.0305408 s [# col_name data_type comment ] [] [sid string from deserializer ] [request_id string from deserializer ] [*times_dq string* from deserializer ] [*total_price string* from deserializer ] [order_id string from deserializer ] [] [# Partition Information ] [# col_name data_type comment ] [] [wt_date string None] [country string None] [] [# Detailed Table Information] [Database: our_live_db] [Owner: client02 ] [CreateTime: Fri Jan 31 12:23:40 CET 2014 ] [LastAccessTime: UNKNOWN ] [Protect Mode: None ] [Retention: 0] [Location: maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders ] [Table Type: EXTERNAL_TABLE ] [Table Parameters: ] [ EXTERNAL TRUE] [ transient_lastDdlTime 1391167420 ] [] [# Storage Information ] [SerDe Library: com.bizo.hive.serde.csv.CSVSerde ] [InputFormat: org.apache.hadoop.mapred.TextInputFormat ] [OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat ] [Compressed: No ] [Num Buckets: -1 ] [Bucket Columns: [] ] [Sort Columns: [] ] [Storage Desc Params: ] [ separatorChar ; ] [ serialization.format 1 ] Then, create a schemaRDD from this table: val result = hiveContext.sql("select sid, order_id, total_price, times_dq from et_fullorders where wt_date='2014-04-14' and country='uk' limit 5") ok now, printSchema... scala> result.printSchema root |-- sid: string (nullable = true) |-- order_id: string (nullable = true) |-- *total_price: float* (nullable = true) |-- *times_dq: timestamp* (nullable = true) total_price was STRING but in the schemaRDD it is now FLOAT, and times_dq is now TIMESTAMP. Really strange and surprising...
And more strange: scala> result.map(row => row.getString(2)).collect.foreach(println) gives 240.00 45.83 21.67 95.83 120.83 but scala> result.map(row => row.getFloat(2)).collect.foreach(println) throws 14/08/26 16:01:24 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 8) java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float at scala.runtime.BoxesRunTime.unboxToFloat(BoxesRunTime.java:114) == btw, the files in this external table are gzipped csv files: 14/08/26 15:49:56 INFO HadoopRDD: Input split: maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders/wt_date=2014-04-14/country=uk/getFullOrders_2014-04-14.csv.gz:0+16990 and the data in it: scala> result.collect.foreach(println) [51402123123,12344000123454,240.00,2014-04-14 00:03:49.082000] [51402110123,12344000123455,45.83,2014-04-14 00:04:13.639000] [51402129123,12344000123458,21.67,2014-04-14 00:09:12.276000] [51402092123,12344000132457,95.83,2014-04-14 00:09:42.228000] [51402135123,12344000123460,120.83,2014-04-14 00:12:44.742000] We use CSVSerDe https://drone.io/github.com/ogrodnek/csv-serde/files/target/csv-serde-1.1.2-0.11.0-all.jar maybe
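[Editor's note: one possible, untested workaround sketch while this is sorted out: cast the column explicitly in HiveQL so the declared schema and the runtime values agree.]

val result = hiveContext.sql(
  "select sid, order_id, cast(total_price as string) as total_price, times_dq " +
  "from et_fullorders where wt_date='2014-04-14' and country='uk' limit 5")
result.printSchema  // total_price should now report as string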
Re: Gradient descent and runMiniBatchSGD
Hi Alexander, Can you post a link to the code? RJ On Tue, Aug 26, 2014 at 6:53 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, I've implemented the back propagation algorithm using the Gradient class and a simple update using the Updater class. Then I run the algorithm with mllib's GradientDescent class. I am having trouble scaling out this implementation. I thought that if I partitioned my data into the number of workers then performance would increase, because each worker would run a step of gradient descent on its partition of the data. But this does not happen, and each worker seems to process all the data (if miniBatchFraction == 1.0, as in mllib's logistic regression implementation). For me, this doesn't make sense, because then a single Worker would provide the same performance. Could someone elaborate on this and correct me if I am wrong? How can I scale out the algorithm with many Workers? Best regards, Alexander -- em rnowl...@gmail.com c 954.496.2314
Re: Handling stale PRs
On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com wrote: I'd prefer if we took the approach of politely explaining why in the current form the patch isn't acceptable and closing it (potentially w/ tips on how to improve it or narrow the scope). Amen to this. Aiming for such a culture would set Spark apart from other projects in a great way. I've proposed several different solutions to ASF infra to streamline the process, but thus far they haven't been open to any of my ideas: I've added myself as a watcher on those 2 INFRA issues. Sucks that the only solution on offer right now requires basically polluting the commit history. Short of moving Spark's repo to a non-ASF-managed GitHub account, do you think another bot could help us manage the number of stale PRs? I'm thinking a solution as follows might be very helpful: - Extend Spark QA / Jenkins to run on a weekly schedule and check for stale PRs. Let's say a stale PR is an open one that hasn't been updated in N months. - Spark QA maintains a list of known committers on its side. - During its weekly check of stale PRs, Spark QA takes the following action: - If the last person to comment on a PR was a committer, post to the PR asking for an update from the contributor. - If the last person to comment on a PR was a contributor, add the PR to a list. Email this list of *hanging PRs* out to the dev list on a weekly basis and ask committers to update them. - If the last person to comment on a PR was Spark QA asking the contributor to update it, then add the PR to a list. Email this list of *abandoned PRs* to the dev list for the record (or for closing, if that becomes possible in the future). This doesn't solve the problem of not being able to close PRs, but it does help make sure no PR is left hanging for long. What do you think? I'd be interested in implementing this solution if we like it. Nick
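[Editor's note: to make the proposed rules concrete, here is a minimal sketch of the triage decision, assuming the PR metadata has already been fetched via the GitHub API; all names here are hypothetical.]

import java.time.{Duration, Instant}

sealed trait Action
case object AskContributorForUpdate extends Action // last comment was by a committer
case object EmailAsHanging extends Action          // last comment was by a contributor
case object EmailAsAbandoned extends Action        // bot already pinged, no reply
case object NoAction extends Action

case class PullRequest(lastCommenter: String, lastUpdated: Instant)

def triage(pr: PullRequest, committers: Set[String], botUser: String,
           staleAfter: Duration, now: Instant): Action =
  if (Duration.between(pr.lastUpdated, now).compareTo(staleAfter) < 0) NoAction
  else if (pr.lastCommenter == botUser) EmailAsAbandoned
  else if (committers.contains(pr.lastCommenter)) AskContributorForUpdate
  else EmailAsHanging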
Re: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...
Oops, I tried on a managed table and the column types are not changed, so it is most likely due to the SerDe lib CSVSerDe (https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L123), or maybe CSVReader from opencsv?... But if the columns are defined as string, then no matter what type is returned from the custom SerDe or CSVReader, they should be cast to string at the end, right? Why not use the schema from the hive metadata directly?
Re: CoHadoop Papers
Christopher, can you expand on the co-partitioning support? We have a number of spark SQL tables (saved in parquet format) that all could be considered to have a common hash key. Our analytics team wants to do frequent joins across these different data-sets based on this key. It makes sense that if the data for each key across 'tables' was co-located on the same server, shuffles could be minimized and ultimately performance could be much better. From reading the HDFS issue I posted before, the way is being paved for implementing this type of behavior, though I believe there are a lot of complications to making it work. On Tue, Aug 26, 2014 at 10:40 AM, Christopher Nguyen c...@adatao.com wrote: Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS? If the former, Spark does support copartitioning. If the latter, it's an HDFS scope that's outside of Spark. On that note, Hadoop does also make attempts to collocate data, e.g., rack awareness. I'm sure the paper makes useful contributions for its set of use cases. Sent while mobile. Pls excuse typos etc. On Aug 26, 2014 5:21 AM, Gary Malouf malouf.g...@gmail.com wrote: It appears support for this type of control over block placement is going out in the next version of HDFS: https://issues.apache.org/jira/browse/HDFS-2576 On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com wrote: One of my colleagues has been questioning me as to why Spark/HDFS makes no attempts to try to co-locate related data blocks. He pointed to this paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the CoHadoop research and the performance improvements it yielded for Map/Reduce jobs. Would leveraging these ideas for writing data from Spark make sense/be worthwhile?
Re: Handling stale PRs
Last weekend, I started hacking on a Google App Engine app for helping with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png). Some of my basic goals (not all implemented yet): - Users sign in using GitHub and can browse a list of pull requests, including links to associated JIRAs, Jenkins statuses, a quick preview of the last comment, etc. - Pull requests are auto-classified based on which components they modify (by looking at the diff). - From the app’s own internal database of PRs, we can build dashboards to find “abandoned” PRs, graph average time to first review, etc. - Since we authenticate users with GitHub, we can enable administrative functions via this dashboard (e.g. “assign this PR to me”, “vote to close in the weekly auto-close commit”, etc.). Right now, I’ve implemented GitHub OAuth support and code to update the issues database using the GitHub API. Because we have access to the full API, it’s pretty easy to do fancy things like parsing the reason for a Jenkins failure, etc. You could even imagine some fancy mashup tools to pull up JIRAs and pull requests side-by-side in iframes. After I hack on this a bit more, I plan to release a public preview version; if we find this tool useful, I’ll clean it up and open-source the app so folks can contribute to it. - Josh On August 26, 2014 at 8:16:46 AM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote: On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com wrote: I'd prefer if we took the approach of politely explaining why in the current form the patch isn't acceptable and closing it (potentially w/ tips on how to improve it or narrow the scope). Amen to this. Aiming for such a culture would set Spark apart from other projects in a great way. I've proposed several different solutions to ASF infra to streamline the process, but thus far they haven't been open to any of my ideas: I've added myself as a watcher on those 2 INFRA issues. Sucks that the only solution on offer right now requires basically polluting the commit history. Short of moving Spark's repo to a non-ASF-managed GitHub account, do you think another bot could help us manage the number of stale PRs? I'm thinking a solution as follows might be very helpful: - Extend Spark QA / Jenkins to run on a weekly schedule and check for stale PRs. Let's say a stale PR is an open one that hasn't been updated in N months. - Spark QA maintains a list of known committers on its side. - During its weekly check of stale PRs, Spark QA takes the following action: - If the last person to comment on a PR was a committer, post to the PR asking for an update from the contributor. - If the last person to comment on a PR was a contributor, add the PR to a list. Email this list of *hanging PRs* out to the dev list on a weekly basis and ask committers to update them. - If the last person to comment on a PR was Spark QA asking the contributor to update it, then add the PR to a list. Email this list of *abandoned PRs* to the dev list for the record (or for closing, if that becomes possible in the future). This doesn't solve the problem of not being able to close PRs, but it does help make sure no PR is left hanging for long. What do you think? I'd be interested in implementing this solution if we like it. Nick
Re: Handling stale PRs
OK, that sounds pretty cool. Josh, Do you see this app as encompassing or supplanting the functionality I described as well? Nick On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote: Last weekend, I started hacking on a Google App Engine app for helping with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png). Some of my basic goals (not all implemented yet): - Users sign in using GitHub and can browse a list of pull requests, including links to associated JIRAs, Jenkins statuses, a quick preview of the last comment, etc. - Pull requests are auto-classified based on which components they modify (by looking at the diff). - From the app’s own internal database of PRs, we can build dashboards to find “abandoned” PRs, graph average time to first review, etc. - Since we authenticate users with GitHub, we can enable administrative functions via this dashboard (e.g. “assign this PR to me”, “vote to close in the weekly auto-close commit”, etc. Right now, I’ve implemented GItHub OAuth support and code to update the issues database using the GitHub API. Because we have access to the full API, it’s pretty easy to do fancy things like parsing the reason for Jenkins failure, etc. You could even imagine some fancy mashup tools to pull up JIRAs and pull requests side-by in iframes. After I hack on this a bit more, I plan to release a public preview version; if we find this tool useful, I’ll clean it up and open-source the app so folks can contribute to it. - Josh On August 26, 2014 at 8:16:46 AM, Nicholas Chammas ( nicholas.cham...@gmail.com) wrote: On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com wrote: I'd prefer if we took the approach of politely explaining why in the current form the patch isn't acceptable and closing it (potentially w/ tips on how to improve it or narrow the scope). Amen to this. Aiming for such a culture would set Spark apart from other projects in a great way. I've proposed several different solutions to ASF infra to streamline the process, but thus far they haven't been open to any of my ideas: I've added myself as a watcher on those 2 INFRA issues. Sucks that the only solution on offer right now requires basically polluting the commit history. Short of moving Spark's repo to a non-ASF-managed GitHub account, do you think another bot could help us manage the number of stale PRs? I'm thinking a solution as follows might be very helpful: - Extend Spark QA / Jenkins to run on a weekly schedule and check for stale PRs. Let's say a stale PR is an open one that hasn't been updated in N months. - Spark QA maintains a list of known committers on its side. - During its weekly check of stale PRs, Spark QA takes the following action: - If the last person to comment on a PR was a committer, post to the PR asking for an update from the contributor. - If the last person to comment on a PR was a contributor, add the PR to a list. Email this list of *hanging PRs* out to the dev list on a weekly basis and ask committers to update them. - If the last person to comment on a PR was Spark QA asking the contributor to update it, then add the PR to a list. Email this list of *abandoned PRs* to the dev list for the record (or for closing, if that becomes possible in the future). This doesn't solve the problem of not being able to close PRs, but it does help make sure no PR is left hanging for long. What do you think? I'd be interested in implementing this solution if we like it. Nick
Re: Handling stale PRs
Sure; App Engine supports cron and sending emails. We can configure the app with Spark QA’s credentials in order to allow it to post comments on issues, etc. - Josh On August 26, 2014 at 11:38:08 AM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote: OK, that sounds pretty cool. Josh, Do you see this app as encompassing or supplanting the functionality I described as well? Nick On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote: Last weekend, I started hacking on a Google App Engine app for helping with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png). Some of my basic goals (not all implemented yet): - Users sign in using GitHub and can browse a list of pull requests, including links to associated JIRAs, Jenkins statuses, a quick preview of the last comment, etc. - Pull requests are auto-classified based on which components they modify (by looking at the diff). - From the app’s own internal database of PRs, we can build dashboards to find “abandoned” PRs, graph average time to first review, etc. - Since we authenticate users with GitHub, we can enable administrative functions via this dashboard (e.g. “assign this PR to me”, “vote to close in the weekly auto-close commit”, etc. Right now, I’ve implemented GItHub OAuth support and code to update the issues database using the GitHub API. Because we have access to the full API, it’s pretty easy to do fancy things like parsing the reason for Jenkins failure, etc. You could even imagine some fancy mashup tools to pull up JIRAs and pull requests side-by in iframes. After I hack on this a bit more, I plan to release a public preview version; if we find this tool useful, I’ll clean it up and open-source the app so folks can contribute to it. - Josh On August 26, 2014 at 8:16:46 AM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote: On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com wrote: I'd prefer if we took the approach of politely explaining why in the current form the patch isn't acceptable and closing it (potentially w/ tips on how to improve it or narrow the scope). Amen to this. Aiming for such a culture would set Spark apart from other projects in a great way. I've proposed several different solutions to ASF infra to streamline the process, but thus far they haven't been open to any of my ideas: I've added myself as a watcher on those 2 INFRA issues. Sucks that the only solution on offer right now requires basically polluting the commit history. Short of moving Spark's repo to a non-ASF-managed GitHub account, do you think another bot could help us manage the number of stale PRs? I'm thinking a solution as follows might be very helpful: - Extend Spark QA / Jenkins to run on a weekly schedule and check for stale PRs. Let's say a stale PR is an open one that hasn't been updated in N months. - Spark QA maintains a list of known committers on its side. - During its weekly check of stale PRs, Spark QA takes the following action: - If the last person to comment on a PR was a committer, post to the PR asking for an update from the contributor. - If the last person to comment on a PR was a contributor, add the PR to a list. Email this list of *hanging PRs* out to the dev list on a weekly basis and ask committers to update them. - If the last person to comment on a PR was Spark QA asking the contributor to update it, then add the PR to a list. Email this list of *abandoned PRs* to the dev list for the record (or for closing, if that becomes possible in the future). 
This doesn't solve the problem of not being able to close PRs, but it does help make sure no PR is left hanging for long. What do you think? I'd be interested in implementing this solution if we like it. Nick
Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing
I have both SPARK-2878 and SPARK-2893.
spark-ec2 1.0.2 creates EC2 cluster at wrong version
I downloaded the source code release for 1.0.2 from here http://spark.apache.org/downloads.html and launched an EC2 cluster using spark-ec2. After the cluster finishes launching, I fire up the shell and check the version: scala> sc.version res1: String = 1.0.1 The startup banner also shows the same thing. Hmm... So I dig around and find that the spark_ec2.py script has the default Spark version set to 1.0.1. Derp. parser.add_option("-v", "--spark-version", default="1.0.1", help="Version of Spark to use: 'X.Y.Z' or a specific git hash") Is there any way to fix the release? It’s a minor issue, but could be very confusing. And how can we prevent this from happening again? Nick
Re: spark-ec2 1.0.2 creates EC2 cluster at wrong version
This is a chicken-and-egg problem in some sense. We can't change the ec2 script till we have made the release and uploaded the binaries -- but once that is done, the script in the release can no longer be updated. I think the model we support so far is that you can launch the latest spark version from the master branch on github. I guess we can try to add something to the release process that updates the script but doesn't commit it? The release managers might be able to add more. Thanks Shivaram On Tue, Aug 26, 2014 at 1:16 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I downloaded the source code release for 1.0.2 from here http://spark.apache.org/downloads.html and launched an EC2 cluster using spark-ec2. After the cluster finishes launching, I fire up the shell and check the version: scala> sc.version res1: String = 1.0.1 The startup banner also shows the same thing. Hmm... So I dig around and find that the spark_ec2.py script has the default Spark version set to 1.0.1. Derp. parser.add_option("-v", "--spark-version", default="1.0.1", help="Version of Spark to use: 'X.Y.Z' or a specific git hash") Is there any way to fix the release? It’s a minor issue, but could be very confusing. And how can we prevent this from happening again? Nick
Re: Gradient descent and runMiniBatchSGD
Xiangrui, I posted a note on my JIRA for MiniBatch KMeans about the same problem -- sampling running in O(n). Can you elaborate on ways to get more efficient sampling? I think this will be important for a variety of stochastic algorithms. RJ On Tue, Aug 26, 2014 at 12:54 PM, Xiangrui Meng men...@gmail.com wrote: miniBatchFraction uses RDD.sample to get the mini-batch, and sample still needs to visit the elements one after another. So it is not efficient if the task is not computation heavy, and this is why setMiniBatchFraction is marked as experimental. If we can detect that the partition iterator is backed by an ArrayBuffer, maybe we can do a skip iterator to skip elements. -Xiangrui On Tue, Aug 26, 2014 at 8:15 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, RJ https://github.com/avulanov/spark/blob/neuralnetwork/mllib/src/main/scala/org/apache/spark/mllib/classification/NeuralNetwork.scala Unit tests are in the same branch. Alexander From: RJ Nowling [mailto:rnowl...@gmail.com] Sent: Tuesday, August 26, 2014 6:59 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Gradient descent and runMiniBatchSGD Hi Alexander, Can you post a link to the code? RJ On Tue, Aug 26, 2014 at 6:53 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, I've implemented back propagation algorithm using Gradient class and a simple update using Updater class. Then I run the algorithm with mllib's GradientDescent class. I have troubles in scaling out this implementation. I thought that if I partition my data into the number of workers then performance will increase, because each worker will run a step of gradient descent on its partition of data. But this does not happen and each worker seems to process all data (if miniBatchFraction == 1.0 as in mllib's logistic regression implementation). For me, this doesn't make sense, because then only a single Worker will provide the same performance. Could someone elaborate on this and correct me if I am wrong. How can I scale out the algorithm with many Workers? Best regards, Alexander -- em rnowl...@gmail.com c 954.496.2314 -- em rnowl...@gmail.com c 954.496.2314
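[Editor's note: to illustrate the skip-iterator idea Xiangrui describes, here is a rough sketch for an array-backed partition. This is not how RDD.sample works today; it assumes 0 < fraction < 1.]

import scala.util.Random

// Bernoulli sampling via geometric skips: instead of flipping a coin
// per element, jump straight to the next sampled index. Expected cost
// is O(fraction * n) rather than O(n).
def skipSample[T](data: IndexedSeq[T], fraction: Double, seed: Long): Iterator[T] = {
  require(fraction > 0.0 && fraction < 1.0)
  val rng = new Random(seed)
  // The gap to the next success under Bernoulli(fraction) is geometric.
  def gap(): Int = (math.log(1.0 - rng.nextDouble()) / math.log(1.0 - fraction)).toInt
  new Iterator[T] {
    private var i = gap()
    def hasNext: Boolean = i < data.length
    def next(): T = { val x = data(i); i += gap() + 1; x }
  }
}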
Re: Gradient descent and runMiniBatchSGD
Also, another idea: many algorithms that use sampling tend to do so multiple times. It may be beneficial to allow a transformation to a representation that is more efficient for multiple rounds of sampling. On Tue, Aug 26, 2014 at 4:36 PM, RJ Nowling rnowl...@gmail.com wrote: Xiangrui, I posted a note on my JIRA for MiniBatch KMeans about the same problem -- sampling running in O(n). Can you elaborate on ways to get more efficient sampling? I think this will be important for a variety of stochastic algorithms. RJ On Tue, Aug 26, 2014 at 12:54 PM, Xiangrui Meng men...@gmail.com wrote: miniBatchFraction uses RDD.sample to get the mini-batch, and sample still needs to visit the elements one after another. So it is not efficient if the task is not computation heavy and this is why setMiniBatchFraction is marked as experimental. If we can detect that the partition iterator is backed by an ArrayBuffer, maybe we can do a skip iterator to skip elements. -Xiangrui On Tue, Aug 26, 2014 at 8:15 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, RJ https://github.com/avulanov/spark/blob/neuralnetwork/mllib/src/main/scala/org/apache/spark/mllib/classification/NeuralNetwork.scala Unit tests are in the same branch. Alexander From: RJ Nowling [mailto:rnowl...@gmail.com] Sent: Tuesday, August 26, 2014 6:59 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Gradient descent and runMiniBatchSGD Hi Alexander, Can you post a link to the code? RJ On Tue, Aug 26, 2014 at 6:53 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, I've implemented back propagation algorithm using Gradient class and a simple update using Updater class. Then I run the algorithm with mllib's GradientDescent class. I have troubles in scaling out this implementation. I thought that if I partition my data into the number of workers then performance will increase, because each worker will run a step of gradient descent on its partition of data. But this does not happen and each worker seems to process all data (if miniBatchFraction == 1.0 as in mllib's logistic regression implementation). For me, this doesn't make sense, because then only a single Worker will provide the same performance. Could someone elaborate on this and correct me if I am wrong. How can I scale out the algorithm with many Workers? Best regards, Alexander -- em rnowl...@gmail.com c 954.496.2314 -- em rnowl...@gmail.com c 954.496.2314 -- em rnowl...@gmail.com c 954.496.2314
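[Editor's note: one cheap form of that transformation, reusing the skipSample sketch above; data here is a hypothetical RDD[Double]. Materialize each partition as an array once, cache it, and run the sampling rounds against the indexed representation.]

// Pay the O(n) materialization once, then sample repeatedly.
val arrays = data.glom().cache() // RDD[Array[Double]], one array per partition

// In practice the seed should vary per partition as well as per round.
val round1 = arrays.flatMap(a => skipSample(a, 0.1, 42L))
val round2 = arrays.flatMap(a => skipSample(a, 0.1, 43L))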
Re: Gradient descent and runMiniBatchSGD
Hi Xiangrui, Thanks for the explanation, but I'm still missing something. In my experiments, if miniBatchFraction == 1.0, no matter how the data is partitioned (2, 4, 8, 16 partitions), the algorithm executes in more or less the same time. (I have 16 Workers.) Reduce from runMiniBatchSGD takes most of the time for 2 partitions; mapPartitionsWithIndex does for 16. What I would expect is that the time decreases proportionally to the number of data partitions, because each partition will hopefully be processed on a separate Worker. Why doesn't the time decrease? Btw, processing one instance in my algorithm is a heavy computation; this is the exact reason why I want to parallelize it. Best regards, Alexander On 26.08.2014, at 20:54, Xiangrui Meng men...@gmail.com wrote: miniBatchFraction uses RDD.sample to get the mini-batch, and sample still needs to visit the elements one after another. So it is not efficient if the task is not computation heavy and this is why setMiniBatchFraction is marked as experimental. If we can detect that the partition iterator is backed by an ArrayBuffer, maybe we can do a skip iterator to skip elements. -Xiangrui On Tue, Aug 26, 2014 at 8:15 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, RJ https://github.com/avulanov/spark/blob/neuralnetwork/mllib/src/main/scala/org/apache/spark/mllib/classification/NeuralNetwork.scala Unit tests are in the same branch. Alexander From: RJ Nowling [mailto:rnowl...@gmail.com] Sent: Tuesday, August 26, 2014 6:59 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Gradient descent and runMiniBatchSGD Hi Alexander, Can you post a link to the code? RJ On Tue, Aug 26, 2014 at 6:53 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, I've implemented back propagation algorithm using Gradient class and a simple update using Updater class. Then I run the algorithm with mllib's GradientDescent class. I have troubles in scaling out this implementation. I thought that if I partition my data into the number of workers then performance will increase, because each worker will run a step of gradient descent on its partition of data. But this does not happen and each worker seems to process all data (if miniBatchFraction == 1.0 as in mllib's logistic regression implementation). For me, this doesn't make sense, because then only a single Worker will provide the same performance. Could someone elaborate on this and correct me if I am wrong. How can I scale out the algorithm with many Workers? Best regards, Alexander -- em rnowl...@gmail.com c 954.496.2314
Re: Handling stale PRs
By the way, as a reference point, I just stumbled across the Discourse GitHub project and their list of pull requests https://github.com/discourse/discourse/pulls looks pretty neat. ~2,200 closed PRs, 6 open. Least recently updated PR dates to 8 days ago. Project started ~1.5 years ago. Dunno how many committers Discourse has, but it looks like they've managed their PRs well. I hope we can do as well in this regard as they have. Nick On Tue, Aug 26, 2014 at 2:40 PM, Josh Rosen rosenvi...@gmail.com wrote: Sure; App Engine supports cron and sending emails. We can configure the app with Spark QA’s credentials in order to allow it to post comments on issues, etc. - Josh On August 26, 2014 at 11:38:08 AM, Nicholas Chammas ( nicholas.cham...@gmail.com) wrote: OK, that sounds pretty cool. Josh, Do you see this app as encompassing or supplanting the functionality I described as well? Nick On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote: Last weekend, I started hacking on a Google App Engine app for helping with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png). Some of my basic goals (not all implemented yet): - Users sign in using GitHub and can browse a list of pull requests, including links to associated JIRAs, Jenkins statuses, a quick preview of the last comment, etc. - Pull requests are auto-classified based on which components they modify (by looking at the diff). - From the app’s own internal database of PRs, we can build dashboards to find “abandoned” PRs, graph average time to first review, etc. - Since we authenticate users with GitHub, we can enable administrative functions via this dashboard (e.g. “assign this PR to me”, “vote to close in the weekly auto-close commit”, etc. Right now, I’ve implemented GItHub OAuth support and code to update the issues database using the GitHub API. Because we have access to the full API, it’s pretty easy to do fancy things like parsing the reason for Jenkins failure, etc. You could even imagine some fancy mashup tools to pull up JIRAs and pull requests side-by in iframes. After I hack on this a bit more, I plan to release a public preview version; if we find this tool useful, I’ll clean it up and open-source the app so folks can contribute to it. - Josh On August 26, 2014 at 8:16:46 AM, Nicholas Chammas ( nicholas.cham...@gmail.com) wrote: On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com wrote: I'd prefer if we took the approach of politely explaining why in the current form the patch isn't acceptable and closing it (potentially w/ tips on how to improve it or narrow the scope). Amen to this. Aiming for such a culture would set Spark apart from other projects in a great way. I've proposed several different solutions to ASF infra to streamline the process, but thus far they haven't been open to any of my ideas: I've added myself as a watcher on those 2 INFRA issues. Sucks that the only solution on offer right now requires basically polluting the commit history. Short of moving Spark's repo to a non-ASF-managed GitHub account, do you think another bot could help us manage the number of stale PRs? I'm thinking a solution as follows might be very helpful: - Extend Spark QA / Jenkins to run on a weekly schedule and check for stale PRs. Let's say a stale PR is an open one that hasn't been updated in N months. - Spark QA maintains a list of known committers on its side. 
- During its weekly check of stale PRs, Spark QA takes the following action: - If the last person to comment on a PR was a committer, post to the PR asking for an update from the contributor. - If the last person to comment on a PR was a contributor, add the PR to a list. Email this list of *hanging PRs* out to the dev list on a weekly basis and ask committers to update them. - If the last person to comment on a PR was Spark QA asking the contributor to update it, then add the PR to a list. Email this list of *abandoned PRs* to the dev list for the record (or for closing, if that becomes possible in the future). This doesn't solve the problem of not being able to close PRs, but it does help make sure no PR is left hanging for long. What do you think? I'd be interested in implementing this solution if we like it. Nick
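To make the proposal concrete, the weekly classification step could be as small as the following sketch. Everything here -- the PullRequest record, the bot user name, and the staleness threshold -- is a hypothetical placeholder, not a real GitHub API:

  sealed trait Action
  case object PingContributor extends Action  // committer spoke last; nudge the contributor
  case object ListAsHanging   extends Action  // contributor spoke last; waiting on a committer
  case object ListAsAbandoned extends Action  // the bot's earlier nudge went unanswered

  // Illustrative model only; a real implementation would populate this
  // from the GitHub API.
  case class PullRequest(number: Int, lastUpdatedMillis: Long, lastCommenter: String)

  def triage(pr: PullRequest, committers: Set[String], botUser: String,
             staleAfterMillis: Long, nowMillis: Long): Option[Action] = {
    val stale = nowMillis - pr.lastUpdatedMillis > staleAfterMillis
    if (!stale) None
    else if (pr.lastCommenter == botUser) Some(ListAsAbandoned)
    else if (committers.contains(pr.lastCommenter)) Some(PingContributor)
    else Some(ListAsHanging)
  }

Running triage over all open PRs once a week would yield exactly the ping / hanging / abandoned lists described above.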
Re: CoHadoop Papers
It seems like there are two things here: - Co-locating blocks with the same keys to avoid network transfer. - Leveraging partitioning information to avoid a shuffle when data is already partitioned correctly (even if those partitions aren't yet on the same machine). The former seems more complicated and probably requires the Hadoop support you linked to. However, the latter might be easier, as there is already a framework for reasoning about partitioning and the need to shuffle in the Spark SQL planner. On Tue, Aug 26, 2014 at 8:37 AM, Gary Malouf malouf.g...@gmail.com wrote: Christopher, can you expand on the co-partitioning support? We have a number of Spark SQL tables (saved in Parquet format) that all could be considered to have a common hash key. Our analytics team wants to do frequent joins across these different data sets based on this key. It makes sense that if the data for each key across 'tables' were co-located on the same server, shuffles could be minimized and ultimately performance could be much better. From reading the HDFS issue I posted before, the way is being paved for implementing this type of behavior, though I believe there are a lot of complications to making it work. On Tue, Aug 26, 2014 at 10:40 AM, Christopher Nguyen c...@adatao.com wrote: Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS? If the former, Spark does support co-partitioning. If the latter, it's an HDFS scope that's outside of Spark. On that note, Hadoop does also make attempts to co-locate data, e.g., rack awareness. I'm sure the paper makes useful contributions for its set of use cases. Sent while mobile. Pls excuse typos etc. On Aug 26, 2014 5:21 AM, Gary Malouf malouf.g...@gmail.com wrote: It appears support for this type of control over block placement is going out in the next version of HDFS: https://issues.apache.org/jira/browse/HDFS-2576 On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com wrote: One of my colleagues has been questioning me as to why Spark/HDFS makes no attempt to co-locate related data blocks. He pointed to this paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the CoHadoop research and the performance improvements it yielded for Map/Reduce jobs. Would leveraging these ideas for writing data from Spark make sense/be worthwhile?
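To make the second point concrete at the RDD level: once two datasets are hash-partitioned with the same partitioner, a join between them needs no further shuffle, because each output partition can be built from the identically numbered input partitions. A minimal sketch, with hypothetical paths and key extraction:

  import org.apache.spark.{HashPartitioner, SparkContext}
  import org.apache.spark.SparkContext._  // pair-RDD implicits (Spark 1.x)

  def joinWithoutReshuffle(sc: SparkContext): Long = {
    val partitioner = new HashPartitioner(128)

    // Key both 'tables' by the shared hash key, partition them identically,
    // and cache so the partitioning is preserved across jobs.
    val users = sc.textFile("hdfs:///tables/users")
      .map(line => (line.split(",")(0), line))
      .partitionBy(partitioner)
      .cache()

    val events = sc.textFile("hdfs:///tables/events")
      .map(line => (line.split(",")(0), line))
      .partitionBy(partitioner)
      .cache()

    // Both sides share the same partitioner, so this join uses narrow
    // dependencies and does not re-shuffle either input.
    users.join(events).count()
  }

Note that partitionBy itself shuffles once up front; the savings come when the same partitioned, cached datasets are joined repeatedly, as in the frequent-join workload described above.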
Re: CoHadoop Papers
Hi Michael, I think once that work is in HDFS, it will be great to expose this functionality via Spark. This is worth pursuing because it could yield order-of-magnitude performance improvements in cases where people need to join data. The second item would also be very interesting and could yield significant performance boosts. Best, Gary On Tue, Aug 26, 2014 at 6:50 PM, Michael Armbrust mich...@databricks.com wrote: It seems like there are two things here: - Co-locating blocks with the same keys to avoid network transfer. - Leveraging partitioning information to avoid a shuffle when data is already partitioned correctly (even if those partitions aren't yet on the same machine). The former seems more complicated and probably requires the Hadoop support you linked to. However, the latter might be easier, as there is already a framework for reasoning about partitioning and the need to shuffle in the Spark SQL planner. On Tue, Aug 26, 2014 at 8:37 AM, Gary Malouf malouf.g...@gmail.com wrote: Christopher, can you expand on the co-partitioning support? We have a number of Spark SQL tables (saved in Parquet format) that all could be considered to have a common hash key. Our analytics team wants to do frequent joins across these different data sets based on this key. It makes sense that if the data for each key across 'tables' were co-located on the same server, shuffles could be minimized and ultimately performance could be much better. From reading the HDFS issue I posted before, the way is being paved for implementing this type of behavior, though I believe there are a lot of complications to making it work. On Tue, Aug 26, 2014 at 10:40 AM, Christopher Nguyen c...@adatao.com wrote: Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS? If the former, Spark does support co-partitioning. If the latter, it's an HDFS scope that's outside of Spark. On that note, Hadoop does also make attempts to co-locate data, e.g., rack awareness. I'm sure the paper makes useful contributions for its set of use cases. Sent while mobile. Pls excuse typos etc. On Aug 26, 2014 5:21 AM, Gary Malouf malouf.g...@gmail.com wrote: It appears support for this type of control over block placement is going out in the next version of HDFS: https://issues.apache.org/jira/browse/HDFS-2576 On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com wrote: One of my colleagues has been questioning me as to why Spark/HDFS makes no attempt to co-locate related data blocks. He pointed to this paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the CoHadoop research and the performance improvements it yielded for Map/Reduce jobs. Would leveraging these ideas for writing data from Spark make sense/be worthwhile?
OutOfMemoryError when running sbt/sbt test
Hi Spark, I've been trying to build Spark, but I've been getting lots of OOME exceptions: https://gist.github.com/jayunit100/d424b6b825ce8517d68c For the most part, they are of the form: java.lang.OutOfMemoryError: unable to create new native thread I've attempted to hard-code the get_mem_opts function in the sbt-launch-lib.bash file to use various very high memory settings (e.g. -Xms5g) with a high MaxPermSize, etc., but to no avail. Any thoughts on this would be appreciated. I know of others having the same problem as well. Thanks! -- jay vyas
Re: OutOfMemoryError when running sbt/sbt test
What is your ulimit value? On Tue, Aug 26, 2014 at 5:49 PM, jay vyas jayunit100.apa...@gmail.com wrote: Hi Spark, I've been trying to build Spark, but I've been getting lots of OOME exceptions: https://gist.github.com/jayunit100/d424b6b825ce8517d68c For the most part, they are of the form: java.lang.OutOfMemoryError: unable to create new native thread I've attempted to hard-code the get_mem_opts function in the sbt-launch-lib.bash file to use various very high memory settings (e.g. -Xms5g) with a high MaxPermSize, etc., but to no avail. Any thoughts on this would be appreciated. I know of others having the same problem as well. Thanks! -- jay vyas
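For context: "unable to create new native thread" is thrown when the JVM cannot obtain another OS thread (the per-user process limit, or native memory for thread stacks), not when the heap is exhausted -- which is why raising -Xms/-Xmx doesn't help. A minimal sketch that reproduces the error (run only on a throwaway machine):

  object ThreadLimitDemo extends App {
    // Keep spawning daemon threads that just sleep; eventually the OS
    // refuses another native thread, no matter how much heap is free.
    var count = 0
    try {
      while (true) {
        val t = new Thread(new Runnable {
          def run(): Unit = Thread.sleep(600000)
        })
        t.setDaemon(true)
        t.start()
        count += 1
      }
    } catch {
      case e: OutOfMemoryError =>
        println(s"OOME after $count threads: ${e.getMessage}")
    }
  }

If the failure count is suspiciously low, raising the user process limit (ulimit -u) or shrinking thread stacks (-Xss) is the relevant knob, not the heap settings.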
Re: spark-ec2 1.0.2 creates EC2 cluster at wrong version
This shouldn't be a chicken-and-egg problem, since the script fetches the AMI from a known URL. Seems like an issue in publishing this release. On August 26, 2014 at 1:24:45 PM, Shivaram Venkataraman (shiva...@eecs.berkeley.edu) wrote: This is a chicken-and-egg problem in some sense. We can't change the ec2 script till we have made the release and uploaded the binaries -- but once that is done, we can't update the script. I think the model we support so far is that you can launch the latest Spark version from the master branch on GitHub. I guess we can try to add something to the release process that updates the script but doesn't commit it? The release managers might be able to add more. Thanks Shivaram On Tue, Aug 26, 2014 at 1:16 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I downloaded the source code release for 1.0.2 from here http://spark.apache.org/downloads.html and launched an EC2 cluster using spark-ec2. After the cluster finishes launching, I fire up the shell and check the version: scala> sc.version res1: String = 1.0.1 The startup banner also shows the same thing. Hmm... So I dig around and find that the spark_ec2.py script has the default Spark version set to 1.0.1. Derp. parser.add_option("-v", "--spark-version", default="1.0.1", help="Version of Spark to use: 'X.Y.Z' or a specific git hash") Is there any way to fix the release? It’s a minor issue, but could be very confusing. And how can we prevent this from happening again? Nick
Re: OutOfMemoryError when running sbt/sbt test
Thanks...! Some questions below. 1) You are suggesting that maybe this OOME is a symptom/red herring, and the true cause is that a thread can't spawn because of the ulimit. If so, possibly this could be flagged early on in the build. And -- where are so many threads coming from that I need to raise my limit? Is this a new feature added to Spark recently, and if so, will it affect deployment scenarios as well? And 2) possibly SBT_OPTS is where the memory settings should go? If so, then why do we have the get_mem_opts wrapper function coded to pass memory manually as Xmx/Xms options?

  execRunner $java_cmd \
    ${SBT_OPTS:-$default_sbt_opts} \
    $(get_mem_opts $sbt_mem) \
    ${java_opts} \
    ${java_args[@]} \
    -jar $sbt_jar \
    ${sbt_commands[@]} \
    ${residual_args[@]}

On Aug 26, 2014, at 8:58 PM, Mubarak Seyed spark.devu...@gmail.com wrote: What is your ulimit value? On Tue, Aug 26, 2014 at 5:49 PM, jay vyas jayunit100.apa...@gmail.com wrote: Hi Spark, I've been trying to build Spark, but I've been getting lots of OOME exceptions: https://gist.github.com/jayunit100/d424b6b825ce8517d68c For the most part, they are of the form: java.lang.OutOfMemoryError: unable to create new native thread I've attempted to hard-code the get_mem_opts function in the sbt-launch-lib.bash file to use various very high memory settings (e.g. -Xms5g) with a high MaxPermSize, etc., but to no avail. Any thoughts on this would be appreciated. I know of others having the same problem as well. Thanks! -- jay vyas
Re: OutOfMemoryError when running sbt/sbt test
Hi Jay, The recommended way to build Spark from source is with Maven. You'll want to follow the steps at https://spark.apache.org/docs/latest/building-with-maven.html and set MAVEN_OPTS to prevent OOM errors during the build. Thanks On Tue, Aug 26, 2014 at 5:49 PM, jay vyas jayunit100.apa...@gmail.com wrote: Hi Spark, I've been trying to build Spark, but I've been getting lots of OOME exceptions: https://gist.github.com/jayunit100/d424b6b825ce8517d68c For the most part, they are of the form: java.lang.OutOfMemoryError: unable to create new native thread I've attempted to hard-code the get_mem_opts function in the sbt-launch-lib.bash file to use various very high memory settings (e.g. -Xms5g) with a high MaxPermSize, etc., but to no avail. Any thoughts on this would be appreciated. I know of others having the same problem as well. Thanks! -- jay vyas
Re: Handling stale PRs
Nicholas Chammas wrote: Dunno how many committers Discourse has, but it looks like they've managed their PRs well. I hope we can do as well in this regard as they have. Discourse developers appear to eat their own dog food: https://meta.discourse.org. Improved collaboration and a shared vision might be a reason for their success. -- Madhu https://www.linkedin.com/in/msiddalingaiah
Re: spark-ec2 1.0.2 creates EC2 cluster at wrong version
Yes, this was an oversight on my part. I have opened a JIRA for this: https://issues.apache.org/jira/browse/SPARK-3242 For the time being, the workaround is to pass the version 1.0.2 explicitly to the script. TD On Tue, Aug 26, 2014 at 6:39 PM, Matei Zaharia matei.zaha...@gmail.com wrote: This shouldn't be a chicken-and-egg problem, since the script fetches the AMI from a known URL. Seems like an issue in publishing this release. On August 26, 2014 at 1:24:45 PM, Shivaram Venkataraman ( shiva...@eecs.berkeley.edu) wrote: This is a chicken-and-egg problem in some sense. We can't change the ec2 script till we have made the release and uploaded the binaries -- but once that is done, we can't update the script. I think the model we support so far is that you can launch the latest Spark version from the master branch on GitHub. I guess we can try to add something to the release process that updates the script but doesn't commit it? The release managers might be able to add more. Thanks Shivaram On Tue, Aug 26, 2014 at 1:16 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I downloaded the source code release for 1.0.2 from here http://spark.apache.org/downloads.html and launched an EC2 cluster using spark-ec2. After the cluster finishes launching, I fire up the shell and check the version: scala> sc.version res1: String = 1.0.1 The startup banner also shows the same thing. Hmm... So I dig around and find that the spark_ec2.py script has the default Spark version set to 1.0.1. Derp. parser.add_option("-v", "--spark-version", default="1.0.1", help="Version of Spark to use: 'X.Y.Z' or a specific git hash") Is there any way to fix the release? It’s a minor issue, but could be very confusing. And how can we prevent this from happening again? Nick