I want to contribute two quality measures (ARHR and HR) for top-N recommendation systems to MLlib. Is this meaningful?
Hi: In the paper Item-Based Top-N Recommendation Algorithms (https://stuyresearch.googlecode.com/hg/blake/resources/10.1.1.102.4451.pdf), there are two metrics for measuring the quality of recommendations: HR and ARHR. If I use ALS (implicit) for a top-N recommendation system, I want to check its quality, and ARHR and HR are two good quality measures. I want to contribute them to Spark MLlib, so I want to know whether this is meaningful. (1) If n is the total number of customers/users, the hit-rate of the recommendation algorithm is computed as hit-rate (HR) = (number of hits) / n. (2) If h is the number of hits that occurred at positions p1, p2, ..., ph within the top-N lists (i.e., 1 ≤ pi ≤ N), then the average reciprocal hit-rank is ARHR = (1/n) * (1/p1 + 1/p2 + ... + 1/ph).
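For concreteness, here is a minimal Scala sketch of the two metrics exactly as defined above, assuming the paper's leave-one-out setup (one held-out item per test user); the names recommendations, heldOut, and TopNMetrics are illustrative, not a proposed MLlib API.

// Minimal sketch of HR and ARHR over top-N lists, assuming one held-out
// item per test user. Names are illustrative, not an MLlib API.
object TopNMetrics {
  /** recommendations: userId -> ranked top-N item ids (best first);
    * heldOut: userId -> the single withheld test item. */
  def hitRateAndArhr(
      recommendations: Map[Long, Seq[Long]],
      heldOut: Map[Long, Long]): (Double, Double) = {
    val n = heldOut.size.toDouble
    // 1-based rank p_i of the held-out item in each user's top-N list (hits only)
    val ranks = heldOut.toSeq.flatMap { case (user, item) =>
      recommendations.get(user).flatMap { topN =>
        val idx = topN.indexOf(item)
        if (idx >= 0) Some(idx + 1) else None
      }
    }
    val hr   = ranks.size / n              // HR   = (number of hits) / n
    val arhr = ranks.map(1.0 / _).sum / n  // ARHR = (1/n) * sum(1 / p_i)
    (hr, arhr)
  }
}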
Re: Mesos/Spark Deadlock
We have not tried the work-around because there are other bugs in there that affected our set-up, though it seems it would help.

On Mon, Aug 25, 2014 at 12:54 AM, Timothy Chen tnac...@gmail.com wrote: +1 to have the work-around in. I'll be investigating from the Mesos side too. Tim

On Sun, Aug 24, 2014 at 9:52 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yeah, Mesos in coarse-grained mode probably wouldn't work here. It's too bad that this happens in fine-grained mode -- would be really good to fix. I'll see if we can get the workaround in https://github.com/apache/spark/pull/1860 into Spark 1.1. Incidentally, have you tried that? Matei

On August 23, 2014 at 4:30:27 PM, Gary Malouf (malouf.g...@gmail.com) wrote: Hi Matei, We have an analytics team that uses the cluster on a daily basis. They use two types of 'run modes': 1) For running actual queries, they set spark.executor.memory to something between 4 and 8GB of RAM per worker. 2) A shell that takes a minimal amount of memory on workers (128MB) for prototyping out a larger query. This allows them to not take up RAM on the cluster when they do not really need it. We see the deadlocks when there are a few shells in either case. From the usage patterns we have, coarse-grained mode would be a challenge, as we have to constantly remind people to kill their shells as soon as their queries finish. Am I correct in viewing Mesos in coarse-grained mode as being similar to Spark Standalone's CPU allocation behavior?

On Sat, Aug 23, 2014 at 7:16 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Gary, just as a workaround, note that you can use Mesos in coarse-grained mode by setting spark.mesos.coarse=true. Then it will hold onto CPUs for the duration of the job. Matei

On August 23, 2014 at 7:57:30 AM, Gary Malouf (malouf.g...@gmail.com) wrote: I just wanted to bring up a significant Mesos/Spark issue that makes the combo difficult to use for teams larger than 4-5 people. It's covered in https://issues.apache.org/jira/browse/MESOS-1688. My understanding is that Spark's use of executors in fine-grained mode is very different behavior from many of the other common frameworks for Mesos.
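For reference, the coarse-grained workaround Matei describes comes down to one configuration flag; a minimal sketch, in which spark.mesos.coarse and spark.executor.memory are the settings named in this thread, while the master URL and app name are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the workaround: run Spark on Mesos in coarse-grained mode so it
// holds CPUs (and its executors) for the duration of the job.
val conf = new SparkConf()
  .setMaster("mesos://zk://zk1:2181/mesos")  // placeholder Mesos master URL
  .setAppName("coarse-grained-example")      // placeholder app name
  .set("spark.mesos.coarse", "true")
  .set("spark.executor.memory", "4g")        // per the analytics team's query mode
val sc = new SparkContext(conf)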
Re: Pull requests will be automatically linked to JIRA when submitted
FYI: Looks like the Mesos folks also have a bot to do automatic linking, but it appears to have been provided to them somehow by the ASF. See this comment as an example: https://issues.apache.org/jira/browse/MESOS-1688?focusedCommentId=14109078page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14109078 It might be a small win to push this work to a bot the ASF manages, if we can get access to it (and if we have no concerns about depending on another external service). Nick

On Mon, Aug 11, 2014 at 4:10 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Thanks for looking into this. I think little tools like this are super helpful. Would it hurt to open a request with INFRA to install/configure the JIRA-GitHub plugin while we continue to use the Python script we have? I wouldn't mind opening that JIRA issue with them. Nick

On Mon, Aug 11, 2014 at 12:52 PM, Patrick Wendell pwend...@gmail.com wrote: I spent some time on this and I'm not sure either of these is an option, unfortunately. We typically can't use custom JIRA plug-ins because this JIRA is controlled by the ASF and we don't have rights to modify most things about how it works (it's a large shared JIRA instance used by more than 50 projects). It's worth looking into whether they can do something. In general we've tended to avoid going through ASF infra whenever possible, since they are generally overloaded and things move very slowly, even when there are outages. Here is the script we use to do the sync: https://github.com/apache/spark/blob/master/dev/github_jira_sync.py It might be possible to modify this to support post-hoc changes, but we'd need to think about how to do so while minimizing calls to the ASF JIRA API, which I found to be very slow. - Patrick

On Mon, Aug 11, 2014 at 7:51 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: It looks like this script doesn't catch PRs that are opened and *then* have the JIRA issue ID added to the name. Would it be easy to somehow have the script trigger on PR name changes as well as PR creates? Alternately, is there a reason we can't or don't want to use the plugin mentioned below? (I'm assuming it covers cases like this, but I'm not sure.) Nick

On Wed, Jul 23, 2014 at 12:52 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: By the way, it looks like there's a JIRA plugin that integrates it with GitHub: - https://marketplace.atlassian.com/plugins/com.atlassian.jira.plugins.jira-bitbucket-connector-plugin - https://confluence.atlassian.com/display/BITBUCKET/Linking+Bitbucket+and+GitHub+accounts+to+JIRA It does the automatic linking and shows some additional information (https://marketplace-cdn.atlassian.com/files/images/com.atlassian.jira.plugins.jira-bitbucket-connector-plugin/86ff1a21-44fb-4227-aa4f-44c77aec2c97.png) that might be nice to have for heavy JIRA users. Nick

On Sun, Jul 20, 2014 at 12:50 PM, Patrick Wendell pwend...@gmail.com wrote: Yeah, it needs to have SPARK-XXX in the title (this is the format we request already). It just works with a small synchronization script I wrote that we run every five minutes on Jenkins, using the GitHub and JIRA APIs: https://github.com/apache/spark/commit/49e472744951d875627d78b0d6e93cd139232929 - Patrick

On Sun, Jul 20, 2014 at 8:06 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: That's pretty neat. How does it work? Do we just need to put the issue ID (e.g. SPARK-1234) anywhere in the pull request? Nick

On Sat, Jul 19, 2014 at 11:10 PM, Patrick Wendell pwend...@gmail.com wrote: Just a small note: today I committed a tool that will automatically mirror pull requests to JIRA issues, so contributors will no longer have to manually post a pull request on the JIRA when they make one. It will create a link on the JIRA and also make a comment to trigger an e-mail to people watching. This should make some things easier, such as avoiding accidental duplicate effort on the same JIRA. - Patrick
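For readers wondering about the convention, a hedged sketch of the title matching the sync relies on; this is illustrative Scala, not the actual dev/github_jira_sync.py code (which is Python):

// A PR title qualifies if it contains a JIRA ID of the form SPARK-<digits>.
val JiraId = """SPARK-\d+""".r

def jiraIdsInTitle(title: String): List[String] =
  JiraId.findAllIn(title).toList

// e.g. jiraIdsInTitle("[SPARK-1234] Fix shuffle spill") == List("SPARK-1234")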
Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing
I am running the code with @rxin's patch in standalone mode. In my case I am registering org.apache.spark.graphx.GraphKryoRegistrator. Recently I started to see com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR. Has anyone seen this? Could it be related to this issue? Here is the trace:

--
vids (org.apache.spark.graphx.impl.VertexAttributeBlock)
com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
com.esotericsoftware.kryo.io.Input.require(Input.java:169)
com.esotericsoftware.kryo.io.Input.readLong_slow(Input.java:710)
com.esotericsoftware.kryo.io.Input.readLong(Input.java:665)
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$LongArraySerializer.read(DefaultArraySerializers.java:127)
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$LongArraySerializer.read(DefaultArraySerializers.java:107)
com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
org.apache.spark.storage.BlockManager$LazyProxyIterator$1.hasNext(BlockManager.scala:1054)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192)
org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78)
org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
org.apache.spark.graphx.EdgeRDD$$anonfun$mapEdgePartitions$1.apply(EdgeRDD.scala:87)
org.apache.spark.graphx.EdgeRDD$$anonfun$mapEdgePartitions$1.apply(EdgeRDD.scala:85)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:202)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
--
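For anyone trying to reproduce this, a hedged sketch of how the registrator in question is typically wired up in Spark 1.x; both property keys are real Spark settings, and the app name is a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

// Enable Kryo and register the GraphX classes via the custom registrator.
val conf = new SparkConf()
  .setAppName("graphx-kryo-example")  // placeholder
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.apache.spark.graphx.GraphKryoRegistrator")
val sc = new SparkContext(conf)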
Re: saveAsTextFile to s3 on spark does not work, just hangs
Hi jerryye, Maybe if you voted up my question on Stack Overflow it would get some traction and we would get nearer to a solution. Thanks, Amnon
Re: Mesos/Spark Deadlock
This is kind of weird then; it seems perhaps unrelated to this issue (or at least to the way I understood it). Is the problem maybe that Mesos saw 0 MB being freed and didn't re-offer the machine *even though there was more than 32 MB free overall*? Matei

On August 25, 2014 at 12:59:59 PM, Cody Koeninger (c...@koeninger.org) wrote: I definitely saw a case where:
a. the only job running was a 256m shell
b. I started a 2g job
c. a little while later the same user as in (a) started another 256m shell
My job immediately stopped making progress. Once user a killed his shells, it started again. This is on nodes with ~15G of memory, on which we have successfully run 8G jobs.

On Mon, Aug 25, 2014 at 2:02 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW, it seems to me that even without that patch, you should be getting tasks launched as long as you leave at least 32 MB of memory free on each machine (that is, the sum of the executor memory sizes is not exactly the same as the total size of the machine). Then Mesos will be able to re-offer that machine whenever CPUs free up. Matei

[earlier messages in this thread, quoted verbatim above, trimmed]
Re: Mesos/Spark Deadlock
Anyway it would be good if someone from the Mesos side investigates this and proposes a solution. The 32 MB per task hack isn't completely foolproof either (e.g. people might allocate all the RAM to their executor and thus stop being able to launch tasks), so maybe we wait on a Mesos fix for this one. Matei

On August 25, 2014 at 1:07:15 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote: [earlier messages in this thread, quoted verbatim above, trimmed]
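To make the failure mode concrete, an illustrative Scala sketch, not Spark's or Mesos's actual scheduler code: in fine-grained mode each executor holds its full spark.executor.memory for the life of the app, so offers for a saturated node carry CPUs but essentially no memory, and a per-task memory floor such as the 32 MB hack can never be satisfied there.

// Illustrative pseudocode only (not real Spark/Mesos code).
case class Offer(cpusFree: Double, memFreeMB: Double)

// Under the workaround, launching a task needs a CPU plus a small memory floor.
def canLaunchTask(offer: Offer, taskMemFloorMB: Double = 32.0): Boolean =
  offer.cpusFree >= 1.0 && offer.memFreeMB >= taskMemFloorMB

// A node whose memory is entirely held by idle shells: finished tasks free
// CPUs, but never memory, so nothing can ever launch there.
assert(!canLaunchTask(Offer(cpusFree = 8.0, memFreeMB = 0.0)))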
Re: Working Formula for Hive 0.13?
Thanks for working on this! It's unclear at the moment exactly how we are going to handle this, since the end goal is to be compatible with as many versions of Hive as possible. That said, I think it would be great to open a PR in this case. Even if we don't merge it, that's a good way to get it on people's radar and have a discussion about the changes that are required.

On Sun, Aug 24, 2014 at 7:11 PM, scwf wangf...@huawei.com wrote: I have worked on a branch that updates the Hive version to hive-0.13 (using org.apache.hive): https://github.com/scwf/spark/tree/hive-0.13 I am wondering whether it's ok to make a PR now, because the hive-0.13 version is not compatible with hive-0.12, and here I used org.apache.hive.

On 2014/7/29 8:22, Michael Armbrust wrote: A few things:
- When we upgrade to Hive 0.13.0, Patrick will likely republish the hive-exec jar just as we did for 0.12.0.
- Since we have to tie into some pretty low-level APIs, it is unsurprising that the code doesn't just compile out of the box against 0.13.0.
- ScalaReflection is for determining a Schema from Scala classes, not reflection-based bridge code. Either way, it's unclear to me if there is any reason to use reflection to support multiple versions, instead of just upgrading to Hive 0.13.0.
One question I have is: what is the goal of upgrading to Hive 0.13.0? Is it purely because you are having problems connecting to newer metastores? Are there some features you are hoping for? This will help me prioritize this effort. Michael

On Mon, Jul 28, 2014 at 4:05 PM, Ted Yu yuzhih...@gmail.com wrote: I was looking for a class where reflection-related code should reside. I found this but don't think it is the proper class for bridging differences between hive 0.12 and 0.13.1: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala Cheers

On Mon, Jul 28, 2014 at 3:41 PM, Ted Yu yuzhih...@gmail.com wrote: After manually copying hive 0.13.1 jars to the local maven repo, I got the following errors when building the spark-hive_2.10 module:

[ERROR] /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:182: type mismatch; found: String, required: Array[String]
[ERROR] val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf)
[ERROR] /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:60: value getAllPartitionsForPruner is not a member of org.apache.hadoop.hive.ql.metadata.Hive
[ERROR] client.getAllPartitionsForPruner(table).toSeq
[ERROR] /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:267: overloaded method constructor TableDesc with alternatives: (x$1: Class[_ <: org.apache.hadoop.mapred.InputFormat[_, _]], x$2: Class[_], x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc <and> ()org.apache.hadoop.hive.ql.plan.TableDesc cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in value tableDesc)(in value tableDesc)], java.util.Properties)
[ERROR] val tableDesc = new TableDesc(
[WARNING] Class org.antlr.runtime.tree.CommonTree not found - continuing with a stub.
[WARNING] Class org.antlr.runtime.Token not found - continuing with a stub.
[WARNING] Class org.antlr.runtime.tree.Tree not found - continuing with a stub.
[ERROR] while compiling: /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala during phase: typer; library version: 2.10.4; compiler version: 2.10.4

The above shows incompatible changes between 0.12 and 0.13.1, e.g. the first error corresponds to the following method in CommandProcessorFactory: public static CommandProcessor get(String[] cmd, HiveConf conf) Cheers

On Mon, Jul 28, 2014 at 1:32 PM, Steve Nunez snu...@hortonworks.com wrote: So, do we have a short-term fix until Hive 0.14 comes out? Perhaps adding the hive-exec jar to the spark-project repo? It doesn't look like there's a release date schedule for 0.14.

On 7/28/14, 10:50, Cheng Lian lian.cs@gmail.com wrote: Exactly, forgot to mention the Hulu team also made changes to cope with those incompatibility issues, but they said that's relatively easy once the re-packaging work is done.

On Tue, Jul 29, 2014 at 1:20 AM, Patrick Wendell pwend...@gmail.com wrote: I've heard from Cloudera that there were Hive internal changes between 0.12 and 0.13 that required code re-writing. Over time it might be possible for us to integrate with Hive using APIs that are more stable (this is the domain of Michael/Cheng/Yin more than me!). It would be interesting to see what the Hulu folks did. - Patrick On Mon, Jul 28, 2014 at [truncated]
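As an aside, the reflection-based bridging Michael mentions could look roughly like the following for the first compile error above (CommandProcessorFactory.get takes a String in Hive 0.12 but a String[] in 0.13). This is a hedged sketch, not Spark's actual code:

import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.processors.{CommandProcessor, CommandProcessorFactory}

// Pick whichever CommandProcessorFactory.get signature this Hive version has.
def getProcessor(token: String, conf: HiveConf): CommandProcessor = {
  val factory = classOf[CommandProcessorFactory]
  try {
    // Hive 0.13: get(String[] cmd, HiveConf conf)
    val m = factory.getMethod("get", classOf[Array[String]], classOf[HiveConf])
    m.invoke(null, Array(token), conf).asInstanceOf[CommandProcessor]
  } catch {
    case _: NoSuchMethodException =>
      // Hive 0.12: get(String cmd, HiveConf conf)
      val m = factory.getMethod("get", classOf[String], classOf[HiveConf])
      m.invoke(null, token, conf).asInstanceOf[CommandProcessor]
  }
}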
Re: Storage Handlers in Spark SQL
- dev list, + user list

You should be able to query Spark SQL using JDBC, starting with the 1.1 release. There is some documentation in the repo (https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#running-the-thrift-jdbc-server), and we'll update the official docs once the release is out.

On Thu, Aug 21, 2014 at 4:43 AM, Niranda Perera nira...@wso2.com wrote: Hi, I have been playing around with Spark for the past few days, evaluating the possibility of migrating from Hive/Hadoop to Spark (Spark SQL). I am working on the WSO2 Business Activity Monitor (WSO2 BAM, https://docs.wso2.com/display/BAM241/WSO2+Business+Activity+Monitor+Documentation), which currently employs Hive. We are considering Spark as a successor to Hive, given its performance enhancements. We currently employ several custom storage handlers in Hive, for example the WSO2 JDBC and Cassandra storage handlers: https://docs.wso2.com/display/BAM241/JDBC+Storage+Handler+for+Hive https://docs.wso2.com/display/BAM241/Creating+Hive+Queries+to+Analyze+Data#CreatingHiveQueriestoAnalyzeData-cas I would like to know whether Spark SQL can work with these storage handlers (while using HiveContext, maybe)? Best regards -- *Niranda Perera* Software Engineer, WSO2 Inc. Mobile: +94-71-554-8430 Twitter: @n1r44 https://twitter.com/N1R44
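As a hedged sketch of what querying over JDBC could look like once the Thrift server is running: this assumes the HiveServer2-compatible default of localhost:10000, a Hive JDBC driver on the classpath, and an illustrative table name.

import java.sql.DriverManager

object ThriftServerQuery {
  def main(args: Array[String]): Unit = {
    // The Spark SQL Thrift server speaks the HiveServer2 protocol.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    try {
      val rs = conn.createStatement().executeQuery("SELECT count(*) FROM events") // placeholder table
      while (rs.next()) println(rs.getLong(1))
    } finally conn.close()
  }
}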
Re: saveAsTextFile to s3 on spark does not work, just hangs
Was the original issue with Spark 1.1 (i.e. the master branch) or an earlier release? One possibility is that your S3 bucket is in a remote Amazon region, which would make it very slow. In my experience, though, saveAsTextFile has worked even for pretty large datasets in that situation, so maybe there's something else in your job causing a problem. Have you tried other operations on the data, like count(), or saving synthetic datasets (e.g. sc.parallelize(1 to 100*1000*1000, 20).saveAsTextFile(...))? Matei

On August 25, 2014 at 12:09:25 PM, amnonkhen (amnon...@gmail.com) wrote: [quoted message trimmed; it appears verbatim in the posting above]
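Spelled out, the two checks suggested above might look like this in spark-shell; the bucket paths are placeholders, and s3n:// is assumed as the scheme for a Hadoop 1.x build:

val data = sc.textFile("s3n://my-bucket/input")    // placeholder path
data.count()                                       // does a simple action complete?

// Synthetic data: rules out the input itself as the cause of the hang.
sc.parallelize(1 to 100 * 1000 * 1000, 20)
  .saveAsTextFile("s3n://my-bucket/synthetic-out") // placeholder path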
Re: saveAsTextFile to s3 on spark does not work, just hangs
One other idea - when things freeze up, try to run jstack on the spark shell process and on the executors and attach the results. It could be that somehow you are encountering a deadlock somewhere.

On Mon, Aug 25, 2014 at 1:26 PM, Matei Zaharia matei.zaha...@gmail.com wrote: [quoted messages trimmed; they appear verbatim in the posting above]
Re: Pull requests will be automatically linked to JIRA when submitted
Hey Nicholas, That seems promising - I prefer having a proper link to having that fairly verbose comment though, because in some cases there will be dozens of comments and it could get lost. I wonder if they could do something where it posts a link instead... - Patrick

On Mon, Aug 25, 2014 at 11:06 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: [quoted thread trimmed; the same messages appear verbatim in the earlier posting above]
Re: saveAsTextFile to s3 on spark does not work, just hangs
Hi Matei, At least in my case, the s3 bucket is in the same region. Running count() works and so does generating synthetic data. What I saw was that the job would hang for over an hour with no progress but tasks would immediately start finishing if I cached the data. - jerry

On Mon, Aug 25, 2014 at 1:26 PM, Matei Zaharia [via Apache Spark Developers List] wrote: [quoted message and Nabble footer trimmed; the message appears verbatim in the posting above]
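A sketch of the workaround jerry describes, i.e. forcing the data to be computed and cached before the S3 write; the paths and the map function are illustrative:

val results = sc.textFile("s3n://my-bucket/input").map(_.toUpperCase) // illustrative job
results.cache()
results.count()                                   // materializes the RDD into the cache
results.saveAsTextFile("s3n://my-bucket/output")  // placeholder output path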
Re: Mesos/Spark Deadlock
Hi Matei, I'm going to investigate from both the Mesos and Spark sides and will hopefully have a good long-term solution. In the meantime, having a workaround to start with is going to unblock folks. Tim

On Mon, Aug 25, 2014 at 1:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: [earlier messages in this thread, quoted verbatim above, trimmed]
RE: Working Formula for Hive 0.13?
From my perspective, there are a few benefits to Hive 0.13.1+. The following are the four major reasons I can see why people have been asking to upgrade to Hive 0.13.1 recently:

1. Performance improvements, bug fixes, and patches (the usual case).
2. Native support for the Parquet format, with no need to provide custom JARs and a SerDe as with Hive 0.12 (depends on data format and queries).
3. Support for the Tez engine, which gives a performance improvement in several use cases.
4. Much-improved security in Hive 0.13.1 (ACLs, etc.).

These are the major benefits I see to upgrading from Hive 0.12.0 to Hive 0.13.1+. There may be others out there that I'm not aware of, but I do see it coming. My 2 cents.

On Mon, 25 Aug 2014, Michael Armbrust mich...@databricks.com wrote: [quoted replies trimmed; the same messages appear verbatim in the earlier posting above]
Re: saveAsTextFile to s3 on spark does not work, just hangs
Hi Patrick, Here's the process:

java -cp /root/ephemeral-hdfs/conf/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly-1.1.1-SNAPSHOT-hadoop1.0.4.jar -XX:MaxPermSize=128m -Djava.library.path=/root/ephemeral-hdfs/lib/native/ -Xms5g -Xmx10g -XX:MaxPermSize=10g -Dspark.akka.timeout=300 -Dspark.driver.port=59156 -Xms5g -Xmx10g -XX:MaxPermSize=10g -Xms58315M -Xmx58315M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sp...@ip-10-226-198-178.us-west-2.compute.internal:59156/user/CoarseGrainedScheduler 5 ip-10-38-9-181.us-west-2.compute.internal 8 akka.tcp://sparkwor...@ip-10-38-9-181.us-west-2.compute.internal:34533/user/Worker app-20140825214225-0001

Attached is the requested stack trace: jstack.txt (92K) http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/8006/0/jstack.txt

On Mon, Aug 25, 2014 at 1:35 PM, Patrick Wendell [via Apache Spark Developers List] wrote: [quoted messages and Nabble footer trimmed; they appear verbatim in the postings above]
Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing
Hi, Unless you manually patched Spark, if you have Reynold's patch for SPARK-2878, you also have the patch for SPARK-2893, which makes the underlying cause much more obvious and explicit. So the below is unlikely to be related to SPARK-2878. Graham

On 26 Aug 2014, at 4:13 am, npanj nitinp...@gmail.com wrote: [original message and stack trace trimmed; they appear verbatim in the earlier posting above]
Re: saveAsTextFile to s3 on spark does not work, just hangs
Hi Matei, The original issue happened on a spark-1.0.2-bin-hadoop2 installation. I will try the synthetic operation and see if I get the same results or not. Amnon

On Mon, Aug 25, 2014 at 11:26 PM, Matei Zaharia [via Apache Spark Developers List] wrote: [quoted message and Nabble footer trimmed; the message appears verbatim in the posting above]
Re: Mesos/Spark Deadlock
My problem is that I'm not sure this workaround would solve things, given the issue described here (where there was a lot of memory free but it didn't get re-offered). If you think it does, it would be good to explain why it behaves like that. Matei On August 25, 2014 at 2:28:18 PM, Timothy Chen (tnac...@gmail.com) wrote: Hi Matei, I'm going to investigate from both Mesos and Spark side will hopefully have a good long term solution. In the mean time having a work around to start with is going to unblock folks. Tim On Mon, Aug 25, 2014 at 1:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Anyway it would be good if someone from the Mesos side investigates this and proposes a solution. The 32 MB per task hack isn't completely foolproof either (e.g. people might allocate all the RAM to their executor and thus stop being able to launch tasks), so maybe we wait on a Mesos fix for this one. Matei On August 25, 2014 at 1:07:15 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote: This is kind of weird then, seems perhaps unrelated to this issue (or at least to the way I understood it). Is the problem maybe that Mesos saw 0 MB being freed and didn't re-offer the machine *even though there was more than 32 MB free overall*? Matei On August 25, 2014 at 12:59:59 PM, Cody Koeninger (c...@koeninger.org) wrote: I definitely saw a case where a. the only job running was a 256m shell b. I started a 2g job c. a little while later the same user as in a started another 256m shell My job immediately stopped making progress. Once user a killed his shells, it started again. This is on nodes with ~15G of memory, on which we have successfully run 8G jobs. On Mon, Aug 25, 2014 at 2:02 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW it seems to me that even without that patch, you should be getting tasks launched as long as you leave at least 32 MB of memory free on each machine (that is, the sum of the executor memory sizes is not exactly the same as the total size of the machine). Then Mesos will be able to re-offer that machine whenever CPUs free up. Matei On August 25, 2014 at 5:05:56 AM, Gary Malouf (malouf.g...@gmail.com) wrote: We have not tried the work-around because there are other bugs in there that affected our set-up, though it seems it would help. On Mon, Aug 25, 2014 at 12:54 AM, Timothy Chen tnac...@gmail.com wrote: +1 to have the work around in. I'll be investigating from the Mesos side too. Tim On Sun, Aug 24, 2014 at 9:52 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yeah, Mesos in coarse-grained mode probably wouldn't work here. It's too bad that this happens in fine-grained mode -- would be really good to fix. I'll see if we can get the workaround in https://github.com/apache/spark/pull/1860 into Spark 1.1. Incidentally have you tried that? Matei On August 23, 2014 at 4:30:27 PM, Gary Malouf (malouf.g...@gmail.com) wrote: Hi Matei, We have an analytics team that uses the cluster on a daily basis. They use two types of 'run modes': 1) For running actual queries, they set the spark.executor.memory to something between 4 and 8GB of RAM/worker. 2) A shell that takes a minimal amount of memory on workers (128MB) for prototyping out a larger query. This allows them to not take up RAM on the cluster when they do not really need it. We see the deadlocks when there are a few shells in either case. From the usage patterns we have, coarse-grained mode would be a challenge as we have to constantly remind people to kill their shells as soon as their queries finish. 
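A minimal sketch of the coarse-grained workaround discussed in this thread; the Mesos master URL, app name, and memory values below are all illustrative:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setMaster("mesos://zk://zk1:2181/mesos")  // illustrative Mesos master URL
    .setAppName("analytics-query")             // illustrative app name
    .set("spark.mesos.coarse", "true")         // hold CPUs for the app's duration
    .set("spark.executor.memory", "4g")        // leave headroom below the node's total RAM
  val sc = new SparkContext(conf)

Note the trade-off raised above: in coarse-grained mode the CPUs are only released when the application exits (e.g. after sc.stop()), which is exactly why idle shells become a problem.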
Re: [Spark SQL] off-heap columnar store
Hi Michael, This is great news. Is there any initial proposal or design for the caching to Tachyon that you can share so far? I don't think there is a JIRA ticket open to track this feature yet. - Henry

On Mon, Aug 25, 2014 at 1:13 PM, Michael Armbrust mich...@databricks.com wrote:

> What is the plan for getting Tachyon/off-heap support for the columnar compressed store? It's not in 1.1, is it?

It is not in 1.1, and there are no concrete plans for adding it at this point. Currently, there is more engineering investment going into caching parquet data in Tachyon instead. That approach is going to have much better support for nested data, leverages other work being done on parquet, and alleviates your concerns about wire-format compatibility. That said, if someone really wants to try to implement it, I don't think it would be very hard. The primary issue is going to be designing a clean interface that is not too tied to this one implementation.

> Also, how likely is the wire format for the columnar compressed data to change? That would be a problem for write-through or persistence.

We aren't making any guarantees at the moment that it won't change. It's currently only intended for temporary caching of data.
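As a rough illustration only (this is not a committed design; it assumes the Tachyon Hadoop client is on the classpath, and the host, port, and paths are made up), the "cache parquet data in Tachyon" direction Michael describes can already be approximated by writing and re-reading parquet files on a Tachyon-backed path:

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  // Write a parquet copy of the data into Tachyon-backed storage...
  val people = sqlContext.parquetFile("hdfs:///data/people.parquet")
  people.saveAsParquetFile("tachyon://tachyon-master:19998/cache/people.parquet")
  // ...then register and query the off-heap copy.
  val cached = sqlContext.parquetFile("tachyon://tachyon-master:19998/cache/people.parquet")
  cached.registerTempTable("people")
  sqlContext.sql("SELECT COUNT(*) FROM people").collect()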
Re: saveAsTextFile to s3 on spark does not work, just hangs
Got it. Another thing that would help is if you spot any exceptions or failed tasks in the web UI (http://driver:4040). Matei

On August 25, 2014 at 3:07:41 PM, amnonkhen (amnon...@gmail.com) wrote: Hi Matei, The original issue happened on a spark-1.0.2-bin-hadoop2 installation. I will try the synthetic operation and see if I get the same results or not. Amnon
Re: Mesos/Spark Deadlock
I don't think it solves Cody's problem, which still needs more investigation, but I believe it does solve the problem you described earlier. I just confirmed with the Mesos folks that we no longer need the minimum memory requirement, so we'll be dropping that soon; the workaround might not be needed for the next Mesos release. Tim

On Mon, Aug 25, 2014 at 3:06 PM, Matei Zaharia matei.zaha...@gmail.com wrote: My problem is that I'm not sure this workaround would solve things, given the issue described here (where there was a lot of memory free but it didn't get re-offered). If you think it does, it would be good to explain why it behaves like that. Matei
Too many CancelledKeyExceptions thrown from ConnectionManager
Hi Folks, We are testing our home-made KMeans algorithm using Spark on YARN. Recently, we've found that the application fails frequently when clustering over 300,000,000 users (each user is represented by a feature vector, and the whole data set is around 600,000,000). After digging into the job log, we've found many CancelledKeyExceptions thrown by ConnectionManager, but no other exceptions. We suspect the frequent CancelledKeyExceptions bring the whole application down, since the application often fails on the third or fourth iteration for large datasets. Any directional suggestions are welcome.

Errors in job log:

java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
    at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,43199)
14/08/25 19:04:32 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@2570cd62
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@2570cd62
java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
    at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@37c8b85a
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@37c8b85a
java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:287)
    at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913)
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@fcea3a4
14/08/25 19:04:32 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@fcea3a4

Best, Shengzhe
Re: GraphX seems to be broken while creating a large graph (6B nodes in my case)
I posted the fix on the JIRA ticket (https://issues.apache.org/jira/browse/SPARK-3190). To update the user list, this is indeed an integer overflow problem when summing up the partition sizes. The fix is to use Longs for the sum: https://github.com/apache/spark/pull/2106. Ankur
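A self-contained sketch of the class of bug Ankur describes (not the actual patch; the array below is hypothetical): summing per-partition sizes as Ints silently wraps around once the total exceeds Int.MaxValue (about 2.1 billion), which is exactly the regime a 6B-node graph lives in.

  // Six partitions of one billion elements each: 6B total.
  val partitionSizes: Array[Int] = Array.fill(6)(1000000000)
  val intTotal: Int   = partitionSizes.sum               // wraps around to 1705032704, not 6B
  val longTotal: Long = partitionSizes.map(_.toLong).sum // correct: 6000000000L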
Handling stale PRs
Check this out: https://github.com/apache/spark/pulls?q=is%3Aopen+is%3Apr+sort%3Aupdated-asc We're hitting close to 300 open PRs. Those are the least recently updated ones. I think having a low number of stale (i.e. not recently updated) PRs is a good thing to shoot for. It doesn't leave contributors hanging (which feels bad for contributors), and reduces project clutter (which feels bad for maintainers/committers). What is our approach to tackling this problem? I think communicating and enforcing a clear policy on how stale PRs are handled might be a good way to reduce the number of stale PRs we have without making contributors feel rejected. I don't know what such a policy would look like, but it should be enforceable and lightweight--i.e. it shouldn't feel like a hammer used to reject people's work, but rather a necessary tool to keep the project's contributions relevant and manageable. Nick
RDD replication in Spark
Hi, I've exercised the multiple options available for persist(), including RDD replication, and I have gone through the classes involved in caching/storing the RDDs at different levels. The StorageLevel class plays a pivotal role by recording whether to use memory or disk and whether to replicate the RDD on multiple nodes. The LocationIterator class iterates over the preferred machines, one by one, for each partition that is replicated, and I have a rough idea of CoalescedRDD. Please correct me if I am wrong. What I am looking for is the code that chooses the resources on which to replicate the RDDs. Can someone please tell me how replication takes place and how the resources for replication are chosen? I just want to know where I should look to understand how the replication happens. Thank you so much! regards -Karthik
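For context, a minimal sketch of requesting replication from the user side, assuming a running SparkContext sc and an illustrative input path; the *_2 storage levels ask for each partition to be stored on two nodes:

  import org.apache.spark.storage.StorageLevel

  val rdd = sc.textFile("hdfs:///data/input")   // illustrative path
  rdd.persist(StorageLevel.MEMORY_AND_DISK_2)   // replication factor 2
  rdd.count()                                   // materializes (and replicates) the cached blocks

The peer selection itself, i.e. which second node receives each block, happens inside the BlockManager when blocks are written with a replicated StorageLevel, so that is a good place to start reading.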
Re: Handling stale PRs
Hey Nicholas, In general we've been looking at these periodically (at least I have) and asking people to close out of date ones, but it's true that the list has gotten fairly large. We should probably have an expiry time of a few months and close them automatically. I agree that it's daunting to see so many open PRs. Matei
Re: saveAsTextFile to s3 on spark does not work, just hangs
There were no failures or exceptions.

On Tue, Aug 26, 2014 at 1:31 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Got it. Another thing that would help is if you spot any exceptions or failed tasks in the web UI (http://driver:4040). Matei
Re: saveAsTextFile to s3 on spark does not work, just hangs
Hey Amnon, Just to make sure I understand: you also saw the same issue with 1.0.2? I'm asking because whether or not this regresses the 1.0.2 behavior is important for our own bug tracking. - Patrick

On Mon, Aug 25, 2014 at 10:22 PM, Amnon Khen amnon...@gmail.com wrote: There were no failures or exceptions.