[spark-streaming] Is this LocalInputDStream useful to someone?
Hi guys,

In my recent blog post (http://mandubian.com/2014/03/08/zpark-ml-nio-1/), I needed an InputDStream helper, looking like NetworkInputDStream, to be able to push my data into a DStream in an async way. But I didn't want the remoting aspect, as my data source runs locally and nowhere else, and I didn't want my InputDStream to be treated as a NetworkInputDStream, since those get special management in the DStream scheduler so that they can potentially be remoted.

So I've implemented this LocalInputDStream, which provides a simple push with a receiver based on an actor, creating BlockRDDs but ensuring it won't be remoted: https://github.com/mandubian/zpark-ztream/blob/master/src/main/scala/LocalInputDStream.scala (the code is just a hack of NetworkInputDStream), and an instance of it: https://github.com/mandubian/zpark-ztream/blob/master/src/main/scala/ZparkInputDStream.scala

Is this something useful to the spark-streaming project that I could contribute (in a PR), or have I totally missed something in the current project code that already does the same?

Best regards
Pascal
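For readers who haven't opened the linked files: the idea above can be boiled down to a minimal, purely illustrative sketch. This is NOT Pascal's implementation (his uses an actor-based receiver producing BlockRDDs); it is a simplified queue-based variant against the Spark 0.9 API, and the class and method names here are assumptions for illustration:

```scala
import scala.collection.mutable
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

// Hypothetical sketch: a driver-local, push-style InputDStream that is NOT a
// NetworkInputDStream, so the scheduler never tries to launch a remote receiver.
// Data pushed between batch intervals is turned into an RDD at each interval.
class SimpleLocalInputDStream[T: ClassTag](ssc_ : StreamingContext)
    extends InputDStream[T](ssc_) {

  private val buffer = new mutable.ArrayBuffer[T] with mutable.SynchronizedBuffer[T]

  // Called by the local producer (e.g. an actor) to enqueue data.
  def push(data: Seq[T]): Unit = buffer ++= data

  override def start(): Unit = {}
  override def stop(): Unit = {}

  override def compute(validTime: Time): Option[RDD[T]] = {
    // Snapshot and clear the buffer atomically, then wrap it in a local RDD.
    val batch = buffer.synchronized {
      val snapshot = buffer.toArray
      buffer.clear()
      snapshot
    }
    Some(ssc_.sparkContext.makeRDD(batch))
  }
}
```

Because the RDD is built with makeRDD on the driver rather than from remotely stored blocks, nothing about this stream can be scheduled onto a remote receiver, which is the property the post is after.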
Re: how to config worker HA
Can someone help me?

2014-03-12 21:26 GMT+08:00 qingyang li liqingyang1...@gmail.com:

In addition: on https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#hadoop-datasets I find that an RDD can be stored using different storage levels, and I also find StorageLevel's attribute MEMORY_ONLY_2: "Same as the levels above, but replicate each partition on two cluster nodes."
1. Is this one aspect of fault tolerance?
2. Will replicating each partition on two cluster nodes help with worker HA?
3. Is there a MEMORY_ONLY_3 that could replicate each partition on three cluster nodes?

2014-03-12 12:11 GMT+08:00 qingyang li liqingyang1...@gmail.com:

I have one table in memory; when one worker becomes dead, I cannot query data from that table. Here is its storage status (http://192.168.1.101:4040/storage/rdd?id=47):

RDD Name:          table01
Storage Level:     Memory Deserialized 1x Replicated
Cached Partitions: 119
Fraction Cached:   88%
Size in Memory:    697.0 MB
Size on Disk:      0.0 B

So, my questions are:
1. What does "Memory Deserialized 1x Replicated" mean?
2. How do I configure worker HA so that I can still query data when one worker is dead?
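The storage level in question is set per RDD rather than through cluster configuration. A hedged sketch of what that looks like (the master URL, app name, and file path are assumptions for illustration):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Persist with 2x replication: if the worker holding a cached partition
// dies, the second in-memory copy on another node can still be read.
val sc = new SparkContext("spark://master:7077", "replication-demo")
val rdd = sc.textFile("hdfs:///data/table01")
rdd.persist(StorageLevel.MEMORY_ONLY_2)
rdd.count() // materializes the cache; each partition now lives on two nodes
```

As for question 3: to my knowledge there is no predefined MEMORY_ONLY_3 constant, but the StorageLevel factory takes a replication count, so a 3x in-memory level should be constructible as StorageLevel(false, true, true, 3).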
Re: Spark AMI
It's mostly a stock CentOS installation with some scripts.

On Thu, Mar 20, 2014 at 2:53 AM, Usman Ghani us...@platfora.com wrote: Is there anything special about the Spark AMIs, or are they just stock CentOS installations?
Re: Largest input data set observed for Spark.
Reynold, how complex was that job (I guess in terms of number of transforms and actions), and how long did it take to process? -Suren

On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin r...@databricks.com wrote: Actually we just ran a job with 70TB+ of compressed data on 28 worker nodes - I didn't count the size of the uncompressed data, but I am guessing it is somewhere between 200TB and 700TB.

On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani us...@platfora.com wrote: All, what is the largest input data set y'all have come across that has been successfully processed in production using Spark? Ballpark?

--
SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hira...@velos.io
W: www.velos.io
Re: Largest input data set observed for Spark.
I'm not really at liberty to discuss details of the job. It involves some expensive aggregated statistics, and took 10 hours to complete (mostly bottlenecked by network IO).

On Thu, Mar 20, 2014 at 11:12 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Reynold, how complex was that job (I guess in terms of number of transforms and actions), and how long did it take to process? -Suren

On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin r...@databricks.com wrote: Actually we just ran a job with 70TB+ of compressed data on 28 worker nodes - I didn't count the size of the uncompressed data, but I am guessing it is somewhere between 200TB and 700TB.

On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani us...@platfora.com wrote: All, what is the largest input data set y'all have come across that has been successfully processed in production using Spark? Ballpark?

--
SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hira...@velos.io
W: www.velos.io
Re: Largest input data set observed for Spark.
Reynold, just curious, did you guys run it on AWS? - Henry

On Thu, Mar 20, 2014 at 11:08 AM, Reynold Xin r...@databricks.com wrote: Actually we just ran a job with 70TB+ of compressed data on 28 worker nodes - I didn't count the size of the uncompressed data, but I am guessing it is somewhere between 200TB and 700TB.

On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani us...@platfora.com wrote: All, what is the largest input data set y'all have come across that has been successfully processed in production using Spark? Ballpark?
Re: Spark 0.9.1 release
It would be great if SPARK-1101 (Umbrella for hardening Spark on YARN, https://spark-project.atlassian.net/browse/SPARK-1101) could get into 0.9.1. Thanks, Bhaskar

On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das tathagata.das1...@gmail.com wrote:

Hello everyone,

Since the release of Spark 0.9, we have received a number of important bug fixes and we would like to make a bug-fix release, Spark 0.9.1. We are going to cut a release candidate soon and we would love it if people test it out. We have backported several bug fixes into the 0.9 branch and updated JIRA accordingly: https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)

Please let me know if there are fixes that were not backported but that you would like to see in 0.9.1. Thanks!

TD
Re: Spark 0.9.1 release
Hey Tom,

I'll pull in [SPARK-1053] (Should not require SPARK_YARN_APP_JAR when running on YARN) and [SPARK-1051] (On YARN, executors don't doAs the submitting user). The pyspark one I would consider more of an enhancement, so it might not be appropriate for a point release.

On the first one: someone recently sent me a personal e-mail reporting some problems with this. I'll ask them to forward it to you/the dev list. Might be worth looking into before merging.

On [SPARK-1051], which means executors can't read/write files that the yarn user doesn't have permissions to but the submitting user does: good call on this one.

- Patrick
Re: Spark 0.9.1 release
Thanks for the heads up; I saw that and will make sure it is resolved before pulling into 0.9. Unless I'm missing something, they should just use sc.addJar to distribute the jar rather than relying on SPARK_YARN_APP_JAR.

Tom

On Thursday, March 20, 2014 3:31 PM, Patrick Wendell pwend...@gmail.com wrote:

Hey Tom,

I'll pull in [SPARK-1053] (Should not require SPARK_YARN_APP_JAR when running on YARN) and [SPARK-1051] (On YARN, executors don't doAs the submitting user). The pyspark one I would consider more of an enhancement, so it might not be appropriate for a point release.

On the first one: someone recently sent me a personal e-mail reporting some problems with this. I'll ask them to forward it to you/the dev list. Might be worth looking into before merging.

On [SPARK-1051], which means executors can't read/write files that the yarn user doesn't have permissions to but the submitting user does: good call on this one.

- Patrick
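The sc.addJar alternative Tom mentions amounts to shipping the application jar from the driver instead of pointing YARN at it via the environment variable. A hedged sketch (the app name and jar path are assumptions for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("my-yarn-app")
val sc = new SparkContext(conf)

// Instead of setting SPARK_YARN_APP_JAR, ship the application jar to the
// executors explicitly; HDFS paths and driver-local paths both work.
sc.addJar("hdfs:///user/me/my-app.jar")
```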
Re: new Catalyst/SQL component merged into master
Hi everyone,

I'm very excited about merging this new feature into Spark! We have a lot of cool things in the pipeline, including: porting Shark's in-memory columnar format to Spark SQL, code generation for expression evaluation, and improved support for complex types in Parquet.

I would love to hear feedback on the interfaces and on what is missing. In particular, while we have pretty good test coverage for Hive, there has not been a lot of testing with real Hive deployments, and there is certainly a lot more work to do. So please test it out, and if there are any missing features let me know!

Michael

On Thu, Mar 20, 2014 at 6:11 PM, Reynold Xin r...@databricks.com wrote:

Hi all,

I'm excited to announce a new module in Spark (SPARK-1251). After an initial review we've merged this into Spark as an alpha component, to be included in Spark 1.0. This new component adds some exciting features, including:

- schema-aware RDD programming via an experimental DSL
- native Parquet support
- support for executing SQL against RDDs

The pull request itself contains more information: https://github.com/apache/spark/pull/146

You can also find the documentation for this new component here: http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html

This contribution was led by Michael Armbrust, with work from several other contributors who I'd like to highlight here: Yin Huai, Cheng Lian, Andre Schumacher, Timothy Chen, Henry Cook, and Mark Hamstra.

- Reynold
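To make the three bullet points above concrete, here is a hedged sketch in the style of the linked sql-programming-guide; the case class, input file, and an existing SparkContext `sc` are assumptions, and the alpha API (registerAsTable, the sql method on SQLContext) may change:

```scala
import org.apache.spark.sql.SQLContext

// Schema-aware RDD programming: the case class defines the schema.
case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext._

// Build an RDD of Persons from a "name,age" text file.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

// Register the RDD as a table and run SQL directly against it.
people.registerAsTable("people")
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
```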
Please subscribe me
Please subscribe me.