[spark-streaming] Is this LocalInputDStream useful to someone?

2014-03-20 Thread Pascal Voitot Dev
Hi guys,

In my recent blog post (http://mandubian.com/2014/03/08/zpark-ml-nio-1/), I
needed an InputDStream helper, similar to NetworkInputDStream, that would
let me push data into a DStream asynchronously. But I didn't want the
remoting aspect, as my data source runs locally and nowhere else, and I
didn't want my InputDStream to be treated as a NetworkInputDStream, since
those get special handling in the DStream scheduler so they can potentially
be remoted.

So I've implemented this LocalInputDStream that provides a simple push API
with a receiver based on an actor, creating BlockRDDs, but ensures it won't
be remoted:

https://github.com/mandubian/zpark-ztream/blob/master/src/main/scala/LocalInputDStream.scala

(the code is just a hack of NetworkInputDStream)

and an instance of it:
https://github.com/mandubian/zpark-ztream/blob/master/src/main/scala/ZparkInputDStream.scala
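To give an idea of the shape without clicking through, here is a
stripped-down sketch of the concept (not the linked code: the real version
uses an actor-based receiver and BlockRDDs, while this sketch just buffers
on the driver and parallelizes each batch; it assumes the 0.9-era
InputDStream can be extended from user code):

import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

// Sketch of a push-based, purely local InputDStream: no network receiver,
// so the scheduler never treats it as a NetworkInputDStream.
class LocalPushDStream[T: ClassTag](streamingContext: StreamingContext)
    extends InputDStream[T](streamingContext) {

  private val buffer = new ArrayBuffer[T]

  // Called by a local producer (e.g. an actor) to push data in.
  def push(items: Seq[T]): Unit = buffer.synchronized { buffer ++= items }

  override def start() {}
  override def stop() {}

  // Drain everything pushed since the last batch into an RDD.
  override def compute(validTime: Time): Option[RDD[T]] = {
    val batch = buffer.synchronized {
      val copy = buffer.toArray
      buffer.clear()
      copy
    }
    Some(streamingContext.sparkContext.parallelize(batch))
  }
}

With the 0.9 API you would still register it by hand
(ssc.registerInputStream(stream)) before calling ssc.start().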

Is this something useful that I could contribute to the spark-streaming
project (in a PR), or have I totally missed something in the current code
base that already does the same?

Best regards
Pascal


Re: how to config worker HA

2014-03-20 Thread qingyang li
Can someone help me?


2014-03-12 21:26 GMT+08:00 qingyang li liqingyang1...@gmail.com:

 In addition, on this page:
 https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#hadoop-datasets
 I found that an RDD can be stored using different storage levels, and I
 also noticed StorageLevel's attribute MEMORY_ONLY_2:
 "Same as the levels above, but replicate each partition on two cluster
 nodes."
 1. Is this one aspect of fault tolerance?
 2. Will replicating each partition on two cluster nodes help with worker
 node HA?
 3. Is there a MEMORY_ONLY_3 that could replicate each partition on three
 cluster nodes? (See the sketch just below.)
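A sketch answering question 3 against the 0.9 StorageLevel API (untested;
the factory arguments are useDisk, useMemory, deserialized, replication):

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext("local", "replication-demo")

// MEMORY_ONLY_2: each cached partition is replicated on two nodes.
val a = sc.parallelize(1 to 1000)
a.persist(StorageLevel.MEMORY_ONLY_2)

// There is no predefined MEMORY_ONLY_3, but the StorageLevel factory
// accepts an arbitrary replication factor:
val b = sc.parallelize(1 to 1000)
b.persist(StorageLevel(false, true, true, 3))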




 2014-03-12 12:11 GMT+08:00 qingyang li liqingyang1...@gmail.com:

  I have one table in memory; when one worker dies, I cannot query data
  from that table. Here is its storage status
  (http://192.168.1.101:4040/storage/rdd?id=47):

  RDD Name  Storage Level                       Cached Partitions  Fraction Cached  Size in Memory  Size on Disk
  table01   Memory Deserialized 1x Replicated   119                88%              697.0 MB        0.0 B

  So, my questions are:
  1. What does "Memory Deserialized 1x Replicated" mean?
  2. How do I configure worker HA so that I can query data even when one
  worker is dead?





Re: Spark AMI

2014-03-20 Thread Reynold Xin
It's mostly a stock CentOS installation with some scripts.




On Thu, Mar 20, 2014 at 2:53 AM, Usman Ghani us...@platfora.com wrote:

 Is there anything special about the Spark AMIs, or are they just stock
 CentOS installations?



Re: Largest input data set observed for Spark.

2014-03-20 Thread Surendranauth Hiraman
Reynold,

How complex was that job (I guess in terms of the number of transforms and
actions), and how long did it take to process?

-Suren



On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin r...@databricks.com wrote:

 Actually we just ran a job with 70TB+ of compressed data on 28 worker nodes -
 I didn't count the size of the uncompressed data, but I am guessing it is
 somewhere between 200TB and 700TB.



 On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani us...@platfora.com wrote:

  All,
  What is the largest input data set y'all have come across that has been
  successfully processed in production using Spark? Ballpark?
 




-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io


Re: Largest input data set observed for Spark.

2014-03-20 Thread Reynold Xin
I'm not really at liberty to discuss details of the job. It involves some
expensive aggregated statistics, and it took 10 hours to complete (mostly
bottlenecked by network I/O).





On Thu, Mar 20, 2014 at 11:12 AM, Surendranauth Hiraman 
suren.hira...@velos.io wrote:

 Reynold,

 How complex was that job (I guess in terms of the number of transforms and
 actions), and how long did it take to process?

 -Suren



 On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin r...@databricks.com wrote:

  Actually we just ran a job with 70TB+ of compressed data on 28 worker
  nodes - I didn't count the size of the uncompressed data, but I am
  guessing it is somewhere between 200TB and 700TB.
 
 
 
  On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani us...@platfora.com
 wrote:
 
   All,
   What is the largest input data set y'all have come across that has been
   successfully processed in production using Spark? Ballpark?
  
 



 --

 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
 E: suren.hiraman@velos.io
 W: www.velos.io



Re: Largest input data set observed for Spark.

2014-03-20 Thread Henry Saputra
Reynold, just curious: did you guys run it on AWS?

- Henry

On Thu, Mar 20, 2014 at 11:08 AM, Reynold Xin r...@databricks.com wrote:
 Actually we just ran a job with 70TB+ of compressed data on 28 worker nodes -
 I didn't count the size of the uncompressed data, but I am guessing it is
 somewhere between 200TB and 700TB.



 On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani us...@platfora.com wrote:

 All,
 What is the largest input data set y'all have come across that has been
 successfully processed in production using Spark? Ballpark?



Re: Spark 0.9.1 release

2014-03-20 Thread Bhaskar Dutta
It would be great if SPARK-1101
(https://spark-project.atlassian.net/browse/SPARK-1101), the umbrella issue
for hardening Spark on YARN, could get into 0.9.1.

Thanks,
Bhaskar

On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:

 Hello everyone,

 Since the release of Spark 0.9, we have received a number of important bug
 fixes, and we would like to make a bug-fix release, Spark 0.9.1. We are
 going to cut a release candidate soon, and we would love it if people test
 it out. We have backported several bug fixes into the 0.9 branch and
 updated JIRA accordingly:
 https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
 Please let me know if there are fixes that were not backported but that you
 would like to see in 0.9.1.

 Thanks!

 TD



Re: Spark 0.9.1 release

2014-03-20 Thread Patrick Wendell
Hey Tom,

 I'll pull SPARK-1053 ("Should not require SPARK_YARN_APP_JAR when running
 on YARN") and SPARK-1051 ("On YARN, executors don't doAs the submitting
 user") in. The pyspark one I would consider more of an enhancement, so it
 might not be appropriate for a point release.

Someone recently sent me a personal e-mail reporting some problems
with this. I'll ask them to forward it to you/the dev list. Might be
worth looking into before merging.

  SPARK-1051 ("On YARN, executors don't doAs the submitting user"):
  This means that they can't read/write files that the YARN user doesn't
  have permission to but the submitting user does.

Good call on this one.

- Patrick


Re: Spark 0.9.1 release

2014-03-20 Thread Tom Graves
Thanks for the heads up; I saw that and will make sure it is resolved before
pulling into 0.9. Unless I'm missing something, they should just use
sc.addJar to distribute the jar rather than relying on SPARK_YARN_APP_JAR.
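In code form, the suggestion is roughly this (a sketch; the app name and
jar path are illustrative, not from any real deployment):

import org.apache.spark.SparkContext

// Instead of exporting SPARK_YARN_APP_JAR, ship the jar explicitly.
val sc = new SparkContext("yarn-client", "my-app")
sc.addJar("hdfs:///user/me/my-app-assembly.jar")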

Tom



On Thursday, March 20, 2014 3:31 PM, Patrick Wendell pwend...@gmail.com wrote:
 
Hey Tom,

 I'll pull SPARK-1053 ("Should not require SPARK_YARN_APP_JAR when running
 on YARN") and SPARK-1051 ("On YARN, executors don't doAs the submitting
 user") in. The pyspark one I would consider more of an enhancement, so it
 might not be appropriate for a point release.

Someone recently sent me a personal e-mail reporting some problems
with this. I'll ask them to forward it to you/the dev list. Might be
worth looking into before merging.


  SPARK-1051 ("On YARN, executors don't doAs the submitting user"):
  This means that they can't read/write files that the YARN user doesn't
  have permission to but the submitting user does.

Good call on this one.

- Patrick

Re: new Catalyst/SQL component merged into master

2014-03-20 Thread Michael Armbrust
Hi Everyone,

I'm very excited about merging this new feature into Spark! We have a lot
of cool things in the pipeline, including porting Shark's in-memory
columnar format to Spark SQL, code generation for expression evaluation,
and improved support for complex types in Parquet.

I would love to hear feedback on the interfaces and on what is missing. In
particular, while we have pretty good test coverage for Hive, there has not
been a lot of testing with real Hive deployments, and there is certainly a
lot more work to do. So please test it out, and if there are any missing
features, let me know!
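To make this concrete, here is a minimal sketch of the alpha API based on
the programming guide linked below (untested, and exact names may shift
before the 1.0 release):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sc = new SparkContext("local", "sql-demo")
val sqlContext = new SQLContext(sc)
import sqlContext._  // implicit conversion from RDD to SchemaRDD

// Schema is inferred from the case class.
val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 12)))
people.registerAsTable("people")

// Executing SQL against an RDD:
val adults = sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)

// The same query through the experimental DSL:
val adultsDsl = people.where('age >= 18).select('name)

// Native Parquet support:
people.saveAsParquetFile("people.parquet")
val loaded = parquetFile("people.parquet")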

Michael


On Thu, Mar 20, 2014 at 6:11 PM, Reynold Xin r...@databricks.com wrote:

 Hi All,

 I'm excited to announce a new module in Spark (SPARK-1251). After an
 initial review we've merged this into Spark as an alpha component to be
 included in Spark 1.0. This new component adds some exciting features,
 including:

 - schema-aware RDD programming via an experimental DSL
 - native Parquet support
 - support for executing SQL against RDDs

 The pull request itself contains more information:
 https://github.com/apache/spark/pull/146

 You can also find the documentation for this new component here:
 http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html


 This contribution was led by Michael Armbrust, with work from several
 other contributors whom I'd like to highlight here: Yin Huai, Cheng Lian,
 Andre Schumacher, Timothy Chen, Henry Cook, and Mark Hamstra.


 - Reynold




Please subscribe me

2014-03-20 Thread twinkle sachdeva
Please subscribe me.