Re: Contributing to Spark

2014-04-08 Thread Michael Ernest
Ha ha! Nice try, sheepherder! ;-)


On Tue, Apr 8, 2014 at 12:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Shh, maybe I really wanted people to fix that one issue.

 On Apr 8, 2014, at 9:34 AM, Aaron Davidson ilike...@gmail.com wrote:

  Matei's link seems to point to a specific starter project as part of the
  starter list, but here is the list itself:
 
 https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)
 
 
  On Mon, Apr 7, 2014 at 10:22 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
 
  I'd suggest looking for the issues labeled Starter on JIRA. You can find
  them here:
 
 https://issues.apache.org/jira/browse/SPARK-1438?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)
 
  Matei
 
  On Apr 7, 2014, at 9:45 PM, Mukesh G muk...@gmail.com wrote:
 
  Hi Sujeet,
 
  Thanks. I went through the website and it looks great. Is there a list of
  items that I can choose from, for contribution?
 
  Thanks
 
  Mukesh
 
 
  On Mon, Apr 7, 2014 at 10:14 PM, Sujeet Varakhedi
  svarakh...@gopivotal.com wrote:
 
  This is a good place to start:
 
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
 
  Sujeet
 
 
  On Mon, Apr 7, 2014 at 9:20 AM, Mukesh G muk...@gmail.com wrote:
 
  Hi,
 
   How do I contribute to Spark and its associated projects?
 
  Appreciate the help...
 
  Thanks
 
  Mukesh
 
 
 
 




-- 
Michael Ernest
Sr. Solutions Consultant
West Coast


Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Michael Ernest
You can configure your sinks to write to one or more Avro sources in a
load-balanced configuration.

https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors

mfe


On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp
christo...@christophe.cc wrote:

 Hi,

 From my testing of Spark Streaming with Flume, it seems that only one of
 the Spark worker nodes runs a Flume Avro RPC server to receive messages at
 any given time, as opposed to every Spark worker running an Avro RPC server
 to receive messages. Is this the case? Our use case would benefit from
 balancing the load across Workers because of our volume of messages. We
 would be using a load balancer in front of the Spark workers running the
 Avro RPC servers, essentially round-robinning the messages across all of
 them.

 If this is something that is currently not supported, I'd be interested in
 contributing to the code to make it happen.

 - Christophe




-- 
Michael Ernest
Sr. Solutions Consultant
West Coast


Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Michael Ernest
I don't see why not. If one were doing something similar with straight
Flume, you'd start an agent on each node where you want to receive Avro/RPC
events. In the absence of clearer insight into your use case, I'm puzzling
just a little over why it's necessary for each Worker to be its own
receiver, but there's no real objection or concern to fuel the puzzlement,
just curiosity.


On Mon, Apr 7, 2014 at 4:16 PM, Christophe Clapp
christo...@christophe.cc wrote:

 Could it be as simple as just changing FlumeUtils to accept a list of
 host/port number pairs to start the RPC servers on?



 On 4/7/14, 12:58 PM, Christophe Clapp wrote:

 Based on the source code here:
 https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala

 It looks like in its current version, FlumeUtils does not support
 starting an Avro RPC server on more than one worker.

 - Christophe

 On 4/7/14, 12:23 PM, Michael Ernest wrote:

 You can configure your sinks to write to one or more Avro sources in a
 load-balanced configuration.

 https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors

 mfe


 On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp
 christo...@christophe.cc wrote:

  Hi,

  From my testing of Spark Streaming with Flume, it seems that only one of
  the Spark worker nodes runs a Flume Avro RPC server to receive messages at
  any given time, as opposed to every Spark worker running an Avro RPC
  server to receive messages. Is this the case? Our use case would benefit
  from balancing the load across Workers because of our volume of messages.
  We would be using a load balancer in front of the Spark workers running
  the Avro RPC servers, essentially round-robinning the messages across all
  of them.

  If this is something that is currently not supported, I'd be interested in
  contributing to the code to make it happen.

 - Christophe








-- 
Michael Ernest
Sr. Solutions Consultant
West Coast