Spark-Kafka Connector issue

2015-09-26 Thread Ratika Prasad
Hi All, I am trying out Spark Streaming, reading messages from Kafka topics that would later be turned into streams as below... I have Kafka set up on a VM and the topics created; however, when I try to run the program below from my Spark VM I get an error even though the kafk

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-26 Thread robineast
+1 build/mvn clean package -DskipTests -Pyarn -Phadoop-2.6 OK Basic graph tests Load graph using edgeListFile...SUCCESS Run PageRank...SUCCESS Minimum Spanning Tree Algorithm Run basic Minimum Spanning Tree algorithm...SUCCESS Run Minimum Spanning Tree taxonomy creation...SUCCESS -- Vi

Re: treeAggregate timing / SGD performance with miniBatchFraction < 1

2015-09-26 Thread Mike Hynes
That is an interesting point; I run the driver as a background process on the master node so that I can still pipe the stdout/stderr filestreams to the (network) filesystem. I should mention that the master is connected to the slaves with a 10 Gb link on the same managed switch that the slaves use.

Re: treeAggregate timing / SGD performance with miniBatchFraction < 1

2015-09-26 Thread Evan R. Sparks
Off the top of my head, I'm not sure, but it looks like virtually all the extra time between each stage is accounted for with T_{io} in your plot, which I'm guessing is time spent communicating results over the network? Is your driver running on the master or is it on a different node? If you look

treeAggregate timing / SGD performance with miniBatchFraction < 1

2015-09-26 Thread Mike Hynes
Hi Evan, (I just realized my initial email was a reply to the wrong thread; I'm very sorry about this). Thanks for your email, and your thoughts on the sampling. That the gradient computations are essentially the cost of a pass through each element of the partition makes sense, especially given t

Re: RDD API patterns

2015-09-26 Thread Evan R. Sparks
Mike, I believe the reason you're seeing near-identical performance on the gradient computations is twofold: 1) Gradient computations for GLM models are computationally pretty cheap from a FLOPs/byte read perspective. They are essentially a BLAS "gemv" call in the dense case, which is well known to

Re: RDD API patterns

2015-09-26 Thread Mike Hynes
Hello Devs, This email concerns some timing results for a treeAggregate in computing a (stochastic) gradient over an RDD of labelled points, as is currently done in the MLlib optimization routine for SGD. In SGD, the underlying RDD is downsampled by a fraction f \in (0,1], and the subgradients ov

Re: ClassCastException using DataFrame only when num-executors > 2 ...

2015-09-26 Thread Olivier Girardot
sorry for the delay, yes still. I'm still trying to figure out if it comes from bad data and trying to isolate the bug itself... 2015-09-11 0:28 GMT+02:00 Reynold Xin : > Does this still happen on 1.5.0 release? > > > On Mon, Aug 31, 2015 at 9:31 AM, Olivier Girardot > wrote: > >> tested now aga

Re: RFC: packaging Spark without assemblies

2015-09-26 Thread Steve Loughran
> On 25 Sep 2015, at 19:11, Marcelo Vanzin wrote: > > - People who ship the assembly with their application. As Matei > suggested (and I agree), that is kinda weird. But currently that is > the easiest way to embed Spark and get, for example, the YARN backend > working. There are ways around tha

Re: How to get the HDFS path for each RDD

2015-09-26 Thread Fengdong Yu
Anchit, please ignore my input. You are right. Thanks. > On Sep 26, 2015, at 17:27, Fengdong Yu wrote: > > Hi Anchit, > > this is not what I expected, because you specified the HDFS directory in your > code. > I've solved it like this: > >val text = sc.hadoopFile(Args.input, >

Re: How to get the HDFS path for each RDD

2015-09-26 Thread Fengdong Yu
Hi Anchit, this is not what I expected, because you specified the HDFS directory in your code. I've solved it like this: val text = sc.hadoopFile(Args.input, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 2) val hadoopRdd = text.asInstanceOf[HadoopRDD[