Re: Distance metrics in KMeans

2015-09-26 Thread Robineast
There is a Spark Package that provides some alternative distance metrics: http://spark-packages.org/package/derrickburns/generalized-kmeans-clustering. I haven't used it myself.
- Robin East
Spark GraphX in Action
Michael Malak and Robin East
Manning Publications Co.
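
For reference, MLlib's built-in KMeans is hard-wired to squared Euclidean distance, which is why a package like the one above is needed for other metrics. A minimal sketch of the stock API (data values are made up; `sc` is an existing SparkContext):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Stock MLlib KMeans: squared Euclidean distance only.
    val data = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
      Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)))
    val model = KMeans.train(data, 2, 20)  // k = 2, 20 iterations
    model.clusterCenters.foreach(println)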

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-09-26 Thread N B
Hi Dibyendu, I am not sure I understand completely, but are you suggesting that there is currently no way to place the checkpoint directory in Tachyon? Thanks, Nikunj
On Fri, Sep 25, 2015 at 11:49 PM, Dibyendu Bhattacharya <dibyendu.bhattach...@gmail.com> wrote:
> Hi,
>
> Recently I was
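
For what it's worth, the checkpoint directory is just a path handed to the streaming context, so pointing it at a Tachyon URI is syntactically possible; whether recovery then works reliably is exactly what this thread is about. A sketch (host, port, and path are assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("tachyon-checkpoint-sketch")
    val ssc = new StreamingContext(conf, Seconds(30))
    // Any Hadoop-compatible filesystem URI is accepted here,
    // including a Tachyon one (assumed host:port and path).
    ssc.checkpoint("tachyon://tachyonhost:19998/spark/checkpoints")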

Re: How to properly set conf/spark-env.sh for spark to run on yarn

2015-09-26 Thread Gavin Yue
It is working; we are doing the same thing every day. But the remote server needs to be able to talk to the ResourceManager. If you are using spark-submit, you will also need to specify the Hadoop conf directory in your environment variables. Spark would rely on that to locate where the cluster's resource manager
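
A minimal conf/spark-env.sh sketch of that setup (the paths are assumptions; point them at wherever your cluster's Hadoop/YARN client configuration actually lives):

    # conf/spark-env.sh -- assumed locations, adjust to your cluster
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export YARN_CONF_DIR=/etc/hadoop/conf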

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-09-26 Thread Dibyendu Bhattacharya
Hi, Recently I was working on a PR to use Tachyon as an OFF_HEAP store for Spark Streaming and to make sure Spark Streaming can recover from driver failure and recover the blocks from Tachyon. The motivation for this PR is: if a Streaming application stores the blocks OFF_HEAP, it may not need
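
A sketch of the OFF_HEAP piece being described, assuming Spark 1.5-era settings (the external block store property name and the Tachyon address are assumptions; check your version's configuration docs):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("offheap-stream-sketch")
      .set("spark.externalBlockStore.url", "tachyon://tachyonhost:19998")
    val ssc = new StreamingContext(conf, Seconds(30))
    // Received blocks go to the external (Tachyon) store
    // instead of executor heap memory.
    val lines = ssc.socketTextStream("datahost", 9999, StorageLevel.OFF_HEAP)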

Re: GraphX create graph with multiple node attributes

2015-09-26 Thread JJ
Robineast wrote:
> 2) let GraphX supply a null instead
> val graph = Graph(vertices, edges) // vertices found in 'edges' but not in 'vertices' will be set to null
Thank you! This method works. As a follow-up (sorry, I'm new to this and don't know if I should start a new thread): if I have

Re: GraphX create graph with multiple node attributes

2015-09-26 Thread JJ
Here is all of my code. My first post had a simplified version. As I post this, I realize one issue may be that when I convert my Ids to long (I define a pageHash function to convert string Ids to long), the nodeIds are no longer the same between the 'vertices' object and the 'edges' object. Do
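
One way to keep a string-to-Long id consistent across both RDDs is to derive it from a stable digest rather than anything JVM- or partition-dependent. A hypothetical pageHash sketch (not the poster's actual function):

    import java.nio.ByteBuffer
    import java.security.MessageDigest

    // Hypothetical: first 8 bytes of an MD5 digest as a stable Long id.
    // The same string always maps to the same Long, in any RDD, on any node.
    def pageHash(s: String): Long = {
      val digest = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"))
      ByteBuffer.wrap(digest).getLong
    }

    // Must hold for ids to line up between 'vertices' and 'edges':
    assert(pageHash("pageA") == pageHash("pageA"))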

Re: Weird worker usage

2015-09-26 Thread N B
Hello, Does anyone have an insight into what could be the issue here? Thanks Nikunj
On Fri, Sep 25, 2015 at 10:44 AM, N B wrote:
> Hi Akhil,
>
> I do have 25 partitions being created. I have set the spark.default.parallelism property to 25. Batch size is 30 seconds and

Re: GraphX create graph with multiple node attributes

2015-09-26 Thread Robineast
Vertices that aren't connected to anything are perfectly valid, e.g.:

    import org.apache.spark.graphx._
    val vertices = sc.makeRDD(Seq((1L,1),(2L,1),(3L,1)))
    val edges = sc.makeRDD(Seq(Edge(1L,2L,1)))
    val g = Graph(vertices, edges)

g.vertices.count gives 3. Not sure why vertices appear to be

Re: Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-26 Thread Jerry Lam
Hi Michael, Thanks for the tip. With DataFrames, is it possible to explode some selected fields in each purchase_items? Since purchase_items is an array of items, and each item has a number of fields (for example product_id and price), is it possible to just explode these two fields directly using
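
A sketch of what that could look like with the DataFrame API, assuming the schema described in the thread (purchase_items as an array of structs carrying product_id and price; `df` is the source DataFrame):

    import org.apache.spark.sql.functions.explode

    // Explode the array into one row per item, then pull just the
    // two wanted fields out of each struct.
    val exploded = df
      .select(explode(df("purchase_items")).as("item"))
      .select("item.product_id", "item.price")
    exploded.show()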

Problem with multiple fields with same name in Avro

2015-09-26 Thread Anders Arpteg
Hi, Received the following error when reading an Avro source with Spark 1.5.0 and the com.databricks.spark.avro reader. In the data source, there is one nested field named "UserActivity.history.activity" and another named "UserActivity.activity". This seems to be the reason for the exception,
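
For context, a sketch of the read path in question (the path is an assumption); with two fields sharing the leaf name "activity", the printed schema is where the clash shows up:

    // Reading an Avro source with the spark-avro package (assumed path).
    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("/data/user_activity.avro")
    // Both UserActivity.activity and UserActivity.history.activity
    // appear here, differing only by nesting.
    df.printSchema()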

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-09-26 Thread N B
Hi Dibyendu, Thanks. I believe I understand why it has been an issue using S3 for checkpoints, based on your explanation. But does this limitation apply only if recovery is needed in case of driver failure? What if we are not interested in recovery after a driver failure? However, just for the

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-09-26 Thread N B
I wanted to add that we are not configuring the WAL in our scenario. Thanks again, Nikunj
On Sat, Sep 26, 2015 at 11:35 AM, N B wrote:
> Hi Dibyendu,
>
> Thanks. I believe I understand why it has been an issue using S3 for checkpoints based on your explanation. But does
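
For reference, the receiver write-ahead log being discussed is off by default and switched on with a single flag; it only does anything useful together with a checkpoint directory on a reliable filesystem. A sketch:

    import org.apache.spark.SparkConf

    // Enable the WAL for receiver-based streams; received data is
    // also written to the checkpoint directory before being acked.
    val conf = new SparkConf()
      .setAppName("wal-sketch")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")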

queue up jobs in spark cluster

2015-09-26 Thread manish ranjan
Dear All, I have a small Spark cluster for academic purposes and would like to open it up so a set of friends can all submit and queue up jobs. How is that possible? What is the solution to this problem? Any blog/software link will be very helpful. Thanks ~Manish

Re: queue up jobs in spark cluster

2015-09-26 Thread Ted Yu
Related thread: http://search-hadoop.com/m/q3RTt31EUSYGOj82 Please see: https://spark.apache.org/docs/latest/security.html FYI
On Sat, Sep 26, 2015 at 4:03 PM, manish ranjan wrote:
> Dear All,
>
> I have a small Spark cluster for academic purposes and would like to open it
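
On a standalone cluster, one common approach is to cap each application's core usage so the default FIFO scheduling across applications leaves room for, or queues, other submissions. A sketch (the value 4 is an assumption; size it to your cluster):

    import org.apache.spark.SparkConf

    // Cap this app so other users' jobs can be scheduled alongside it.
    // Apps asking for more cores than are free simply wait in line.
    val conf = new SparkConf()
      .setAppName("shared-cluster-app")
      .set("spark.cores.max", "4")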

Re: GraphX create graph with multiple node attributes

2015-09-26 Thread Nick Peterson
Have you checked to make sure that your hashing function doesn't have any collisions? Node ids have to be unique, so if you're getting repeated ids out of your hasher, it could certainly lead to duplicate ids being dropped, and therefore loss of vertices. On Sat, Sep 26, 2015 at 10:37 AM JJ
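
A quick collision check along those lines: if the distinct hashed ids come out fewer than the distinct raw ids, the hasher is collapsing vertices. A sketch, assuming `rawIds` is an RDD[String] of the original string ids and `pageHash` is the poster's string-to-Long function:

    val distinctRaw = rawIds.distinct().count()
    val distinctHashed = rawIds.map(pageHash).distinct().count()
    // Any shortfall here means two different strings hashed to one id.
    if (distinctHashed < distinctRaw) {
      println(s"Hash collisions: $distinctRaw raw ids -> $distinctHashed hashed ids")
    }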

Re: What are best practices from Unit Testing Spark Code?

2015-09-26 Thread ehrlichja
Check out the spark-testing-base project. I haven't tried it yet, but it looks good: http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
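
A minimal sketch of what a test looks like with that project, assuming its SharedSparkContext trait (coordinates and trait name per the project's README; the suite itself is invented):

    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.scalatest.FunSuite

    // SharedSparkContext provides a SparkContext `sc`
    // reused across the tests in the suite.
    class WordCountSuite extends FunSuite with SharedSparkContext {
      test("counting words") {
        val counts = sc.parallelize(Seq("a", "b", "a")).countByValue()
        assert(counts("a") == 2)
      }
    }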

Re: Weird worker usage

2015-09-26 Thread Akhil Das
That means only a single receiver is doing all the work, so the data is local to your N1 machine and all tasks are executed there. To get the data onto N2, you need to either do a .repartition or set a StorageLevel of MEMORY_*_2, where the _2 suffix enables data replication, and I guess that
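
A sketch of the two options being suggested, assuming a receiver-based DStream (`ssc`, the host, and the partition count are placeholders):

    import org.apache.spark.storage.StorageLevel

    // Option 1: replicate received blocks so tasks can run on either node.
    val replicated = ssc.socketTextStream("n1host", 9999,
      StorageLevel.MEMORY_AND_DISK_SER_2)

    // Option 2: explicitly shuffle the received data across the cluster.
    val spread = replicated.repartition(25)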

Re: Weird worker usage

2015-09-26 Thread Akhil Das
That means only
Thanks
Best Regards
On Sun, Sep 27, 2015 at 12:07 AM, N B wrote:
> Hello,
>
> Does anyone have an insight into what could be the issue here?
>
> Thanks
> Nikunj
>
> On Fri, Sep 25, 2015 at 10:44 AM, N B wrote:
>> Hi Akhil,
>>
>> I