Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-09-26 Thread N B
Hi Dibyendu, I am not sure I understand completely, but are you suggesting that there is currently no way to place the checkpoint directory in Tachyon? Thanks, Nikunj On Fri, Sep 25, 2015 at 11:49 PM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > Hi, > > Recently I was wor

Re: GraphX create graph with multiple node attributes

2015-09-26 Thread Robineast
The 3rd parameter to Graph() is a default vertex attribute, supplied for any vertex that appears in the edges RDD but not in the vertices RDD. Two choices: 1) supply a "null" VertexAttributes: val graph = Graph(vertices, edges, VertexAttributes(...)) // fill in the attributes as makes sense 2) let
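
A minimal sketch of option 1, assuming a spark-shell style sc and a hypothetical VertexAttributes case class (the thread's actual class isn't shown in full):

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    // Hypothetical attribute type standing in for the thread's VertexAttributes
    case class VertexAttributes(name: String, score: Int)

    val vertices: RDD[(VertexId, VertexAttributes)] = sc.parallelize(Seq(
      (1L, VertexAttributes("a", 1)),
      (2L, VertexAttributes("b", 2))))
    // Edge 2L -> 3L references vertex 3L, which is missing from 'vertices'
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))

    // The 3rd argument is the default attribute used for the missing vertex 3L
    val graph = Graph(vertices, edges, VertexAttributes("unknown", 0))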

Re: Distance metrics in KMeans

2015-09-26 Thread Robineast
There is a Spark Package that provides some alternative distance metrics: http://spark-packages.org/package/derrickburns/generalized-kmeans-clustering. I have not used it myself. - Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/s
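
Not knowing that package's exact API, here is only a rough sketch of the problem it addresses: MLlib's built-in KMeans optimizes squared Euclidean distance, so assigning points under a different metric (cosine here, purely as an example) has to be done manually against the trained centers:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0), Vectors.dense(0.9, 0.1), Vectors.dense(0.0, 1.0)))
    val model = KMeans.train(data, 2, 20) // k = 2, maxIterations = 20
    val centers = model.clusterCenters

    // Example custom metric: cosine distance
    def cosineDist(a: Vector, b: Vector): Double = {
      val (x, y) = (a.toArray, b.toArray)
      val dot = x.zip(y).map { case (u, v) => u * v }.sum
      val norms = math.sqrt(x.map(u => u * u).sum) * math.sqrt(y.map(v => v * v).sum)
      1.0 - dot / norms
    }

    // Re-assign each point to its nearest center under the custom metric
    val assignments = data.map { p =>
      centers.zipWithIndex.minBy { case (c, _) => cosineDist(p, c) }._2
    }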

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-09-26 Thread Dibyendu Bhattacharya
In Spark Streaming, the checkpoint directory is used for two purposes: 1. Metadata checkpointing 2. Data checkpointing If you enable the WAL to recover from driver failure, Spark Streaming will also write the received blocks to the WAL, which is stored in the checkpoint directory. For a streaming solution to recover
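
A minimal configuration sketch; the Tachyon master host below is hypothetical (19998 is Tachyon's default port):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("wal-example")
      // Write received blocks to the WAL (off by default)
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(30))
    // Metadata checkpoints, data checkpoints and the WAL all live under this directory
    ssc.checkpoint("tachyon://tachyon-master:19998/spark/checkpoint")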

Problem with multiple fields with same name in Avro

2015-09-26 Thread Anders Arpteg
Hi, I received the following error when reading an Avro source with Spark 1.5.0 and the com.databricks.spark.avro reader. In the data source, there is one nested field named "UserActivity.history.activity" and another named "UserActivity.activity". This seems to be the reason for the exception, sinc
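
If the load itself succeeds and the exception only surfaces when the ambiguous name is resolved, one thing worth trying is selecting each nested field under an explicit alias. This is an unverified sketch against a hypothetical path and schema:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("/path/to/user_activity") // hypothetical path

    df.printSchema() // locate the two conflicting 'activity' fields

    // Disambiguate by aliasing each nested field explicitly
    val flattened = df.select(
      df("activity").as("activity"),
      df("history.activity").as("history_activity"))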

Re: GraphX create graph with multiple node attributes

2015-09-26 Thread JJ
Robineast wrote > 2) let GraphX supply a null instead > val graph = Graph(vertices, edges) // vertices found in 'edges' but > not in 'vertices' will be set to null Thank you! This method works. As a follow-up (sorry, I'm new to this and don't know if I should start a new thread?): if I have ve

Re: GraphX create graph with multiple node attributes

2015-09-26 Thread Robineast
Vertices that aren't connected to anything are perfectly valid, e.g.:

    import org.apache.spark.graphx._
    val vertices = sc.makeRDD(Seq((1L,1),(2L,1),(3L,1)))
    val edges = sc.makeRDD(Seq(Edge(1L,2L,1)))
    val g = Graph(vertices, edges)
    g.vertices.count

gives 3. Not sure why vertices appear to be droppi

Re: GraphX create graph with multiple node attributes

2015-09-26 Thread JJ
Here is all of my code; my first post had a simplified version. As I post this, I realize one issue may be that when I convert my Ids to Long (I define a pageHash function to convert string Ids to Long), the node Ids are no longer the same between the 'vertices' object and the 'edges' object. Do you
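
A sketch of the conversion being described; pageHash below is a hypothetical stand-in for the poster's function, and the essential point is that the exact same function is applied to both the vertex and edge RDDs:

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD
    import scala.util.hashing.MurmurHash3

    // Hypothetical stand-in for the poster's pageHash: String id -> Long vertex id
    def pageHash(id: String): VertexId =
      (MurmurHash3.stringHash(id).toLong << 32) |
        (MurmurHash3.stringHash(id.reverse).toLong & 0xFFFFFFFFL)

    val rawVertices = sc.parallelize(Seq(("pageA", 1), ("pageB", 2)))
    val rawEdges = sc.parallelize(Seq(("pageA", "pageB")))

    // Apply the SAME hash on both sides so the ids line up
    val vertices: RDD[(VertexId, Int)] =
      rawVertices.map { case (id, attr) => (pageHash(id), attr) }
    val edges = rawEdges.map { case (src, dst) => Edge(pageHash(src), pageHash(dst), 1) }
    val graph = Graph(vertices, edges)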

Re: GraphX create graph with multiple node attributes

2015-09-26 Thread Nick Peterson
Have you checked to make sure that your hashing function doesn't have any collisions? Node ids have to be unique; so, if you're getting repeated ids out of your hasher, it could certainly lead to dropping of duplicate ids, and therefore loss of vertices. On Sat, Sep 26, 2015 at 10:37 AM JJ wrote
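
One quick way to test the collision hypothesis, reusing the hypothetical pageHash and rawVertices from the sketch in the previous message (not the poster's actual code):

    // Group original string ids by their hash; any bucket with more
    // than one distinct id is a collision that will drop vertices.
    val collisions = rawVertices
      .map { case (id, _) => (pageHash(id), id) }
      .distinct()
      .groupByKey()
      .filter { case (_, ids) => ids.size > 1 }

    println(s"colliding buckets: ${collisions.count()}")
    collisions.take(10).foreach(println)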

Re: Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-26 Thread Jerry Lam
Hi Michael, Thanks for the tip. With DataFrames, is it possible to explode only some selected fields in each purchase_items? Since purchase_items is an array of items, and each item has a number of fields (for example product_id and price), is it possible to just explode these two fields directly using da
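
In Spark 1.5 DataFrames this can be sketched with the explode function followed by a nested-field select; 'df' and the field names below are assumed from earlier in the thread:

    import org.apache.spark.sql.functions.explode

    // One output row per element of the purchase_items array
    val items = df.select(explode(df("purchase_items")).as("item"))

    // Keep only the two fields of interest from the exploded struct
    val result = items.select("item.product_id", "item.price")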

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-09-26 Thread N B
Hi Dibyendu, Thanks. I believe I understand why it has been an issue to use S3 for checkpoints, based on your explanation. But does this limitation apply only if recovery is needed in case of driver failure? What if we are not interested in recovery after a driver failure? However, just for the pur

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-09-26 Thread N B
I wanted to add that we are not configuring the WAL in our scenario. Thanks again, Nikunj On Sat, Sep 26, 2015 at 11:35 AM, N B wrote: > Hi Dibyendu, > > Thanks. I believe I understand why it has been an issue using S3 for > checkpoints based on your explanation. But does this limitation apply

Re: Weird worker usage

2015-09-26 Thread N B
Hello, Does anyone have an insight into what could be the issue here? Thanks Nikunj On Fri, Sep 25, 2015 at 10:44 AM, N B wrote: > Hi Akhil, > > I do have 25 partitions being created. I have set > the spark.default.parallelism property to 25. Batch size is 30 seconds and > block interval is 1

Re: Weird worker usage

2015-09-26 Thread Akhil Das
That means only … Thanks Best Regards On Sun, Sep 27, 2015 at 12:07 AM, N B wrote: > Hello, > > Does anyone have an insight into what could be the issue here? > > Thanks > Nikunj > > > On Fri, Sep 25, 2015 at 10:44 AM, N B wrote: > >> Hi Akhil, >> >> I do have 25 partitions being created. I have

Re: Weird worker usage

2015-09-26 Thread Akhil Das
That means only a single receiver is doing all the work, hence the data is local to your N1 machine and all tasks are executed there. Now, to get the data to N2, you need to either do a .repartition or set the StorageLevel to a MEMORY*_2 level, where _2 enables data replication, and I guess that wil
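
A sketch of both suggestions, with a hypothetical socket receiver (MEMORY_AND_DISK_SER_2 is one of the replicated *_2 levels meant above):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(30))

    // The _2 suffix replicates received blocks to a second node
    val stream = ssc.socketTextStream("n1-host", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)

    // Alternatively, spread the received blocks across the cluster before heavy work
    val spread = stream.repartition(25)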

Re: What are best practices from Unit Testing Spark Code?

2015-09-26 Thread ehrlichja
Check out the spark-testing-base project. I haven't tried it yet, but it looks good: http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
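
A minimal test sketch based on spark-testing-base's SharedSparkContext trait (untried here as well; names per the project's documentation):

    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.scalatest.FunSuite

    class WordCountSuite extends FunSuite with SharedSparkContext {
      test("counts words") {
        // SharedSparkContext creates and tears down 'sc' for the suite
        val counts = sc.parallelize(Seq("a", "b", "a"))
          .map((_, 1))
          .reduceByKey(_ + _)
          .collectAsMap()
        assert(counts("a") === 2)
      }
    }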

Queue up jobs in spark cluster

2015-09-26 Thread manish ranjan
Dear All, I have a small Spark cluster for academic purposes and would like it to be open to accept jobs from a set of friends, where all of us can submit and queue up jobs. How is that possible? What is the solution to this problem? Any blog/software link would be very helpful. Thanks ~Manish

Re: Queue up jobs in spark cluster

2015-09-26 Thread Ted Yu
Related thread: http://search-hadoop.com/m/q3RTt31EUSYGOj82 Please see: https://spark.apache.org/docs/latest/security.html FYI On Sat, Sep 26, 2015 at 4:03 PM, manish ranjan wrote: > Dear All, > > I have a small Spark cluster for academic purposes and would like it to be > open to accept jobs f
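
If everyone submits through one shared long-running application, concurrent jobs can at least be scheduled fairly; a minimal sketch from the standard scheduling configuration (the pool name and file path below are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shared-cluster")
      // Schedule concurrent jobs fairly instead of FIFO
      .set("spark.scheduler.mode", "FAIR")
      // Optional: pool weights/minShare defined in an XML file
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

    val sc = new SparkContext(conf)
    // Jobs submitted from this thread go to the named pool
    sc.setLocalProperty("spark.scheduler.pool", "friends")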