There is a Spark Package that gives some alternative distance metrics,
http://spark-packages.org/package/derrickburns/generalized-kmeans-clustering.
I haven't used it myself.
-
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
Hi Dibyendu,
I am not sure I understand completely, but are you suggesting that there is
currently no way to set the checkpoint directory to be in Tachyon?
Thanks
Nikunj
On Fri, Sep 25, 2015 at 11:49 PM, Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:
> Hi,
>
> Recently I was
It is working; we are doing the same thing every day. But the remote server
needs to be able to talk to the ResourceManager.
If you are using spark-submit, you will also need to specify the Hadoop conf
directory in your env variable. Spark would rely on that to locate where
the cluster's resource manager
Hi,
Recently I was working on a PR to use Tachyon as the OFF_HEAP store for Spark
Streaming and make sure Spark Streaming can recover from Driver failure and
recover the blocks from Tachyon.
The motivation for this PR is:
If the Streaming application stores the blocks OFF_HEAP, it may not need
Robineast wrote
> 2) let GraphX supply a null instead
> val graph = Graph(vertices, edges) // vertices found in 'edges' but
> not in 'vertices' will be set to null
Thank you! This method works.
As a follow-up (sorry, I'm new to this; don't know if I should start a new
thread?): if I have
Here is all of my code. My first post had a simplified version. As I post
this, I realize one issue may be that when I convert my Ids to long (I
define a pageHash function to convert string Ids to long), the nodeIds are
no longer the same between the 'vertices' object and the 'edges' object. Do
Hello,
Does anyone have an insight into what could be the issue here?
Thanks
Nikunj
On Fri, Sep 25, 2015 at 10:44 AM, N B wrote:
> Hi Akhil,
>
> I do have 25 partitions being created. I have set
> the spark.default.parallelism property to 25. Batch size is 30 seconds and
Vertices that aren't connected to anything are perfectly valid, e.g.:
import org.apache.spark.graphx._
val vertices = sc.makeRDD(Seq((1L,1),(2L,1),(3L,1)))
val edges = sc.makeRDD(Seq(Edge(1L,2L,1)))
val g = Graph(vertices, edges)
g.vertices.count
gives 3
Not sure why vertices appear to be
Hi Michael,
Thanks for the tip. With dataframe, is it possible to explode some selected
fields in each purchase_items?
Since purchase_items is an array of items and each item has a number of
fields (for example product_id and price), is it possible to just explode
these two fields directly using
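A minimal sketch of what this could look like with the DataFrame API (the
DataFrame name `orders` and the column `order_id` are made-up; the sketch
assumes `purchase_items` is an array of structs):

```scala
import org.apache.spark.sql.functions.explode

// Assumed schema: orders(order_id, purchase_items: array<struct<product_id, price, ...>>)
// explode() turns each array element into its own row; the second select
// then keeps only the two struct fields of interest.
val exploded = orders
  .select(orders("order_id"), explode(orders("purchase_items")).as("item"))
  .select("order_id", "item.product_id", "item.price")
```

Selecting `"item.product_id"` after the explode is what narrows the struct
down to just those fields, rather than carrying the whole item along.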
Hi,
Received the following error when reading an Avro source with Spark 1.5.0
and the com.databricks.spark.avro reader. In the data source, there is one
nested field named "UserActivity.history.activity" and another named
"UserActivity.activity". This seems to be the reason for the exception,
Hi Dibyendu,
Thanks. I believe I understand why it has been an issue using S3 for
checkpoints based on your explanation. But does this limitation apply only
if recovery is needed in case of driver failure?
What if we are not interested in recovery after a driver failure? However,
just for the
I wanted to add that we are not configuring the WAL in our scenario.
Thanks again,
Nikunj
On Sat, Sep 26, 2015 at 11:35 AM, N B wrote:
> Hi Dibyendu,
>
> Thanks. I believe I understand why it has been an issue using S3 for
> checkpoints based on your explanation. But does
Dear All,
I have a small Spark cluster for academic purposes and would like it to be
open to accept jobs from a set of friends,
where all of us can submit and queue up jobs.
How is that possible? What is the solution to this problem? Any blog/sw/
link will be very helpful.
Thanks
~Manish
Related thread:
http://search-hadoop.com/m/q3RTt31EUSYGOj82
Please see:
https://spark.apache.org/docs/latest/security.html
FYI
On Sat, Sep 26, 2015 at 4:03 PM, manish ranjan
wrote:
> Dear All,
>
> I have a small spark cluster for academia purpose and would like it to be
Have you checked to make sure that your hashing function doesn't have any
collisions? Node ids have to be unique; so, if you're getting repeated ids
out of your hasher, it could certainly lead to dropping of duplicate ids,
and therefore loss of vertices.
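A quick way to test this suggestion locally, sketched in plain Scala (the
`pageHash` stand-in below deliberately uses `String.hashCode`, which is not
collision-free; the poster's actual function may differ):

```scala
// Stand-in for the poster's pageHash; String.hashCode is NOT collision-free.
def pageHash(s: String): Long = s.hashCode.toLong

// "FB" and "Ea" are a classic hashCode collision (both hash to 2236).
val ids = Seq("pageA", "pageB", "FB", "Ea")
val collisions = ids.groupBy(pageHash).filter { case (_, names) => names.size > 1 }
collisions.foreach { case (h, names) =>
  println(s"hash $h is shared by: ${names.mkString(", ")}")
}
```

On an RDD the same check amounts to comparing
`ids.map(pageHash).distinct.count` with `ids.distinct.count`; if the counts
differ, distinct string IDs are being hashed to the same Long and GraphX
will silently merge those vertices.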
On Sat, Sep 26, 2015 at 10:37 AM JJ
Check out the spark-testing-base project. I haven't tried it yet, but it
looks good:
http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
That means only a single receiver is doing all the work, so the data is
local to your N1 machine and hence all tasks are executed there. Now to
get the data to N2, you need to either do a .repartition or set a
MEMORY*_2 StorageLevel, where the _2 suffix enables data replication, and
I guess that
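A hedged sketch of both options (assuming a StreamingContext `ssc` and a
socket receiver; the host name, port, and the partition count of 25 from
earlier in the thread are placeholders):

```scala
import org.apache.spark.storage.StorageLevel

// Option 1: replicate each received block to a second executor,
// so tasks can also be scheduled on N2 (note the _2 suffix).
val stream = ssc.socketTextStream("n1-host", 9999,
  StorageLevel.MEMORY_AND_DISK_SER_2)

// Option 2: shuffle the received data across the cluster
// before the heavy processing stages.
val spread = stream.repartition(25)
```

Replication avoids a shuffle but doubles the memory used for received
blocks, while repartition costs a shuffle per batch but spreads work over
every executor.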
Thanks
Best Regards
On Sun, Sep 27, 2015 at 12:07 AM, N B wrote:
> Hello,
>
> Does anyone have an insight into what could be the issue here?
>
> Thanks
> Nikunj
>
>
> On Fri, Sep 25, 2015 at 10:44 AM, N B wrote:
>
>> Hi Akhil,
>>
>> I