Hi Dibyendu,
I am not sure I understand completely, but are you suggesting that there is
currently no way to place the checkpoint directory in Tachyon?
Thanks
Nikunj
On Fri, Sep 25, 2015 at 11:49 PM, Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:
> Hi,
>
> Recently I was wor
The 3rd parameter to Graph() is a default vertex attribute to be supplied if
there are edges with vertices that don't appear in the vertices RDD.
2 choices:
1) supply a "null" VertexAttributes:
val graph = Graph(vertices, edges, VertexAttributes(...)) // fill in
the attributes as makes sense
2) let GraphX supply a null instead:
val graph = Graph(vertices, edges) // vertices found in 'edges' but
not in 'vertices' will be set to null
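For option 1, a runnable sketch (the toy data and the "missing" default are
illustrative, not from the thread):

import org.apache.spark.graphx._

// The edge references vertex 3L, which is absent from 'vertices'; the third
// argument supplies the attribute it gets instead of null.
val vertices = sc.makeRDD(Seq((1L, "a"), (2L, "b")))
val edges = sc.makeRDD(Seq(Edge(1L, 3L, 1)))
val graph = Graph(vertices, edges, defaultVertexAttr = "missing")
graph.vertices.collect.foreach(println) // (3,missing) appears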
There is a Spark Package that provides some alternative distance metrics:
http://spark-packages.org/package/derrickburns/generalized-kmeans-clustering.
I haven't used it myself.
-
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/s
In Spark Streaming, the checkpoint directory is used for two purposes:
1. Metadata checkpointing
2. Data checkpointing
If you enable the WAL to recover from driver failure, Spark Streaming will
also write the received blocks to the WAL, which is stored in the checkpoint
directory.
For a streaming solution to recover
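A minimal sketch of that setup, assuming an HDFS checkpoint path and a
30-second batch interval (both placeholders), with the WAL enabled via the
standard property:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("checkpoint-demo")
  // write received blocks to a WAL under the checkpoint directory
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint("hdfs:///spark/checkpoints") // metadata + data checkpointing
  // ... define input DStreams and transformations here ...
  ssc
}

// On driver restart, rebuild from the checkpoint if one exists.
val ssc = StreamingContext.getOrCreate("hdfs:///spark/checkpoints", createContext _)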
Hi,
Received the following error when reading an Avro source with Spark 1.5.0
and the com.databricks.spark.avro reader. In the data source, there is one
nested field named "UserActivity.history.activity" and another named
"UserActivity.activity". This seems to be the reason for the execption,
sinc
Robineast wrote
> 2) let GraphX supply a null instead
> val graph = Graph(vertices, edges) // vertices found in 'edges' but
> not in 'vertices' will be set to null
Thank you! This method works.
As a follow-up (sorry, I'm new to this and don't know if I should start a new
thread): if I have ve
Vertices that aren't connected to anything are perfectly valid e.g.
import org.apache.spark.graphx._
val vertices = sc.makeRDD(Seq((1L,1),(2L,1),(3L,1)))
val edges = sc.makeRDD(Seq(Edge(1L,2L,1)))
val g = Graph(vertices, edges)
g.vertices.count
gives 3
Not sure why vertices appear to be dropping
Here is all of my code. My first post had a simplified version. As I post
this, I realize one issue may be that when I convert my ids to Long (I
define a pageHash function to convert string ids to Long), the node ids are
no longer the same between the 'vertices' object and the 'edges' object. Do
you
Have you checked to make sure that your hashing function doesn't have any
collisions? Node ids have to be unique; so, if you're getting repeated ids
out of your hasher, it could certainly lead to dropping of duplicate ids,
and therefore loss of vertices.
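A quick way to check for collisions, as a sketch (pageHash here is a
hypothetical stand-in for the poster's function; any String => Long mapping
can be tested the same way):

import scala.util.hashing.MurmurHash3

// Hypothetical stand-in; note that a 32-bit hash widened to Long keeps the
// 32-bit collision rate, so a true 64-bit hash is safer for large graphs.
def pageHash(s: String): Long = MurmurHash3.stringHash(s).toLong

val ids = sc.makeRDD(Seq("pageA", "pageB", "pageC"))
// Group the string ids by hashed value; any group with more than one
// distinct id is a collision that would merge (and drop) vertices.
val collisions = ids.map(s => (pageHash(s), s))
  .groupByKey()
  .filter { case (_, names) => names.toSet.size > 1 }
println(s"colliding ids: ${collisions.count()}")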
On Sat, Sep 26, 2015 at 10:37 AM JJ wrote
Hi Michael,
Thanks for the tip. With DataFrames, is it possible to explode some selected
fields in each purchase_items?
Since purchase_items is an array of items and each item has a number of
fields (for example product_id and price), is it possible to just explode
these two fields directly using the DataFrame API?
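Something along these lines might work, as a sketch (df and the field names
are taken from the thread; the schema is assumed):

import org.apache.spark.sql.functions.explode

// One row per element of the purchase_items array, then keep only the
// two struct fields of interest.
val items = df.select(explode(df("purchase_items")).as("item"))
  .select("item.product_id", "item.price")
items.show()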
Hi Dibyendu,
Thanks. I believe I understand why it has been an issue using S3 for
checkpoints based on your explanation. But does this limitation apply only
if recovery is needed in case of driver failure?
What if we are not interested in recovery after a driver failure? However,
just for the pur
I wanted to add that we are not configuring the WAL in our scenario.
Thanks again,
Nikunj
On Sat, Sep 26, 2015 at 11:35 AM, N B wrote:
> Hi Dibyendu,
>
> Thanks. I believe I understand why it has been an issue using S3 for
> checkpoints based on your explanation. But does this limitation apply
Hello,
Does anyone have an insight into what could be the issue here?
Thanks
Nikunj
On Fri, Sep 25, 2015 at 10:44 AM, N B wrote:
> Hi Akhil,
>
> I do have 25 partitions being created. I have set
> the spark.default.parallelism property to 25. Batch size is 30 seconds and
> block interval is 1
That means only
Thanks
Best Regards
On Sun, Sep 27, 2015 at 12:07 AM, N B wrote:
> Hello,
>
> Does anyone have an insight into what could be the issue here?
>
> Thanks
> Nikunj
>
>
> On Fri, Sep 25, 2015 at 10:44 AM, N B wrote:
>
>> Hi Akhil,
>>
>> I do have 25 partitions being created. I have
That means only a single receiver is doing all the work, so the data is
local to your N1 machine and all tasks are executed there. To get the data
over to N2, you need to either do a .repartition or set the StorageLevel to
MEMORY*_2, where _2 enables data replication, and I guess that will
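Both options might look roughly like this, as a sketch (stream, ssc, and the
host/port are placeholders, not from the thread):

import org.apache.spark.storage.StorageLevel

// Option 1: redistribute already-received blocks across the executors.
val spread = stream.repartition(25)

// Option 2: replicate blocks at ingest so a second node holds a copy.
val replicated = ssc.socketTextStream("host", 9999,
  StorageLevel.MEMORY_AND_DISK_SER_2)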
Check out the spark-testing-base project. I haven't tried it yet, but it
looks good:
http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
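A minimal test along the lines of that post, as a sketch (trait and package
names as described in the linked article; versions may differ):

import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

// SharedSparkContext provides a SparkContext named 'sc' shared across tests.
class WordCountSpec extends FunSuite with SharedSparkContext {
  test("counting words") {
    val words = sc.parallelize(Seq("a", "b", "a"))
    assert(words.countByValue()("a") === 2)
  }
}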
Dear All,
I have a small Spark cluster for academic purposes and would like to open it
up to a set of friends so that all of us can submit and queue up jobs.
How is that possible? What is the solution to this problem? Any
blog/software/link will be very helpful.
Thanks
~Manish
Related thread:
http://search-hadoop.com/m/q3RTt31EUSYGOj82
Please see:
https://spark.apache.org/docs/latest/security.html
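For example, the access-control settings described there can be supplied
like this (a sketch; the user names are placeholders):

import org.apache.spark.SparkConf

// Standard Spark ACL properties from the security page; pass this conf
// to your SparkContext or via spark-submit --conf flags.
val conf = new SparkConf()
  .set("spark.acls.enable", "true")              // turn ACL checking on
  .set("spark.admin.acls", "manish")             // cluster administrators
  .set("spark.modify.acls", "friend1,friend2")   // who may kill/modify jobs
  .set("spark.ui.view.acls", "friend1,friend2")  // who may view the web UI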
FYI
On Sat, Sep 26, 2015 at 4:03 PM, manish ranjan
wrote:
> Dear All,
>
> I have a small spark cluster for academia purpose and would like it to be
> open to accept jobs f