Re: [GraphX]: Prevent recomputation of DAG

2024-03-18 Thread Mich Talebzadeh
Hi,

I must admit I don't know much about this Fruchterman-Reingold (call
it FR) visualization using GraphX and Kubernetes. But you are
suggesting this slowdown issue starts after the second iteration, and
caching/persisting the graph after each iteration does not help. FR
involves many computations between vertex pairs. In MapReduce (or
shuffle) steps, data may be shuffled across the network, impacting
performance for large graphs. The usual way to verify this is through
the Spark UI, in the Stages, SQL and Executors tabs, where you will
see the time taken for each step and the amount of read/write, etc.
Also, repeatedly creating and destroying GraphX graphs in each
iteration may lead to garbage collection (GC) overhead. So you should
consider profiling your application to identify bottlenecks and
pinpoint which part of the code is causing the slowdown. As I
mentioned, Spark offers profiling tools like the Spark UI, or
third-party libraries, for this purpose.
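
For the caching side specifically, a common pattern in iterative GraphX jobs
is to cache the new graph, checkpoint it every few iterations to truncate the
growing DAG, materialize it, and only then unpersist the previous one. A
minimal sketch in Scala (the update step is a stand-in for the actual FR
force computation; the checkpoint directory and interval are hypothetical):

import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

object IterationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "lineage-truncation-sketch")
    sc.setCheckpointDir("/tmp/graphx-checkpoints") // assumed path

    // Stand-in for the FR graph: vertex attribute = current "position".
    var graph: Graph[Double, Int] =
      GraphGenerators.logNormalGraph(sc, numVertices = 1000)
        .mapVertices((_, _) => 0.0)

    val checkpointEvery = 5 // hypothetical interval
    for (i <- 1 to 50) {
      val prev = graph
      // Stand-in for one FR iteration (force computation + position update).
      graph = graph.mapVertices((_, pos) => pos + 1.0).cache()
      if (i % checkpointEvery == 0) graph.checkpoint() // cut the lineage
      graph.vertices.count()      // materialize before dropping the parent
      prev.unpersist(blocking = false)
    }
    sc.stop()
  }
}

If iteration times stay flat with this pattern but memory still grows, that
points back at GC and profiling as described above.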

HTH


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice: "one test result is worth one-thousand
expert opinions" (Wernher von Braun).




On Sun, 17 Mar 2024 at 18:45, Marek Berith  wrote:
>
> Dear community,
> for my diploma thesis, we are implementing a distributed version of the
> Fruchterman-Reingold visualization algorithm, using GraphX and Kubernetes. Our
> solution is a backend that continuously computes new positions of vertices in a
> graph and sends them via RabbitMQ to a consumer. Fruchterman-Reingold is an
> iterative algorithm, meaning that in each iteration repulsive and attractive
> forces between vertices are computed, and then new positions of vertices based
> on those forces are computed. Graph vertices and edges are stored in a GraphX
> graph structure. Forces between vertices are computed using MapReduce (between
> each pair of vertices) and aggregateMessages (for vertices connected via
> edges). After an iteration of the algorithm, the recomputed positions from the
> RDD are serialized using collect and sent to the RabbitMQ queue.
>
> Here comes the issue. The first two iterations of the algorithm seem to be
> quick, but at the third iteration, the algorithm is very slow until it reaches
> a point at which it cannot finish an iteration in real time. It seems like
> caching of the graph may be an issue, because if we serialize the graph after
> each iteration in an array and create new graph from the array in the new
> iteration, we get a constant usage of memory and each iteration takes the same
> amount of time. We had already tried to cache/persist/checkpoint the graph
> after each iteration but it didn't help, so maybe we are doing something
> wrong. We do not think that serializing the graph into an array should be the
> solution for such a complex library as Apache Spark. I'm also not very
> confident how this fix will affect performance for large graphs or in a
> parallel environment. We are attaching a short example of code that shows an
> iteration of the algorithm, plus example input and output.
>
> We would appreciate it if you could help us fix this issue or give us any
> meaningful ideas, as we have tried everything that came to mind.
>
> We look forward to your reply.
> Thank you, Marek Berith
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: GraphX Support

2022-03-25 Thread Bjørn Jørgensen
Yes, MLlib is actively developed. You can
have a look at GitHub and filter on closed PRs with the ML label.




On Fri, 25 Mar 2022 at 22:15, Bitfox wrote:

> BTW , is MLlib still in active development?
>
> Thanks
>
> On Tue, Mar 22, 2022 at 07:11 Sean Owen  wrote:
>
>> GraphX is not active, though still there and does continue to build and
>> test with each Spark release. GraphFrames kind of superseded it, but is
>> also not super active FWIW.
>>
>> On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez
>>  wrote:
>>
>>> Hello!
>>>
>>>
>>>
>>> My team and I are evaluating GraphX as a possible solution. Would
>>> someone be able to speak to the support of this Spark feature? Is there
>>> active development or is GraphX in maintenance mode (e.g. updated to ensure
>>> functionality with new Spark releases)?
>>>
>>>
>>>
>>> Thanks in advance for your help!
>>>
>>>
>>>
>>> --
>>>
>>> Jacob H. Marquez
>>>
>>> He/Him
>>>
>>> Data & Applied Scientist
>>>
>>> Microsoft Cloud Data Sciences
>>>
>>>
>>>
>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: GraphX Support

2022-03-25 Thread Bitfox
BTW , is MLlib still in active development?

Thanks

On Tue, Mar 22, 2022 at 07:11 Sean Owen  wrote:

> GraphX is not active, though still there and does continue to build and
> test with each Spark release. GraphFrames kind of superseded it, but is
> also not super active FWIW.
>
> On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez 
> wrote:
>
>> Hello!
>>
>>
>>
>> My team and I are evaluating GraphX as a possible solution. Would someone
>> be able to speak to the support of this Spark feature? Is there active
>> development or is GraphX in maintenance mode (e.g. updated to ensure
>> functionality with new Spark releases)?
>>
>>
>>
>> Thanks in advance for your help!
>>
>>
>>
>> --
>>
>> Jacob H. Marquez
>>
>> He/Him
>>
>> Data & Applied Scientist
>>
>> Microsoft Cloud Data Sciences
>>
>>
>>
>


Re: [EXTERNAL] Re: GraphX Support

2022-03-25 Thread Bjørn Jørgensen
One alternative can be to use Spark and ArangoDB <https://www.arangodb.com>

Introducing the new ArangoDB Datasource for Apache Spark
<https://www.arangodb.com/2022/03/introducing-the-new-arangodb-datasource-for-apache-spark/>


ArangoDB is an open-source graph DB with a lot of good graph utilities and
documentation <https://www.arangodb.com/docs/stable/graphs.html>

On Tue, 22 Mar 2022 at 00:49, Jacob Marquez wrote:

> Awesome, thank you!
>
>
>
> *From:* Sean Owen 
> *Sent:* Monday, March 21, 2022 4:11 PM
> *To:* Jacob Marquez 
> *Cc:* user@spark.apache.org
> *Subject:* [EXTERNAL] Re: GraphX Support
>
>
>
>
> GraphX is not active, though still there and does continue to build and
> test with each Spark release. GraphFrames kind of superseded it, but is
> also not super active FWIW.
>
>
>
> On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez <
> jac...@microsoft.com.invalid> wrote:
>
> Hello!
>
>
>
> My team and I are evaluating GraphX as a possible solution. Would someone
> be able to speak to the support of this Spark feature? Is there active
> development or is GraphX in maintenance mode (e.g. updated to ensure
> functionality with new Spark releases)?
>
>
>
> Thanks in advance for your help!
>
>
>
> --
>
> Jacob H. Marquez
>
> He/Him
>
> Data & Applied Scientist
>
> Microsoft Cloud Data Sciences
>
>
>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: GraphX Support

2022-03-22 Thread Enrico Minack
Right, GraphFrames is not very active and maintainers don't even have 
the capacity to make releases.


Enrico


Am 22.03.22 um 00:10 schrieb Sean Owen:
GraphX is not active, though still there and does continue to build 
and test with each Spark release. GraphFrames kind of superseded it, 
but is also not super active FWIW.


On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez 
 wrote:


Hello!

My team and I are evaluating GraphX as a possible solution. Would
someone be able to speak to the support of this Spark feature? Is
there active development or is GraphX in maintenance mode (e.g.
updated to ensure functionality with new Spark releases)?

Thanks in advance for your help!

--

Jacob H. Marquez

He/Him

Data & Applied Scientist

Microsoft Cloud Data Sciences



RE: [EXTERNAL] Re: GraphX Support

2022-03-21 Thread Jacob Marquez
Awesome, thank you!

From: Sean Owen 
Sent: Monday, March 21, 2022 4:11 PM
To: Jacob Marquez 
Cc: user@spark.apache.org
Subject: [EXTERNAL] Re: GraphX Support

GraphX is not active, though still there and does continue to build and test 
with each Spark release. GraphFrames kind of superseded it, but is also not 
super active FWIW.

On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez 
mailto:jac...@microsoft.com.invalid>> wrote:
Hello!

My team and I are evaluating GraphX as a possible solution. Would someone be 
able to speak to the support of this Spark feature? Is there active development 
or is GraphX in maintenance mode (e.g. updated to ensure functionality with new 
Spark releases)?

Thanks in advance for your help!

--
Jacob H. Marquez
He/Him
Data & Applied Scientist
Microsoft Cloud Data Sciences



Re: GraphX Support

2022-03-21 Thread Sean Owen
GraphX is not active, though still there and does continue to build and
test with each Spark release. GraphFrames kind of superseded it, but is
also not super active FWIW.

On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez 
wrote:

> Hello!
>
>
>
> My team and I are evaluating GraphX as a possible solution. Would someone
> be able to speak to the support of this Spark feature? Is there active
> development or is GraphX in maintenance mode (e.g. updated to ensure
> functionality with new Spark releases)?
>
>
>
> Thanks in advance for your help!
>
>
>
> --
>
> Jacob H. Marquez
>
> He/Him
>
> Data & Applied Scientist
>
> Microsoft Cloud Data Sciences
>
>
>


Re: GraphX performance feedback

2019-11-28 Thread mahzad kalantari
Ok thanks!

On Thu, 28 Nov 2019 at 11:27, Phillip Henry wrote:

> I saw a large improvement in my GraphX processing by:
>
> - using fewer partitions
> - using fewer executors but with much more memory.
>
> YMMV.
>
> Phillip
>
> On Mon, 25 Nov 2019, 19:14 mahzad kalantari, 
> wrote:
>
>> Thanks for your answer, my use case is friend recommendation for 200
>> million profiles.
>>
>> On Mon, 25 Nov 2019 at 14:10, Jörn Franke wrote:
>>
>>> I think it depends on what you want to do. Interactive big data graph
>>> analytics are probably better off in JanusGraph or similar.
>>> Batch processing (once-off) can still be fine in GraphX - you have
>>> to design the process carefully, though.
>>>
>>> On 25 Nov 2019 at 20:04, mahzad kalantari <
>>> mahzad.kalant...@gmail.com> wrote:
>>>
>>> 
>>> Hi all
>>>
>>> My question is about GraphX, I'm looking for user feedback on the
>>> performance.
>>>
>>> I read this paper written by the Facebook team that says GraphX has very
>>> poor performance.
>>>
>>> https://engineering.fb.com/core-data/a-comparison-of-state-of-the-art-graph-processing-systems/
>>>
>>>
>>> Has anyone already encountered performance problems with GraphX, and is
>>> it a good choice if I want to do large-scale graph modelling?
>>>
>>>
>>> Thanks!
>>>
>>> Mahzad
>>>
>>>


Re: GraphX performance feedback

2019-11-28 Thread Phillip Henry
I saw a large improvement in my GraphX processing by:

- using fewer partitions
- using fewer executors but with much more memory.

YMMV.

Phillip

On Mon, 25 Nov 2019, 19:14 mahzad kalantari, 
wrote:

> Thanks for your answer, my use case is friend recommendation for 200
> million profiles.
>
> On Mon, 25 Nov 2019 at 14:10, Jörn Franke wrote:
>
>> I think it depends on what you want to do. Interactive big data graph analytics
>> are probably better off in JanusGraph or similar.
>> Batch processing (once-off) can still be fine in GraphX - you have
>> to design the process carefully, though.
>>
>> On 25 Nov 2019 at 20:04, mahzad kalantari <
>> mahzad.kalant...@gmail.com> wrote:
>>
>> 
>> Hi all
>>
>> My question is about GraphX, I'm looking for user feedback on the
>> performance.
>>
>> I read this paper written by the Facebook team that says GraphX has very poor
>> performance.
>>
>> https://engineering.fb.com/core-data/a-comparison-of-state-of-the-art-graph-processing-systems/
>>
>>
>> Has anyone already encountered performance problems with GraphX, and is
>> it a good choice if I want to do large-scale graph modelling?
>>
>>
>> Thanks!
>>
>> Mahzad
>>
>>


Re: GraphX performance feedback

2019-11-25 Thread mahzad kalantari
Thanks for your answer, my use case is friend recommendation for 200
million profiles.

On Mon, 25 Nov 2019 at 14:10, Jörn Franke wrote:

> I think it depends on what you want to do. Interactive big data graph analytics
> are probably better off in JanusGraph or similar.
> Batch processing (once-off) can still be fine in GraphX - you have
> to design the process carefully, though.
>
> On 25 Nov 2019 at 20:04, mahzad kalantari <
> mahzad.kalant...@gmail.com> wrote:
>
> 
> Hi all
>
> My question is about GraphX, I'm looking for user feedback on the
> performance.
>
> I read this paper written by the Facebook team that says GraphX has very poor
> performance.
>
> https://engineering.fb.com/core-data/a-comparison-of-state-of-the-art-graph-processing-systems/
>
>
> Has anyone already encountered performance problems with GraphX, and is it
> a good choice if I want to do large-scale graph modelling?
>
>
> Thanks!
>
> Mahzad
>
>


Re: GraphX performance feedback

2019-11-25 Thread Jörn Franke
I think it depends on what you want to do. Interactive big data graph analytics are
probably better off in JanusGraph or similar.
Batch processing (once-off) can still be fine in GraphX - you have to design
the process carefully, though.

> On 25 Nov 2019 at 20:04, mahzad kalantari wrote:
> 
> 
> Hi all
> 
> My question is about GraphX, I'm looking for user feedback on the
> performance.
>
> I read this paper written by the Facebook team that says GraphX has very poor
> performance.
> https://engineering.fb.com/core-data/a-comparison-of-state-of-the-art-graph-processing-systems/
>
> Has anyone already encountered performance problems with GraphX, and is it a
> good choice if I want to do large-scale graph modelling?
> 
> 
> Thanks!
> 
> Mahzad 


Re: graphx vs graphframes

2019-10-17 Thread Nicolas Paris
Hi Alastair

Cypher support looks promising and the dev list thread discussion
is interesting.
Thanks for your feedback.

On Thu, Oct 17, 2019 at 09:19:28AM +0100, Alastair Green wrote:
> Hi Nicolas, 
> 
> I was following the current thread on the dev channel about Spark
> Graph, including Cypher support, 
> 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Add-spark-dependency-on-on-org-opencypher-okapi-shade-okapi-td28118.html
> 
> and I remembered your post.
> 
> Actually, GraphX and GraphFrames are both not being developed actively, so far
> as I can tell. 
> 
> The only activity on GraphX in the last two years was a fix for Scala 2.13
> functionality: to quote the PR 
> 
> 
> ### Does this PR introduce any user-facing change?
> 
> No behavior change at all.
> 
> The only activity on GraphFrames since the addition of Pregel support in Scala
> back in December 2018, has been build/test improvements and recent builds
> against 2.4 and 3.0 snapshots. I’m not sure there was a lot of functional
> change before that either. 
> 
> The efforts to provide graph processing in Spark with the more full-featured
> Cypher query language that you can see in the proposed 3.0 changes discussed 
> in
> the dev list, and the related openCypher/morpheus project (which among many
> other things allows you to cast a Morpheus graph into a GraphX graph) and
> extends the proposed 3.0 changes in a compatible way, are active. 
> 
> Yrs, 
> 
> Alastair
> 
> 
> Alastair Green
> 
> Query Languages Standards and Research
> 
> 
> Neo4j UK Ltd
> 
> Union House
> 182-194 Union Street
> London, SE1 0LH
> 
> 
> +44 795 841 2107
> 
> 
> On Sun, Sep 22, 2019 at 21:17, Nicolas Paris  wrote:
> 
> hi all
> 
> graphframes was intended to replace graphx.
> 
> however the former looks not maintained anymore while the latter is
> still active.
> 
> any thought ?
> --
> nicolas
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 

-- 
nicolas

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: graphx vs graphframes

2019-10-17 Thread Alastair Green
Hi Nicolas,
I was following the current thread on the dev channel about Spark Graph, 
including Cypher support,
http://apache-spark-developers-list.1001551.n3.nabble.com/Add-spark-dependency-on-on-org-opencypher-okapi-shade-okapi-td28118.html
 
and I remembered your post.
Actually, GraphX and GraphFrames are both not being developed actively, so far 
as I can tell.
The only activity on GraphX in the last two years was a fix for Scala 2.13 
functionality: to quote the PR
### Does this PR introduce any user-facing change?
No behavior change at all.

The only activity on GraphFrames since the addition of Pregel support in Scala 
back in December 2018, has been build/test improvements and recent builds 
against 2.4 and 3.0 snapshots. I’m not sure there was a lot of functional 
change before that either.
The efforts to provide graph processing in Spark with the more full-featured 
Cypher query language that you can see in the proposed 3.0 changes discussed in 
the dev list, and the related openCypher/morpheus project (which among many 
other things allows you to cast a Morpheus graph into a GraphX graph) and 
extends the proposed 3.0 changes in a compatible way, are active.
Yrs,
Alastair
Alastair Green

Query Languages Standards and Research




Neo4j UK Ltd

Union House
182-194 Union Street
London, SE1 0LH




+44 795 841 2107


On Sun, Sep 22, 2019 at 21:17, Nicolas Paris  wrote:
hi all

graphframes was intended to replace graphx.

however the former looks not maintained anymore while the latter is
still active.

any thought ?
--
nicolas

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: [GraphX] Preserving Partitions when reading from HDFS

2019-04-25 Thread M Bilal
If I understand correctly, this would set the split size in the Hadoop
configuration when reading a file. I can see that being useful when you want
to create more partitions than what the block size in HDFS might dictate.
Instead, what I want to do is create a single partition for each file
written by a task (from, say, a previous job), i.e. data in part-0 forms
partition 1, part-1 forms partition 2, and so on and so forth.

- Bilal
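
As a sketch of one way to get that (not a tested recipe): list the part
files yourself, load each with minPartitions = 1, and union the results, so
part-0 becomes partition 0, part-1 becomes partition 1, and so on. This
assumes each part file fits in a single HDFS block; larger files would still
be split:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def onePartitionPerFile(sc: SparkContext, dir: String): RDD[String] = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val files = fs.listStatus(new Path(dir))
    .map(_.getPath.toString)
    .filter(_.contains("part-"))  // keep only the part files
    .sorted
  // One textFile per part file, then a union that keeps one partition each.
  sc.union(files.map(p => sc.textFile(p, minPartitions = 1)).toSeq)
}

Note this preserves the file-to-partition mapping but not any custom
Partitioner object; the partitioning semantics would still need to be
re-declared on the GraphX side.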

On Tue, Apr 16, 2019, 6:00 AM Manu Zhang  wrote:

> You may try
> `sparkContext.hadoopConfiguration().set("mapred.max.split.size",
> "33554432")` to tune the partition size when reading from HDFS.
>
> Thanks,
> Manu Zhang
>
> On Mon, Apr 15, 2019 at 11:28 PM M Bilal  wrote:
>
>> Hi,
>>
>> I have implemented a custom partitioning algorithm to partition graphs in
>> GraphX. Saving the partitioning graph (the edges) to HDFS creates separate
>> files in the output folder with the number of files equal to the number of
>> Partitions.
>>
>> However, reading back the edges creates number of partitions that are
>> equal to the number of blocks in the HDFS folder. Is there a way to instead
>> create the same number of partitions as the number of files written to HDFS
>> while preserving the original partitioning?
>>
>> I would like to avoid repartitioning.
>>
>> Thanks.
>> - Bilal
>>
>


Re: [GraphX] Preserving Partitions when reading from HDFS

2019-04-15 Thread Manu Zhang
You may try
`sparkContext.hadoopConfiguration().set("mapred.max.split.size",
"33554432")` to tune the partition size when reading from HDFS.

Thanks,
Manu Zhang

On Mon, Apr 15, 2019 at 11:28 PM M Bilal  wrote:

> Hi,
>
> I have implemented a custom partitioning algorithm to partition graphs in
> GraphX. Saving the partitioning graph (the edges) to HDFS creates separate
> files in the output folder with the number of files equal to the number of
> Partitions.
>
> However, reading back the edges creates number of partitions that are
> equal to the number of blocks in the HDFS folder. Is there a way to instead
> create the same number of partitions as the number of files written to HDFS
> while preserving the original partitioning?
>
> I would like to avoid repartitioning.
>
> Thanks.
> - Bilal
>


Re: GraphX subgraph from list of VertexIds

2017-05-12 Thread Robineast
It would be listVertices.contains(vid), wouldn't it?
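
A minimal sketch of that suggestion (the graph and the id list are assumed):

val keep: Set[VertexId] = Set(1L, 2L, 3L) // hypothetical vertex ids
val sub = graph.subgraph(vpred = (vid, _) => keep.contains(vid))

Using a Set rather than a List makes the per-vertex membership test O(1).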



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-subgraph-from-list-of-VertexIds-tp28677p28679.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: GraphX Pregel API: add vertices and edges

2017-03-23 Thread Robineast
From the section on the Pregel API in the GraphX programming guide: '... the
Pregel operator in GraphX is a bulk-synchronous parallel messaging
abstraction /constrained to the topology of the graph/.'. Does that answer
your question? Did you read the programming guide?



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pregel-API-add-vertices-and-edges-tp28519p28529.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: GraphX Pregel API: add vertices and edges

2017-03-23 Thread Robineast
GraphX is not synonymous with Pregel. To quote the GraphX programming guide:
'GraphX exposes a variant of the Pregel API.' There is no compute()
function in GraphX - see the Pregel API section of the programming guide for
details on how GraphX implements a Pregel-like API.
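
For reference, the shape of that API is shown by the single-source
shortest-paths example in the programming guide; a close paraphrase,
assuming a SparkContext sc as in spark-shell:

import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

// Toy graph with Double edge weights; vertex 42 is the assumed source.
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100)
    .mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42L
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),      // vertex program
  triplet => {                                         // send messages
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty
  },
  (a, b) => math.min(a, b)                             // merge messages
)

There is no user-visible compute() method; the vertex program plus the send
and merge functions above play that role.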



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pregel-API-add-vertices-and-edges-tp28519p28527.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: GraphX Pregel API: add vertices and edges

2017-03-23 Thread Robineast
Not that I'm aware of. Where did you read that?



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pregel-API-add-vertices-and-edges-tp28519p28523.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Graphx Examples for ALS

2017-02-17 Thread Irving Duran
Not sure I follow your question.  Do you want to use ALS or GraphX?
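
If it is ALS you are after, the usual route is MLlib rather than GraphX. A
hedged sketch (the ratings DataFrame and its column names are assumptions,
and recommendForAllUsers needs Spark 2.2+):

import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
val model = als.fit(ratings)               // ratings: DataFrame(userId, itemId, rating)
val top10 = model.recommendForAllUsers(10) // top-10 items per user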


Thank You,

Irving Duran

On Fri, Feb 17, 2017 at 7:07 AM, balaji9058  wrote:

> Hi,
>
> Where can I find the ALS recommendation algorithm for large data sets?
>
> Please feel free to share your ideas/algorithms/logic to build a
> recommendation engine using Spark GraphX.
>
> Thanks in advance.
>
> Thanks,
> Balaji
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Graphx-Examples-for-ALS-tp28401.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Graphx triplet comparison

2016-12-14 Thread Robineast
You are trying to invoke one RDD action inside another; that won't work. If you
want to do what you are attempting, you need to .collect() each triplet to
the driver and iterate over that.

HOWEVER you almost certainly don't want to do that, not if your data are
anything other than a trivial size. In essence you are doing a cartesian
join followed by a filter - that doesn't scale. You might want to consider
joining one triplet RDD to another and then evaluating the condition.
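
A hedged sketch of that join-based approach, keyed on the destination-name
condition from the code quoted below (the graphs and attribute names are the
original poster's):

val subByName  = subGraph.triplets.map(t => (t.dstAttr.name, t))
val mainByName = mainGraph.triplets.map(t => (t.dstAttr.name, t))
// Only pairs whose destination names match survive the join.
subByName.join(mainByName).foreach { case (name, _) => println("hello " + name) }

Unlike the nested loop, this runs as a single distributed job.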



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-triplet-comparison-tp28198p28208.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Graphx triplet comparison

2016-12-13 Thread balaji9058
Hi Thanks for reply.

Here is my code:
class BusStopNode(val name: String,val mode:String,val maxpasengers :Int)
extends Serializable
case class busstop(override val name: String,override val mode:String,val
shelterId: String, override val maxpasengers :Int) extends
BusStopNode(name,mode,maxpasengers) with Serializable
case class busNodeDetails(override val name: String,override val
mode:String,val srcId: Int,val destId :Int,val arrivalTime :Int,override val
maxpasengers :Int) extends BusStopNode(name,mode,maxpasengers) with
Serializable
case class routeDetails(override val name: String,override val
mode:String,val srcId: Int,val destId :Int,override val maxpasengers :Int)
extends BusStopNode(name,mode,maxpasengers) with Serializable

val busstopRDD: RDD[(VertexId, BusStopNode)] =
  sc.textFile("\\BusStopNameMini.txt").filter(!_.startsWith("#")).
map { line =>
  val row = line split ","
  (row(0).toInt, new
busstop(row(0),row(3),row(1)+row(0),row(2).toInt))
}

busstopRDD.foreach(println)

val busNodeDetailsRdd: RDD[(VertexId, BusStopNode)] =
  sc.textFile("\\RouteDetails.txt").filter(!_.startsWith("#")).
map { line =>
  val row = line split ","
  (row(0).toInt, new
busNodeDetails(row(0),row(4),row(1).toInt,row(2).toInt,row(3).toInt,0))
}
busNodeDetailsRdd.foreach(println)

 val detailedStats: RDD[Edge[BusStopNode]] =
sc.textFile("\\routesEdgeNew.txt").
filter(! _.startsWith("#")).
map {line =>
val row = line split ','
Edge(row(0).toInt, row(1).toInt,new BusStopNode(row(2),
row(3),1)
   )}

val busGraph = busstopRDD ++ busNodeDetailsRdd
busGraph.foreach(println)
val mainGraph = Graph(busGraph, detailedStats)
mainGraph.triplets.foreach(println)
 val subGraph = mainGraph subgraph (epred = _.srcAttr.name == "101")
 //Working Fine
 for (subTriplet <- subGraph.triplets) {
 println(subTriplet.dstAttr.name)
 }
 
 //Working fine
  for (mainTriplet <- mainGraph.triplets) {
 println(mainTriplet.dstAttr.name)
 }
 
 //causing error while iterating both at same time
 for (subTriplet <- subGraph.triplets) {
for (mainTriplet <- mainGraph.triplets) {   //Nullpointer exception
is causing here
   if
(subTriplet.dstAttr.name.toString.equals(mainTriplet.dstAttr.name)) {

  println("hello")//success case on both destination names of of
subgraph and maingraph
}
  }
}
}

BusStopNameMini.txt
101,bs,10,B
102,bs,10,B
103,bs,20,B
104,bs,14,B
105,bs,8,B


RouteDetails.txt

#101,102,104  4 5 6
#102,103 3 4
#103,105,104 2 3 4
#104,102,101  4 5 6
#104,1015
#105,104,102 5 6 2
1,101,104,5,R
2,102,103,5,R
3,103,104,5,R
4,102,103,5,R
5,104,101,5,R
6,105,102,5,R

routesEdgeNew.txt it contains two types of edges are bus to bus with edge
value is distance and bus to route with edge value as time
#101,102,104  4 5 6
#102,103 3 4
#103,105,104 2 3 4
#104,102,101  4 5 6
#104,1015
#105,104,102 5 6 2
101,102,4,BS
102,104,5,BS
102,103,3,BS
103,105,4,BS
105,104,3,BS
104,102,4,BS
102,101,5,BS
104,101,5,BS
105,104,5,BS
104,102,6,BS
101,1,4,R,102
101,1,4,R,103
102,2,5,R
103,3,6,R
103,3,5,R
104,4,7,R
105,5,4,Z
101,2,9,R
105,5,4,R
105,2,5,R
104,2,5,R
103,1,4,R
101,103,4,BS
101,104,4,BS
101,105,4,BS
101,103,5,BS
101,104,5,BS
101,105,5,BS
1,101,4,R







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-triplet-comparison-tp28198p28205.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Graphx triplet comparison

2016-12-13 Thread Robineast
Not sure what you are asking. What's wrong with:

triplet1.filter(condition3)
triplet2.filter(condition3)




-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-triplet-comparison-tp28198p28202.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [GraphX] Extreme scheduler delay

2016-12-06 Thread Sean Owen
(For what it is worth, I happened to look into this with Anton earlier and
am also pretty convinced it's related to GraphX rather than the app. It's
somewhat difficult to debug what gets sent in the closure AFAICT.)
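
In the general case, the "task of very large size" warning quoted below
means a big driver-side object is being captured in a task closure; the
standard remedy, sketched here with a hypothetical matrix, is to broadcast
it and read it through the broadcast handle:

import org.apache.spark.SparkContext
import org.apache.spark.graphx._

// matrix stands in for whatever large object the closure would capture.
def buildVertices(sc: SparkContext, matrix: Array[Array[Double]]): VertexRDD[Double] = {
  val bc = sc.broadcast(matrix)      // shipped once per executor, not per task
  val verts = sc.parallelize(0L until matrix.length.toLong).map { vid =>
    (vid, bc.value(vid.toInt).sum)   // read through the broadcast
  }
  VertexRDD(verts)
}

Whether that applies here depends on whether the capture happens in the
application or inside GraphX itself, as noted above.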

On Tue, Dec 6, 2016 at 7:49 PM AntonIpp  wrote:

> Hi everyone,
>
> I have a small Scala test project which uses GraphX and for some reason has
> extreme scheduler delay when executed on the cluster. The problem is not
> related to the cluster configuration, as other GraphX applications run
> without any issue.
> I have attached the source code ( MatrixTest.scala
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n28162/MatrixTest.scala
> >
> ), it creates a sort of a  GraphGenerators.gridGraph
> <
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.util.GraphGenerators$
> >
> (but with diagonal edges too) using data from a matrix inside the Map
> class.
> There are in reality only 4 lines related to GraphX itself: creating a
> VertexRDD, creating an EdgeRDD, creating a Graph and then calling
> graph.edges.count.
> As you can see on the  Spark History Server
> <
> http://cdhdns-mn0.westeurope.cloudapp.azure.com:18088/history/application_1480677653852_0050/jobs/
> >
> , the task has very significant scheduler delay. There is also the
> following
> warning in the logs (I have attached them too:  MatrixTest.log
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n28162/MatrixTest.log
> >
> ) : "WARN scheduler.TaskSetManager: Stage 0 contains a task of very large
> size (2905 KB). The maximum recommended task size is 100 KB."
> This also happens with .aggregateMessages.collect and Pregel. I have tested
> with Spark 1.6 and 2.0, different levels of parallelism, different number
> of
> executors, etc but the scheduler delay is still there and grows more and
> more extreme as the number of vertices and edges grows.
>
> Does anyone have any idea as to what could be the source of the issue?
> Thank you!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Extreme-scheduler-delay-tp28162.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: GraphX Pregel not update vertex state properly, cause messages loss

2016-11-28 Thread rohit13k
Found the exact issue. If the vertex attribute is a complex object with
mutable fields, the edge triplet does not pick up the new state once the
vertex attributes have already been shipped, but if the vertex attributes are
immutable objects then there is no issue. Below is code for the same. Just
changing the mutable HashMap to an immutable HashMap solves the issue. (This
is not a fix for the bug; either users should be made aware of this
limitation, or the bug needs to be fixed for mutable objects.)

import org.apache.spark.graphx._
import com.alibaba.fastjson.JSONObject
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.log4j.Logger
import org.apache.log4j.Level
import scala.collection.mutable.HashMap


object PregelTest {
  val logger = Logger.getLogger(getClass().getName());
  def run(graph: Graph[HashMap[String, Int], HashMap[String, Int]]):
Graph[HashMap[String, Int], HashMap[String, Int]] = {

def vProg(v: VertexId, attr: HashMap[String, Int], msg: Integer):
HashMap[String, Int] = {
  var updatedAttr = attr
  
  if (msg < 0) {
// init message received 
if (v.equals(0.asInstanceOf[VertexId])) updatedAttr =
attr.+=("LENGTH" -> 0)
else updatedAttr = attr.+=("LENGTH" -> Integer.MAX_VALUE)
  } else {
updatedAttr = attr.+=("LENGTH" -> (msg + 1))
  }
  updatedAttr
}

def sendMsg(triplet: EdgeTriplet[HashMap[String, Int], HashMap[String,
Int]]): Iterator[(VertexId, Integer)] = {
  val len = triplet.srcAttr.get("LENGTH").get
  // send a msg if last hub is reachable 
  if (len < Integer.MAX_VALUE) Iterator((triplet.dstId, len))
  else Iterator.empty
}

def mergeMsg(msg1: Integer, msg2: Integer): Integer = {
  if (msg1 < msg2) msg1 else msg2
}

Pregel(graph, new Integer(-1), 3, EdgeDirection.Either)(vProg, sendMsg,
mergeMsg)
  }

  def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val conf = new SparkConf().setAppName("Pregel Test")
conf.set("spark.master", "local")
val sc = new SparkContext(conf)
val test = new HashMap[String, Int]

// create a simplest test graph with 3 nodes and 2 edges 
val vertexList = Array(
  (0.asInstanceOf[VertexId], new HashMap[String, Int]),
  (1.asInstanceOf[VertexId], new HashMap[String, Int]),
  (2.asInstanceOf[VertexId], new HashMap[String, Int]))
val edgeList = Array(
  Edge(0.asInstanceOf[VertexId], 1.asInstanceOf[VertexId], new
HashMap[String, Int]),
  Edge(1.asInstanceOf[VertexId], 2.asInstanceOf[VertexId], new
HashMap[String, Int]))

val vertexRdd = sc.parallelize(vertexList)
val edgeRdd = sc.parallelize(edgeList)
val g = Graph[HashMap[String, Int], HashMap[String, Int]](vertexRdd,
edgeRdd)

// run test code 
val lpa = run(g)
lpa.vertices.collect().map(println)
  }
}
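
For contrast, a sketch of the immutable variant of vProg described above:
each call returns a new Map instead of mutating the shipped attribute in
place.

def vProgImmutable(v: VertexId, attr: Map[String, Int], msg: Integer): Map[String, Int] = {
  if (msg < 0) {
    // init message received
    if (v == 0L) attr + ("LENGTH" -> 0)
    else attr + ("LENGTH" -> Integer.MAX_VALUE)
  } else {
    attr + ("LENGTH" -> (msg + 1))
  }
}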



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pregel-not-update-vertex-state-properly-cause-messages-loss-tp28100p28139.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: GraphX Pregel not update vertex state properly, cause messages loss

2016-11-24 Thread 吴 郎
Thank you, Dale, I've realized in what situation this bug is triggered.
Actually, it seems that any user-defined class with dynamic (mutable) fields
(such as Map, List...) cannot be used as a message, or it'll be lost in the
next supersteps. To work around this, I tried to deep-copy a new message
object every time the vertex program runs, and it has worked so far, though
it's obviously not an elegant way.

fuz woo
 
--
Best regards,
吴 郎 (Wu Lang)

---
College of Computer, National University of Defense Technology
Kaifu District, Changsha, Hunan Province, 410073
Email: fuz@qq.com






 




-- Original --
From: "Dale Wang"<w.zhaok...@gmail.com>; 
Date: Thursday, 24 November 2016, 11:10 AM
To: "吴 郎"<fuz@qq.com>; 
Cc: "user"<user@spark.apache.org>; 
Subject: Re: GraphX Pregel not update vertex state properly, cause messages loss




The problem comes from the inconsistency between the graph’s triplet view and
vertex view. The message may not be lost; it is just not sent in the sendMsg
function, because sendMsg gets the wrong value of srcAttr!

It is not a new bug. I met a similar bug that appeared in version 1.2.1,
according to JIRA-6378. I can reproduce that inconsistency bug with a
small and simple program (see that JIRA issue for more details). It seems that
in some situations the triplet view of a Graph object does not update
consistently with the vertex view. The GraphX Pregel API heavily relies on the
mapReduceTriplets (old) / aggregateMessages (new) API, which heavily relies on the
correct behavior of the triplet view of a graph. Thus this bug influences
the behavior of the Pregel API.

Though I cannot figure out why the bug appears either, I suspect that the
bug has some connection with the data type of the vertex property. If you use
primitive types such as Double and Long, it is OK. But if you use some
self-defined type with mutable fields such as mutable Map and mutable
ArrayBuffer, the bug appears. In your case I notice that you use JSONObject as
your vertex’s data type. After looking up the definition of JSONObject,
JSONObject has a Java map as its field to store data, which is mutable. To
temporarily avoid the bug, you can modify the data type of your vertex
property to avoid any mutable data type, by replacing mutable data collections
with immutable data collections provided by Scala and replacing var fields
with val fields. At least, that suggestion works for me.

Zhaokang Wang



2016-11-18 11:47 GMT+08:00 fuz_woo <fuz@qq.com>:
Hi everyone, I encountered a strange problem these days when I'm attempting
 to use the GraphX Pregel interface to implement a simple
 single-source-shortest-path algorithm.
 below is my code:
 
 import com.alibaba.fastjson.JSONObject
 import org.apache.spark.graphx._
 
 import org.apache.spark.{SparkConf, SparkContext}
 
 object PregelTest {
 
   def run(graph: Graph[JSONObject, JSONObject]): Graph[JSONObject,
 JSONObject] = {
 
 def vProg(v: VertexId, attr: JSONObject, msg: Integer): JSONObject = {
   if ( msg < 0 ) {
 // init message received
 if ( v.equals(0.asInstanceOf[VertexId]) ) attr.put("LENGTH", 0)
 else attr.put("LENGTH", Integer.MAX_VALUE)
   } else {
 attr.put("LENGTH", msg+1)
   }
   attr
 }
 
 def sendMsg(triplet: EdgeTriplet[JSONObject, JSONObject]):
 Iterator[(VertexId, Integer)] = {
   val len = triplet.srcAttr.getInteger("LENGTH")
   // send a msg if last hub is reachable
   if ( len < Integer.MAX_VALUE ) Iterator((triplet.dstId, len))
   else Iterator.empty
 }
 [...]

 and after I run the code, I got an incorrect result in which the vertex 2 has
 a LENGTH label valued Integer.MAX_VALUE; it seems that the
 messages sent to vertex 2 were lost unexpectedly. I then tracked the debugger
 to file Pregel.scala, where I saw the code:
 
 
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n28100/%E7%B2%98%E8%B4%B4%E5%9B%BE%E7%89%87.png>
 
 In the first iteration 0, the variable messages in line 138 is reconstructed,
 and then recomputed in line 143, where activeMessages gets a value of 0,
 which means the messages are lost.
 Then I set a breakpoint in line 138, and before its execution I execute an
 expression " g.triplets().collect() " which just collects the updated graph
 data. After I do this and execute the rest of the code, the messages are no
 longer empty and activeMessages gets the value 1 as expected.
 
 I have tested the code with both Spark 1.4 and 1.6 in Scala 2.10,
 and got the same result.
 
 I must say this problem makes me really confused, I've spent almost 2 weeks
 trying to resolve it and I have no idea what to do now. If this is not a bug, I
 totally can't understand why just executing a non-disturbing expression (
 g.triplets().collect(), which just collects the data and does no computation )
 could change anything essential; it's really ridiculous.
 
 
 
 --
 View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pregel-not-update-vertex-sta

Re: GraphX Pregel not update vertex state properly, cause messages loss

2016-11-23 Thread Dale Wang
The problem comes from the inconsistency between the graph’s triplet view and
vertex view. The message may not be lost; it is just not sent
in the sendMsg function, because sendMsg gets the wrong value of srcAttr!

It is not a new bug. I met a similar bug that appeared in version 1.2.1
according to JIRA-6378
before. I can reproduce that inconsistency bug with a small and simple
program (see that JIRA issue for more details). It seems that in some
situations the triplet view of a Graph object does not update consistently
with the vertex view. The GraphX Pregel API heavily relies on the
mapReduceTriplets (old) / aggregateMessages (new) API, which heavily relies
on the correct behavior of the triplet view of a graph. Thus this bug
influences the behavior of the Pregel API.

Though I cannot figure out why the bug appears either, I suspect that
the bug has some connection with the data type of the vertex property. If
you use *primitive* types such as Double and Long, it is OK. But if you use
some self-defined type with mutable fields such as mutable Map and mutable
ArrayBuffer, the bug appears. In your case I notice that you use JSONObject
as your vertex’s data type. After looking up the definition of JSONObject,
JSONObject has a Java map as its field to store data, which is mutable. To
temporarily avoid the bug, you can modify the data type of your vertex
property to avoid any mutable data type, by replacing mutable data
collections with immutable data collections provided by Scala and replacing
var fields with val fields. At least, that suggestion works for me.

Zhaokang Wang
​

2016-11-18 11:47 GMT+08:00 fuz_woo :

> Hi everyone, I encountered a strange problem these days when I'm attempting
> to use the GraphX Pregel interface to implement a simple
> single-source-shortest-path algorithm.
> below is my code:
>
> import com.alibaba.fastjson.JSONObject
> import org.apache.spark.graphx._
>
> import org.apache.spark.{SparkConf, SparkContext}
>
> object PregelTest {
>
>   def run(graph: Graph[JSONObject, JSONObject]): Graph[JSONObject,
> JSONObject] = {
>
> def vProg(v: VertexId, attr: JSONObject, msg: Integer): JSONObject = {
>   if ( msg < 0 ) {
> // init message received
> if ( v.equals(0.asInstanceOf[VertexId]) ) attr.put("LENGTH", 0)
> else attr.put("LENGTH", Integer.MAX_VALUE)
>   } else {
> attr.put("LENGTH", msg+1)
>   }
>   attr
> }
>
> def sendMsg(triplet: EdgeTriplet[JSONObject, JSONObject]):
> Iterator[(VertexId, Integer)] = {
>   val len = triplet.srcAttr.getInteger("LENGTH")
>   // send a msg if last hub is reachable
>   if ( len < Integer.MAX_VALUE ) Iterator((triplet.dstId, len))
>   else Iterator.empty
> }
>
> def mergeMsg(msg1: Integer, msg2: Integer): Integer = {
>   if ( msg1 < msg2 ) msg1 else msg2
> }
>
> Pregel(graph, new Integer(-1), 3, EdgeDirection.Out)(vProg, sendMsg,
> mergeMsg)
>   }
>
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("Pregel Test")
> conf.set("spark.master", "local")
> val sc = new SparkContext(conf)
>
> // create a simplest test graph with 3 nodes and 2 edges
> val vertexList = Array(
>   (0.asInstanceOf[VertexId], new JSONObject()),
>   (1.asInstanceOf[VertexId], new JSONObject()),
>   (2.asInstanceOf[VertexId], new JSONObject()))
> val edgeList = Array(
>   Edge(0.asInstanceOf[VertexId], 1.asInstanceOf[VertexId], new
> JSONObject()),
>   Edge(1.asInstanceOf[VertexId], 2.asInstanceOf[VertexId], new
> JSONObject()))
>
> val vertexRdd = sc.parallelize(vertexList)
> val edgeRdd = sc.parallelize(edgeList)
> val g = Graph[JSONObject, JSONObject](vertexRdd, edgeRdd)
>
> // run test code
> val lpa = run(g)
> lpa
>   }
> }
>
> and after I run the code, I got an incorrect result in which the vertex 2
> has a LENGTH label valued Integer.MAX_VALUE; it seems that the
> messages sent to vertex 2 were lost unexpectedly. I then tracked the
> debugger
> to file Pregel.scala, where I saw the code:
>
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28100/%E7%B2%98%E8%B4%B4%E5%9B%BE%E7%89%87.png>
>
> In the first iteration 0, the variable messages in line 138 is
> reconstructed, and then recomputed in line 143, where activeMessages gets
> a value of 0, which means the messages are lost.
> Then I set a breakpoint in line 138, and before its execution I execute an
> expression " g.triplets().collect() " which just collects the updated graph
> data. After I do this and execute the rest of the code, the messages are no
> longer empty and activeMessages gets the value 1 as expected.
>
> I have tested the code with both Spark 1.4 and 1.6 in Scala 2.10,
> and got the same result.
>
> I must say this problem makes me really confused, I've spent almost 2 weeks
> trying to resolve it and I have no idea what to do now. If this is not a bug, I
> totally can't understand why just executing a non-disturbing expression
> ( g.triplets().collect(), which just collects the data and does no computation )
> could change anything essential; it's really ridiculous.

Re: GraphX Pregel not update vertex state properly, cause messages loss

2016-11-23 Thread rohit13k
Created a JIRA for the same

https://issues.apache.org/jira/browse/SPARK-18568



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pregel-not-update-vertex-state-properly-cause-messages-loss-tp28100p28124.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: GraphX Pregel not update vertex state properly, cause messages loss

2016-11-23 Thread rohit13k
Hi 

I am facing a similar issue. It's not that the message is getting lost or
something. The vertex 1 attributes change in superstep 1, but when sendMsg
gets the vertex attribute from the edge triplet in the 2nd superstep it
still has the old value of vertex 1 and not the latest value. So as per
your code no new msg will be generated in that superstep. I think the bug is
in the ReplicatedVertexView, where the srcAttr and dstAttr of the
edge triplet are supposed to be updated from the latest version of the
vertex after each superstep.

How do I get this bug raised? I am struggling to find an exact solution for
it, except for recreating the graph after every superstep to force the edge
triplets to have the latest value of the vertex, but this is not a good
solution performance-wise.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pregel-not-update-vertex-state-properly-cause-messages-loss-tp28100p28123.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: GraphX Connected Components

2016-11-08 Thread Robineast
Have you tried this?
https://spark.apache.org/docs/2.0.1/api/scala/index.html#org.apache.spark.graphx.GraphLoader$
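
A minimal sketch of that route (the edge-list path is hypothetical, sc as in
spark-shell):

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
val components = graph.connectedComponents().vertices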



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Connected-Components-tp10869p28049.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: GraphX drawing algorithm

2016-09-11 Thread Michael Malak
In chapter 10 of Spark GraphX In Action, we describe how to use Zeppelin with 
d3.js to render graphs using d3's force-directed rendering algorithm. The 
source code can be downloaded for free from 
https://www.manning.com/books/spark-graphx-in-action
  From: agc studio 
 To: user@spark.apache.org 
 Sent: Sunday, September 11, 2016 5:59 PM
 Subject: GraphX drawing algorithm
   
Hi all,
I was wondering if a force-directed graph drawing algorithm has been 
implemented for graphX?

Thanks

   

Re: GraphX performance and settings

2016-07-22 Thread B YL
Hi,
We are also running a Connected Components test with GraphX. We ran experiments
using Spark 1.6.1 on machines which have 16 cores with 2-way SMT, running only a
single executor per machine. We got this result:
on a Facebook-like graph with 2^24 edges, using 4 executors with 90GB each, it took
100 seconds to find the connected components. It takes 600s when we tried to
increase the number of edges to 2^27. We are very interested in how you got
such good results.
We would appreciate it if you could answer the following questions:
1. Which connected components code did you use? Did you use the default
org.apache.spark.graphx.ConnectedComponents lib, which is implemented using
Pregel? Have you made any changes?
2. By saying 20 cores with 2-way, did you mean 40 CPU threads in total?
3. In addition to the settings you have mentioned, have you made any other changes
in the files spark-defaults.conf and spark-env.sh? Could you please paste the
two files so that we can compare?
4. When you mention Parallel GC, could you please give more detailed guidance
on how to optimize this setting? Which parameters should we set?

We appreciate any feedback!
Thank you,
Yilei

On 2016-06-16 09:01 (+0800), Maja Kabiljo wrote: 
> Hi,
>
> We are running some experiments with GraphX in order to compare it with other
> systems. There are multiple settings which significantly affect performance,
> and we experimented a lot in order to tune them well. I'll share here what
> are the best we found so far and which results we got with them, and would
> really appreciate if anyone who used GraphX before has any advice on what
> else can make it even better, or confirm that these results are as good as it
> gets.
>
> Algorithms we used are pagerank and connected components. We used Twitter and
> UK graphs from the GraphX paper
> (https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf), and
> also generated graphs with properties similar to Facebook social graph with
> various number of edges. Apart from performance we tried to see what is the
> minimum amount of resources it requires in order to handle graph of some
> size.
>
> We ran experiments using Spark 1.6.1, on machines which have 20 cores with
> 2-way SMT, always fixing number of executors (min=max=initial), giving 40GB
> or 80GB per executor, and making sure we run only a single executor per
> machine. Additionally we used:
>
> * spark.shuffle.manager=hash, spark.shuffle.service.enabled=false
> * Parallel GC
> * PartitionStrategy.EdgePartition2D
> * 8*numberOfExecutors partitions
>
> Here are some data points which we got:
>
> * Running on Facebook-like graph with 2 billion edges, using 4 executors with
> 80GB each it took 451 seconds to do 20 iterations of pagerank and 236 seconds
> to find connected components. It failed when we tried to use 2 executors, or
> 4 executors with 40GB each.
> * For graph with 10 billion edges we needed 16 executors with 80GB each (it
> failed with 8), 1041 seconds for 20 iterations of pagerank and 716 seconds
> for connected components.
> * Twitter-2010 graph (1.5 billion edges), 8 executors, 40GB each, pagerank
> 473s, connected components 264s. With 4 executors 80GB each it worked but was
> struggling (pr 2475s, cc 4499s), with 8 executors 80GB pr 362s, cc 255s.
>
> One more thing, we were not able to reproduce what's mentioned in the paper
> about fault tolerance (section 5.2). If we kill an executor during first few
> iterations it recovers successfully, but if killed in later iterations
> reconstruction of each iteration starts taking exponentially longer and
> doesn't finish after letting it run for a few hours. Are there some
> additional parameters which we need to set in order for this to work?
>
> Any feedback would be highly appreciated!
>
> Thank you,
> Maja
>


Sent from my iPhone
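
For reference, a sketch that pulls the settings quoted above into code
(Spark 1.6-era values from this thread; the executor count and input path
are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

val numExecutors = 4
val conf = new SparkConf()
  .set("spark.shuffle.manager", "hash")                         // pre-2.0 only
  .set("spark.shuffle.service.enabled", "false")
  .set("spark.executor.extraJavaOptions", "-XX:+UseParallelGC") // Parallel GC
val sc = new SparkContext(conf)

// EdgePartition2D with 8 * numberOfExecutors partitions, as in the thread.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges")
  .partitionBy(PartitionStrategy.EdgePartition2D, 8 * numExecutors)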




Re: GraphX performance and settings

2016-06-22 Thread Maja Kabiljo
Thank you for the reply Deepak.

I know with more executors / memory per executor it will work, we actually have 
a bunch of experiments we ran with various setups. I'm just trying to confirm 
that limits we are hitting are right, or there are some other configuration 
parameters we didn't try yet which would move the limits further. Since without 
any tuning limits for what we can run were much worse off.

Errors would be various executors lost: after heartbeat timeout of 10 minutes, 
out of memory errors or job just not making any progress (not completing any 
tasks) for many hours after which we'd kill them.

Maja

From: Deepak Goel <deic...@gmail.com>
Date: Wednesday, June 15, 2016 at 7:13 PM
To: Maja Kabiljo <majakabi...@fb.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: GraphX performance and settings


I am not an expert but some thoughts inline

On Jun 16, 2016 6:31 AM, "Maja Kabiljo" 
<majakabi...@fb.com> wrote:
>
> Hi,
>
> We are running some experiments with GraphX in order to compare it with other 
> systems. There are multiple settings which significantly affect performance, 
> and we experimented a lot in order to tune them well. I'll share here what 
> are the best we found so far and which results we got with them, and would 
> really appreciate if anyone who used GraphX before has any advice on what 
> else can make it even better, or confirm that these results are as good as it 
> gets.
>
> Algorithms we used are pagerank and connected components. We used Twitter and 
> UK graphs from the GraphX paper 
> (https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf),
>  and also generated graphs with properties similar to Facebook social graph 
> with various number of edges. Apart from performance we tried to see what is 
> the minimum amount of resources it requires in order to handle graph of some 
> size.
>
> We ran experiments using Spark 1.6.1, on machines which have 20 cores with 
> 2-way SMT, always fixing number of executors (min=max=initial), giving 40GB 
> or 80GB per executor, and making sure we run only a single executor per 
> machine.

***Deepak***
I guess you have 16 machines in your test. Is that right?
**Deepak***

Additionally we used:
> spark.shuffle.manager=hash, spark.shuffle.service.enabled=false
> Parallel GC
> PartitionStrategy.EdgePartition2D
> 8*numberOfExecutors partitions
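For concreteness, a minimal Scala sketch of how settings like these are
typically wired together; the app name, input path, and executor count are
placeholders, not values from this thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

val numExecutors = 16  // placeholder
val conf = new SparkConf()
  .setAppName("graphx-benchmark")  // placeholder
  .set("spark.shuffle.manager", "hash")
  .set("spark.shuffle.service.enabled", "false")
  .set("spark.executor.extraJavaOptions", "-XX:+UseParallelGC")  // Parallel GC
val sc = new SparkContext(conf)

// 8 * numberOfExecutors edge partitions, partitioned with EdgePartition2D
val graph = GraphLoader
  .edgeListFile(sc, "hdfs:///path/to/edges", numEdgePartitions = 8 * numExecutors)
  .partitionBy(PartitionStrategy.EdgePartition2D)
  .cache()

val pr = graph.staticPageRank(20).vertices     // 20 iterations of pagerank
val cc = graph.connectedComponents().vertices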
> Here are some data points which we got:
> Running on Facebook-like graph with 2 billion edges, using 4 executors with 
> 80GB each it took 451 seconds to do 20 iterations of pagerank and 236 seconds 
> to find connected components. It failed when we tried to use 2 executors, or 
> 4 executors with 40GB each.
> For graph with 10 billion edges we needed 16 executors with 80GB each (it
> failed with 8), 1041 seconds for 20 iterations of pagerank and 716 seconds
> for connected components.

**Deepak*
The executors are not scaling linearly. You should need at most 10 executors.
Also, what is the error it is showing with 8 executors?
*Deepak**

> Twitter-2010 graph (1.5 billion edges), 8 executors, 40GB each, pagerank 
> 473s, connected components 264s. With 4 executors 80GB each it worked but was 
> struggling (pr 2475s, cc 4499s), with 8 executors 80GB pr 362s, cc 255s.

*Deepak*
For 4 executors, can you try with 160GB? Also, if you could spell out the system
statistics during the test it would be great. My guess is that with 4 executors a
lot of spilling is happening.
*Deepak***

> One more thing, we were not able to reproduce what's mentioned in the paper 
> about fault tolerance (section 5.2). If we kill an executor during first few 
> iterations it recovers successfully, but if killed in later iterations 
> reconstruction of each iteration starts taking exponentially longer and 
> doesn't finish after letting it run for a few hours. Are there some 
> additional parameters which we need to set in order for this to work?
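One pattern that can bound this recomputation cost is to checkpoint the graph
every few iterations so the RDD lineage is truncated and recovery replays at
most a few steps rather than the whole history. A minimal sketch, assuming a
checkpoint directory on reliable storage and a per-iteration step() function
(both placeholders):

sc.setCheckpointDir("hdfs:///tmp/graphx-checkpoints")  // assumption: HDFS available
var g = initialGraph                 // placeholder: graph built elsewhere
for (i <- 1 to numIterations) {
  val next = step(g).cache()         // placeholder: one iteration of the algorithm
  if (i % 10 == 0) {
    next.checkpoint()                // truncate lineage every 10 iterations
    next.vertices.count()            // force materialization of the checkpoint
  }
  g.unpersist(blocking = false)
  g = next
}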
>
> Any feedback would be highly appreciated!
>
> Thank you,
> Maja


Re: GraphX performance and settings

2016-06-15 Thread Deepak Goel
I am not an expert but some thoughts inline

On Jun 16, 2016 6:31 AM, "Maja Kabiljo"  wrote:
>
> Hi,
>
> We are running some experiments with GraphX in order to compare it with
other systems. There are multiple settings which significantly affect
performance, and we experimented a lot in order to tune them well. I’ll
share here what are the best we found so far and which results we got with
them, and would really appreciate if anyone who used GraphX before has any
advice on what else can make it even better, or confirm that these results
are as good as it gets.
>
> Algorithms we used are pagerank and connected components. We used Twitter
and UK graphs from the GraphX paper (
https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf), and
also generated graphs with properties similar to Facebook social graph with
various number of edges. Apart from performance we tried to see what is the
minimum amount of resources it requires in order to handle graph of some
size.
>
> We ran experiments using Spark 1.6.1, on machines which have 20 cores
with 2-way SMT, always fixing number of executors (min=max=initial), giving
40GB or 80GB per executor, and making sure we run only a single executor
per machine.

***Deepak***
I guess you have 16 machines in your test. Is that right?
**Deepak***

Additionally we used:
> spark.shuffle.manager=hash, spark.shuffle.service.enabled=false
> Parallel GC
> PartitionStrategy.EdgePartition2D
> 8*numberOfExecutors partitions
> Here are some data points which we got:
> Running on Facebook-like graph with 2 billion edges, using 4 executors
with 80GB each it took 451 seconds to do 20 iterations of pagerank and 236
seconds to find connected components. It failed when we tried to use 2
executors, or 4 executors with 40GB each.
> For graph with 10 billion edges we needed 16 executors with 80GB each (it
failed with 8), 1041 seconds for 20 iterations of pagerank and 716 seconds
for connected components.

**Deepak*
The executors are not scaling linearly. You should need at most 10
executors. Also, what is the error it is showing with 8 executors?
*Deepak**

> Twitter-2010 graph (1.5 billion edges), 8 executors, 40GB each, pagerank
473s, connected components 264s. With 4 executors 80GB each it worked but
was struggling (pr 2475s, cc 4499s), with 8 executors 80GB pr 362s, cc 255s.

*Deepak*
For 4 executors, can you try with 160GB? Also, if you could spell out the
system statistics during the test it would be great. My guess is that with 4
executors a lot of spilling is happening.
*Deepak***

> One more thing, we were not able to reproduce what’s mentioned in the
paper about fault tolerance (section 5.2). If we kill an executor during
first few iterations it recovers successfully, but if killed in later
iterations reconstruction of each iteration starts taking exponentially
longer and doesn’t finish after letting it run for a few hours. Are there
some additional parameters which we need to set in order for this to work?
>
> Any feedback would be highly appreciated!
>
> Thank you,
> Maja


RE: GraphX Java API

2016-06-08 Thread Felix Cheung
You might want to check out GraphFrames
graphframes.github.io





On Sun, Jun 5, 2016 at 6:40 PM -0700, "Santoshakhilesh" 
<santosh.akhil...@huawei.com> wrote:





Ok, thanks for letting me know. Yes, since Java and Scala programs ultimately
run on the JVM, the APIs written in one language can be called from the other.
When I used GraphX (around the beginning of 2015) the native Java APIs were not
available for GraphX.
So I chose to develop my application in Scala, and it turned out much simpler to
develop in Scala due to some of its powerful functions like lambdas, map,
filter etc., which were not available to me in Java 7.
Regards,
Santosh Akhilesh

From: Sonal Goyal [mailto:sonalgoy...@gmail.com]
Sent: 01 June 2016 00:56
To: Santoshakhilesh
Cc: Kumar, Abhishek (US - Bengaluru); user@spark.apache.org; Golatkar, Jayesh 
(US - Bengaluru); Soni, Akhil Dharamprakash (US - Bengaluru); Matta, Rishul (US 
- Bengaluru); Aich, Risha (US - Bengaluru); Kumar, Rajinish (US - Bengaluru); 
Jain, Isha (US - Bengaluru); Kumar, Sandeep (US - Bengaluru)
Subject: Re: GraphX Java API

It's very much possible to use GraphX through Java, though some boilerplate may
be needed. Here is an example.

Create a graph from edge and vertex RDD (JavaRDD<Tuple2<Object, Long>> 
vertices, JavaRDD<Edge> edges )


ClassTag<Long> longTag = scala.reflect.ClassTag$.MODULE$.apply(Long.class);
ClassTag<Float> floatTag = scala.reflect.ClassTag$.MODULE$.apply(Float.class);
Graph<Long, Float> graph = Graph.apply(vertices.rdd(),
edges.rdd(), 0L,
StorageLevel.MEMORY_ONLY(), StorageLevel.MEMORY_ONLY(),
longTag, floatTag); // the edge ClassTag must match the ED type (Float)



Then basically you can call graph.ops() and do available operations like 
triangleCounting etc,

Best Regards,
Sonal
Founder, Nube Technologies<http://www.nubetech.co>
Reifier at Strata Hadoop World<https://www.youtube.com/watch?v=eD3LkpPQIgM>
Reifier at Spark Summit 
2015<https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>




On Tue, May 31, 2016 at 11:40 AM, Santoshakhilesh
<santosh.akhil...@huawei.com> wrote:
Hi ,
Scala has similar package structure as java and finally it runs on JVM so 
probably you get an impression that its in Java.
As far as I know there are no Java API for GraphX. I had used GraphX last year 
and at that time I had to code in Scala to use the GraphX APIs.
Regards,
Santosh Akhilesh


From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 30 May 2016 13:24
To: Santoshakhilesh; user@spark.apache.org
Cc: Golatkar, Jayesh (US - Bengaluru); Soni, Akhil Dharamprakash (US - 
Bengaluru); Matta, Rishul (US - Bengaluru); Aich, Risha (US - Bengaluru); 
Kumar, Rajinish (US - Bengaluru); Jain, Isha (US - Bengaluru); Kumar, Sandeep 
(US - Bengaluru)
Subject: RE: GraphX Java API

Hey,
•   I see some graphx packages listed here:
http://spark.apache.org/docs/latest/api/java/index.html
•   org.apache.spark.graphx<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/package-frame.html>
•   org.apache.spark.graphx.impl<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/impl/package-frame.html>
•   org.apache.spark.graphx.lib<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/lib/package-frame.html>
•   org.apache.spark.graphx.util<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/util/package-frame.html>
Aren’t they meant to be used with JAVA?
Thanks

From: Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
Sent: Friday, May 27, 2016 4:52 PM
To: Kumar, Abhishek (US - Bengaluru) <abhishekkuma...@deloitte.com>;
user@spark.apache.org
Subject: RE: GraphX Java API

GraphX APis are available only in Scala. If you need to use GraphX you need to 
switch to Scala.

From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 27 May 2016 19:59
To: user@spark.apache.org
Subject: GraphX Java API

Hi,

We are trying to consume the Java API for GraphX, but there is no documentation 
available online on the usage or examples. It would be great if we could get 
some examples in Java.

Thanks and regards,

Abhishek Kumar







RE: GraphX Java API

2016-06-05 Thread Santoshakhilesh
Ok, thanks for letting me know. Yes, since Java and Scala programs ultimately
run on the JVM, the APIs written in one language can be called from the other.
When I used GraphX (around the beginning of 2015) the native Java APIs were not
available for GraphX.
So I chose to develop my application in Scala, and it turned out much simpler to
develop in Scala due to some of its powerful functions like lambdas, map,
filter etc., which were not available to me in Java 7.
Regards,
Santosh Akhilesh

From: Sonal Goyal [mailto:sonalgoy...@gmail.com]
Sent: 01 June 2016 00:56
To: Santoshakhilesh
Cc: Kumar, Abhishek (US - Bengaluru); user@spark.apache.org; Golatkar, Jayesh 
(US - Bengaluru); Soni, Akhil Dharamprakash (US - Bengaluru); Matta, Rishul (US 
- Bengaluru); Aich, Risha (US - Bengaluru); Kumar, Rajinish (US - Bengaluru); 
Jain, Isha (US - Bengaluru); Kumar, Sandeep (US - Bengaluru)
Subject: Re: GraphX Java API

It's very much possible to use GraphX through Java, though some boilerplate may
be needed. Here is an example.

Create a graph from edge and vertex RDD (JavaRDD<Tuple2<Object, Long>> 
vertices, JavaRDD<Edge> edges )


ClassTag<Long> longTag = scala.reflect.ClassTag$.MODULE$.apply(Long.class);
ClassTag<Float> floatTag = scala.reflect.ClassTag$.MODULE$.apply(Float.class);
Graph<Long, Float> graph = Graph.apply(vertices.rdd(),
edges.rdd(), 0L,
StorageLevel.MEMORY_ONLY(), StorageLevel.MEMORY_ONLY(),
longTag, floatTag); // the edge ClassTag must match the ED type (Float)



Then basically you can call graph.ops() and do available operations like 
triangleCounting etc,

Best Regards,
Sonal
Founder, Nube Technologies<http://www.nubetech.co>
Reifier at Strata Hadoop World<https://www.youtube.com/watch?v=eD3LkpPQIgM>
Reifier at Spark Summit 
2015<https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>




On Tue, May 31, 2016 at 11:40 AM, Santoshakhilesh
<santosh.akhil...@huawei.com> wrote:
Hi ,
Scala has similar package structure as java and finally it runs on JVM so 
probably you get an impression that its in Java.
As far as I know there are no Java API for GraphX. I had used GraphX last year 
and at that time I had to code in Scala to use the GraphX APIs.
Regards,
Santosh Akhilesh


From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 30 May 2016 13:24
To: Santoshakhilesh; user@spark.apache.org
Cc: Golatkar, Jayesh (US - Bengaluru); Soni, Akhil Dharamprakash (US - 
Bengaluru); Matta, Rishul (US - Bengaluru); Aich, Risha (US - Bengaluru); 
Kumar, Rajinish (US - Bengaluru); Jain, Isha (US - Bengaluru); Kumar, Sandeep 
(US - Bengaluru)
Subject: RE: GraphX Java API

Hey,
•   I see some graphx packages listed here:
http://spark.apache.org/docs/latest/api/java/index.html
•   org.apache.spark.graphx<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/package-frame.html>
•   org.apache.spark.graphx.impl<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/impl/package-frame.html>
•   org.apache.spark.graphx.lib<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/lib/package-frame.html>
•   org.apache.spark.graphx.util<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/util/package-frame.html>
Aren’t they meant to be used with JAVA?
Thanks

From: Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
Sent: Friday, May 27, 2016 4:52 PM
To: Kumar, Abhishek (US - Bengaluru) <abhishekkuma...@deloitte.com>;
user@spark.apache.org
Subject: RE: GraphX Java API

GraphX APis are available only in Scala. If you need to use GraphX you need to 
switch to Scala.

From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 27 May 2016 19:59
To: user@spark.apache.org
Subject: GraphX Java API

Hi,

We are trying to consume the Java API for GraphX, but there is no documentation 
available online on the usage or examples. It would be great if we could get 
some examples in Java.

Thanks and regards,

Abhishek Kumar







Re: GraphX Java API

2016-05-31 Thread Sonal Goyal
It's very much possible to use GraphX through Java, though some boilerplate
may be needed. Here is an example.

Create a graph from edge and vertex RDD (JavaRDD<Tuple2<Object, Long>>
vertices, JavaRDD<Edge> edges )


ClassTag<Long> longTag = scala.reflect.ClassTag$.MODULE$.apply(Long.class);
ClassTag<Float> floatTag = scala.reflect.ClassTag$.MODULE$.apply(Float.class);
Graph<Long, Float> graph = Graph.apply(vertices.rdd(),
edges.rdd(), 0L, StorageLevel.MEMORY_ONLY(), StorageLevel.MEMORY_ONLY(),
longTag, floatTag); // the edge ClassTag must match the ED type (Float)



Then basically you can call graph.ops() and do available operations like
triangleCounting etc,

Best Regards,
Sonal
Founder, Nube Technologies <http://www.nubetech.co>
Reifier at Strata Hadoop World <https://www.youtube.com/watch?v=eD3LkpPQIgM>
Reifier at Spark Summit 2015
<https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>

<http://in.linkedin.com/in/sonalgoyal>



On Tue, May 31, 2016 at 11:40 AM, Santoshakhilesh <
santosh.akhil...@huawei.com> wrote:

> Hi ,
>
> Scala has similar package structure as java and finally it runs on JVM so
> probably you get an impression that its in Java.
>
> As far as I know there are no Java API for GraphX. I had used GraphX last
> year and at that time I had to code in Scala to use the GraphX APIs.
>
> Regards,
> Santosh Akhilesh
>
>
>
>
>
> *From:* Kumar, Abhishek (US - Bengaluru) [mailto:
> abhishekkuma...@deloitte.com]
> *Sent:* 30 May 2016 13:24
> *To:* Santoshakhilesh; user@spark.apache.org
> *Cc:* Golatkar, Jayesh (US - Bengaluru); Soni, Akhil Dharamprakash (US -
> Bengaluru); Matta, Rishul (US - Bengaluru); Aich, Risha (US - Bengaluru);
> Kumar, Rajinish (US - Bengaluru); Jain, Isha (US - Bengaluru); Kumar,
> Sandeep (US - Bengaluru)
> *Subject:* RE: GraphX Java API
>
>
>
> Hey,
>
> ·   I see some graphx packages listed here:
>
> http://spark.apache.org/docs/latest/api/java/index.html
>
> ·   org.apache.spark.graphx
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/package-frame.html>
>
> ·   org.apache.spark.graphx.impl
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/impl/package-frame.html>
>
> ·   org.apache.spark.graphx.lib
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/lib/package-frame.html>
>
> ·   org.apache.spark.graphx.util
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/util/package-frame.html>
>
> Aren’t they meant to be used with JAVA?
>
> Thanks
>
>
>
> *From:* Santoshakhilesh [mailto:santosh.akhil...@huawei.com
> <santosh.akhil...@huawei.com>]
> *Sent:* Friday, May 27, 2016 4:52 PM
> *To:* Kumar, Abhishek (US - Bengaluru) <abhishekkuma...@deloitte.com>;
> user@spark.apache.org
> *Subject:* RE: GraphX Java API
>
>
>
> GraphX APis are available only in Scala. If you need to use GraphX you
> need to switch to Scala.
>
>
>
> *From:* Kumar, Abhishek (US - Bengaluru) [
> mailto:abhishekkuma...@deloitte.com <abhishekkuma...@deloitte.com>]
> *Sent:* 27 May 2016 19:59
> *To:* user@spark.apache.org
> *Subject:* GraphX Java API
>
>
>
> Hi,
>
>
>
> We are trying to consume the Java API for GraphX, but there is no
> documentation available online on the usage or examples. It would be great
> if we could get some examples in Java.
>
>
>
> Thanks and regards,
>
>
>
> *Abhishek Kumar*
>
>
>
>
>
>
>


RE: GraphX Java API

2016-05-31 Thread Santoshakhilesh
Hi ,
Scala has similar package structure as java and finally it runs on JVM so 
probably you get an impression that its in Java.
As far as I know there are no Java API for GraphX. I had used GraphX last year 
and at that time I had to code in Scala to use the GraphX APIs.
Regards,
Santosh Akhilesh


From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 30 May 2016 13:24
To: Santoshakhilesh; user@spark.apache.org
Cc: Golatkar, Jayesh (US - Bengaluru); Soni, Akhil Dharamprakash (US - 
Bengaluru); Matta, Rishul (US - Bengaluru); Aich, Risha (US - Bengaluru); 
Kumar, Rajinish (US - Bengaluru); Jain, Isha (US - Bengaluru); Kumar, Sandeep 
(US - Bengaluru)
Subject: RE: GraphX Java API

Hey,
•   I see some graphx packages listed here:
http://spark.apache.org/docs/latest/api/java/index.html
•   org.apache.spark.graphx<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/package-frame.html>
•   org.apache.spark.graphx.impl<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/impl/package-frame.html>
•   org.apache.spark.graphx.lib<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/lib/package-frame.html>
•   org.apache.spark.graphx.util<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/util/package-frame.html>
Aren’t they meant to be used with JAVA?
Thanks

From: Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
Sent: Friday, May 27, 2016 4:52 PM
To: Kumar, Abhishek (US - Bengaluru) <abhishekkuma...@deloitte.com>;
user@spark.apache.org
Subject: RE: GraphX Java API

GraphX APis are available only in Scala. If you need to use GraphX you need to 
switch to Scala.

From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 27 May 2016 19:59
To: user@spark.apache.org
Subject: GraphX Java API

Hi,

We are trying to consume the Java API for GraphX, but there is no documentation 
available online on the usage or examples. It would be great if we could get 
some examples in Java.

Thanks and regards,

Abhishek Kumar







Re: GraphX Java API

2016-05-30 Thread Chris Fregly
btw, GraphX in Action is one of the better books out on Spark.

Michael did a great job with this one.  He even breaks down snippets of
Scala for newbies to understand the seemingly-arbitrary syntax.  I learned
quite a bit about not only Spark, but also Scala.

And of course, we shouldn't forget about Sean's Advanced Analytics with
Spark which, of course, is a classic that I still reference regularly.  :)

On Mon, May 30, 2016 at 7:42 AM, Michael Malak <
michaelma...@yahoo.com.invalid> wrote:

> Yes, it is possible to use GraphX from Java but it requires 10x the amount
> of code and involves using obscure typing and pre-defined lambda prototype
> facilities. I give an example of it in my book, the source code for which
> can be downloaded for free from
> https://www.manning.com/books/spark-graphx-in-action The relevant example
> is EdgeCount.java in chapter 10.
>
> As I suggest in my book, likely the only reason you'd want to put yourself
> through that torture is corporate mandate or compatibility with Java
> bytecode tools.
>
>
> --
> *From:* Sean Owen <so...@cloudera.com>
> *To:* Takeshi Yamamuro <linguin@gmail.com>; "Kumar, Abhishek (US -
> Bengaluru)" <abhishekkuma...@deloitte.com>
> *Cc:* "user@spark.apache.org" <user@spark.apache.org>
> *Sent:* Monday, May 30, 2016 7:07 AM
> *Subject:* Re: GraphX Java API
>
> No, you can call any Scala API in Java. It is somewhat less convenient if
> the method was not written with Java in mind but does work.
>
> On Mon, May 30, 2016, 00:32 Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
> These package are used only for Scala.
>
> On Mon, May 30, 2016 at 2:23 PM, Kumar, Abhishek (US - Bengaluru) <
> abhishekkuma...@deloitte.com> wrote:
>
> Hey,
> ·   I see some graphx packages listed here:
> http://spark.apache.org/docs/latest/api/java/index.html
> ·   org.apache.spark.graphx
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/package-frame.html>
> ·   org.apache.spark.graphx.impl
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/impl/package-frame.html>
> ·   org.apache.spark.graphx.lib
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/lib/package-frame.html>
> ·   org.apache.spark.graphx.util
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/util/package-frame.html>
> Aren’t they meant to be used with JAVA?
> Thanks
>
> *From:* Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
> *Sent:* Friday, May 27, 2016 4:52 PM
> *To:* Kumar, Abhishek (US - Bengaluru) <abhishekkuma...@deloitte.com>;
> user@spark.apache.org
> *Subject:* RE: GraphX Java API
>
> GraphX APis are available only in Scala. If you need to use GraphX you
> need to switch to Scala.
>
> *From:* Kumar, Abhishek (US - Bengaluru) [
> mailto:abhishekkuma...@deloitte.com <abhishekkuma...@deloitte.com>]
> *Sent:* 27 May 2016 19:59
> *To:* user@spark.apache.org
> *Subject:* GraphX Java API
>
> Hi,
>
> We are trying to consume the Java API for GraphX, but there is no
> documentation available online on the usage or examples. It would be great
> if we could get some examples in Java.
>
> Thanks and regards,
>
> *Abhishek Kumar*
>
>
>
>
>
>
>
>
>
>
>
> --
> ---
> Takeshi Yamamuro
>
>
>
>


-- 
*Chris Fregly*
Research Scientist @ Flux Capacitor AI
"Bringing AI Back to the Future!"
San Francisco, CA
http://fluxcapacitor.ai


Re: GraphX Java API

2016-05-30 Thread Michael Malak
Yes, it is possible to use GraphX from Java but it requires 10x the amount of 
code and involves using obscure typing and pre-defined lambda prototype 
facilities. I give an example of it in my book, the source code for which can 
be downloaded for free from 
https://www.manning.com/books/spark-graphx-in-action The relevant example is 
EdgeCount.java in chapter 10.
As I suggest in my book, likely the only reason you'd want to put yourself 
through that torture is corporate mandate or compatibility with Java bytecode 
tools.

  From: Sean Owen <so...@cloudera.com>
 To: Takeshi Yamamuro <linguin@gmail.com>; "Kumar, Abhishek (US - 
Bengaluru)" <abhishekkuma...@deloitte.com> 
Cc: "user@spark.apache.org" <user@spark.apache.org>
 Sent: Monday, May 30, 2016 7:07 AM
 Subject: Re: GraphX Java API
   
No, you can call any Scala API in Java. It is somewhat less convenient if the 
method was not written with Java in mind but does work. 

On Mon, May 30, 2016, 00:32 Takeshi Yamamuro <linguin@gmail.com> wrote:

These package are used only for Scala.
On Mon, May 30, 2016 at 2:23 PM, Kumar, Abhishek (US - Bengaluru) 
<abhishekkuma...@deloitte.com> wrote:

Hey,
· I see some graphx packages listed here:
http://spark.apache.org/docs/latest/api/java/index.html
· org.apache.spark.graphx
· org.apache.spark.graphx.impl
· org.apache.spark.graphx.lib
· org.apache.spark.graphx.util
Aren’t they meant to be used with JAVA?
Thanks

From: Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
Sent: Friday, May 27, 2016 4:52 PM
To: Kumar, Abhishek (US - Bengaluru) <abhishekkuma...@deloitte.com>;
user@spark.apache.org
Subject: RE: GraphX Java API

GraphX APIs are available only in Scala. If you need to use GraphX you need
to switch to Scala.

From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 27 May 2016 19:59
To: user@spark.apache.org
Subject: GraphX Java API

Hi,

We are trying to consume the Java API for GraphX, but there is no
documentation available online on the usage or examples. It would be great
if we could get some examples in Java.

Thanks and regards,
Abhishek Kumar



-- 
---
Takeshi Yamamuro



  

Re: GraphX Java API

2016-05-30 Thread Sean Owen
No, you can call any Scala API in Java. It is somewhat less convenient if
the method was not written with Java in mind but does work.

On Mon, May 30, 2016, 00:32 Takeshi Yamamuro <linguin@gmail.com> wrote:

> These package are used only for Scala.
>
> On Mon, May 30, 2016 at 2:23 PM, Kumar, Abhishek (US - Bengaluru) <
> abhishekkuma...@deloitte.com> wrote:
>
>> Hey,
>>
>> ·   I see some graphx packages listed here:
>>
>> http://spark.apache.org/docs/latest/api/java/index.html
>>
>> ·   org.apache.spark.graphx
>> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/package-frame.html>
>>
>> ·   org.apache.spark.graphx.impl
>> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/impl/package-frame.html>
>>
>> ·   org.apache.spark.graphx.lib
>> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/lib/package-frame.html>
>>
>> ·   org.apache.spark.graphx.util
>> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/util/package-frame.html>
>>
>> Aren’t they meant to be used with JAVA?
>>
>> Thanks
>>
>>
>>
>> *From:* Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
>> *Sent:* Friday, May 27, 2016 4:52 PM
>> *To:* Kumar, Abhishek (US - Bengaluru) <abhishekkuma...@deloitte.com>;
>> user@spark.apache.org
>> *Subject:* RE: GraphX Java API
>>
>>
>>
>> GraphX APis are available only in Scala. If you need to use GraphX you
>> need to switch to Scala.
>>
>>
>>
>> *From:* Kumar, Abhishek (US - Bengaluru) [
>> mailto:abhishekkuma...@deloitte.com <abhishekkuma...@deloitte.com>]
>> *Sent:* 27 May 2016 19:59
>> *To:* user@spark.apache.org
>> *Subject:* GraphX Java API
>>
>>
>>
>> Hi,
>>
>>
>>
>> We are trying to consume the Java API for GraphX, but there is no
>> documentation available online on the usage or examples. It would be great
>> if we could get some examples in Java.
>>
>>
>>
>> Thanks and regards,
>>
>>
>>
>> *Abhishek Kumar*
>>
>>
>>
>>
>>
>>
>>
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: GraphX Java API

2016-05-29 Thread Takeshi Yamamuro
These packages are used only for Scala.

On Mon, May 30, 2016 at 2:23 PM, Kumar, Abhishek (US - Bengaluru) <
abhishekkuma...@deloitte.com> wrote:

> Hey,
>
> ·   I see some graphx packages listed here:
>
> http://spark.apache.org/docs/latest/api/java/index.html
>
> ·   org.apache.spark.graphx
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/package-frame.html>
>
> ·   org.apache.spark.graphx.impl
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/impl/package-frame.html>
>
> ·   org.apache.spark.graphx.lib
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/lib/package-frame.html>
>
> ·   org.apache.spark.graphx.util
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/util/package-frame.html>
>
> Aren’t they meant to be used with JAVA?
>
> Thanks
>
>
>
> *From:* Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
> *Sent:* Friday, May 27, 2016 4:52 PM
> *To:* Kumar, Abhishek (US - Bengaluru) <abhishekkuma...@deloitte.com>;
> user@spark.apache.org
> *Subject:* RE: GraphX Java API
>
>
>
> GraphX APis are available only in Scala. If you need to use GraphX you
> need to switch to Scala.
>
>
>
> *From:* Kumar, Abhishek (US - Bengaluru) [
> mailto:abhishekkuma...@deloitte.com <abhishekkuma...@deloitte.com>]
> *Sent:* 27 May 2016 19:59
> *To:* user@spark.apache.org
> *Subject:* GraphX Java API
>
>
>
> Hi,
>
>
>
> We are trying to consume the Java API for GraphX, but there is no
> documentation available online on the usage or examples. It would be great
> if we could get some examples in Java.
>
>
>
> Thanks and regards,
>
>
>
> *Abhishek Kumar*
>
>
>
>
>
>
>
>



-- 
---
Takeshi Yamamuro


RE: GraphX Java API

2016-05-29 Thread Kumar, Abhishek (US - Bengaluru)
Hey,
•   I see some graphx packages listed here:
http://spark.apache.org/docs/latest/api/java/index.html
•   org.apache.spark.graphx<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/package-frame.html>
•   org.apache.spark.graphx.impl<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/impl/package-frame.html>
•   org.apache.spark.graphx.lib<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/lib/package-frame.html>
•   org.apache.spark.graphx.util<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/util/package-frame.html>
Aren’t they meant to be used with JAVA?
Thanks

From: Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
Sent: Friday, May 27, 2016 4:52 PM
To: Kumar, Abhishek (US - Bengaluru) <abhishekkuma...@deloitte.com>; 
user@spark.apache.org
Subject: RE: GraphX Java API

GraphX APis are available only in Scala. If you need to use GraphX you need to 
switch to Scala.

From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 27 May 2016 19:59
To: user@spark.apache.org
Subject: GraphX Java API

Hi,

We are trying to consume the Java API for GraphX, but there is no documentation 
available online on the usage or examples. It would be great if we could get 
some examples in Java.

Thanks and regards,

Abhishek Kumar








Re: GraphX Java API

2016-05-29 Thread Jules Damji
Also, this blog talks about GraphFrames' implementation of some GraphX
algorithms, accessible from Java, Scala, and Python:

https://databricks.com/blog/2016/03/03/introducing-graphframes.html
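For a flavor of that API, a minimal GraphFrames sketch in Scala; it assumes the
graphframes package is on the classpath and a SQLContext in scope, and the data
is purely illustrative:

import org.graphframes.GraphFrame

// Vertices need an "id" column; edges need "src" and "dst" columns.
val v = sqlContext.createDataFrame(Seq((1L, "a"), (2L, "b"), (3L, "c"))).toDF("id", "name")
val e = sqlContext.createDataFrame(Seq((1L, 2L), (2L, 3L))).toDF("src", "dst")
val g = GraphFrame(v, e)

val ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()
ranks.vertices.select("id", "pagerank").show()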

Cheers 
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)

> On May 29, 2016, at 12:24 AM, Takeshi Yamamuro  wrote:
> 
> Hi,
> 
> Have you checked GraphFrame?
> See the related discussion: See 
> https://issues.apache.org/jira/browse/SPARK-3665
> 
> // maropu
> 
>> On Fri, May 27, 2016 at 8:22 PM, Santoshakhilesh 
>>  wrote:
>> GraphX APis are available only in Scala. If you need to use GraphX you need 
>> to switch to Scala.
>> 
>>  
>> 
>> From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com] 
>> Sent: 27 May 2016 19:59
>> To: user@spark.apache.org
>> Subject: GraphX Java API
>> 
>>  
>> 
>> Hi,
>> 
>>  
>> 
>> We are trying to consume the Java API for GraphX, but there is no 
>> documentation available online on the usage or examples. It would be great 
>> if we could get some examples in Java.
>> 
>>  
>> 
>> Thanks and regards,
>> 
>>  
>> 
>> Abhishek Kumar
>> 
>> Products & Services | iLab
>> 
>> Deloitte Consulting LLP
>> 
>> Block ‘C’, Divyasree Technopolis, Survey No.: 123 & 132/2, Yemlur Post, 
>> Yemlur, Bengaluru – 560037, Karnataka, India
>> 
>> Mobile: +91 7736795770
>> 
>> abhishekkuma...@deloitte.com | www.deloitte.com
>> 
>>  
>> 
>> Please consider the environment before printing.
>> 
>>  
>> 
>>  
>> 
>>  
>> 
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro


Re: GraphX Java API

2016-05-29 Thread Takeshi Yamamuro
Hi,

Have you checked GraphFrame?
See the related discussion:
https://issues.apache.org/jira/browse/SPARK-3665

// maropu

On Fri, May 27, 2016 at 8:22 PM, Santoshakhilesh <
santosh.akhil...@huawei.com> wrote:

> GraphX APis are available only in Scala. If you need to use GraphX you
> need to switch to Scala.
>
>
>
> *From:* Kumar, Abhishek (US - Bengaluru) [mailto:
> abhishekkuma...@deloitte.com]
> *Sent:* 27 May 2016 19:59
> *To:* user@spark.apache.org
> *Subject:* GraphX Java API
>
>
>
> Hi,
>
>
>
> We are trying to consume the Java API for GraphX, but there is no
> documentation available online on the usage or examples. It would be great
> if we could get some examples in Java.
>
>
>
> Thanks and regards,
>
>
>
> *Abhishek Kumar*
>
> Products & Services | iLab
>
> Deloitte Consulting LLP
>
> Block ‘C’, Divyasree Technopolis, Survey No.: 123 & 132/2, Yemlur Post,
> Yemlur, Bengaluru – 560037, Karnataka, India
>
> Mobile: +91 7736795770
>
> abhishekkuma...@deloitte.com | www.deloitte.com
>
>
>
> Please consider the environment before printing.
>
>
>
>
>
>
>
>



-- 
---
Takeshi Yamamuro


RE: GraphX Java API

2016-05-27 Thread Santoshakhilesh
GraphX APIs are available only in Scala. If you need to use GraphX you need to
switch to Scala.
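For anyone weighing that switch, basic graph construction in Scala is quite
compact; a minimal sketch with illustrative data (assumes a SparkContext sc):

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 1.0)))
val graph = Graph(vertices, edges)   // Graph[String, Double], ClassTags inferred

val cc = graph.connectedComponents().vertices.collect()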

From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 27 May 2016 19:59
To: user@spark.apache.org
Subject: GraphX Java API

Hi,

We are trying to consume the Java API for GraphX, but there is no documentation 
available online on the usage or examples. It would be great if we could get 
some examples in Java.

Thanks and regards,

Abhishek Kumar
Products & Services | iLab
Deloitte Consulting LLP
Block ‘C’, Divyasree Technopolis, Survey No.: 123 & 132/2, Yemlur Post, Yemlur, 
Bengaluru – 560037, Karnataka, India
Mobile: +91 7736795770
abhishekkuma...@deloitte.com | 
www.deloitte.com

Please consider the environment before printing.








Re: Graphx

2016-03-11 Thread Khaled Ammar
This is an interesting discussion,

I have had some success running GraphX on large graphs with more than a
billion edges, using clusters of different sizes up to 64 machines. However,
the performance goes down when I double the cluster size to reach 128
machines of r3.xlarge. Does anyone have experience with very large GraphX
clusters?

@Ovidiu-Cristian, @Alexis and @Alexander, could you please share the
configurations for Spark / GraphX that work best for you?

Thanks,
-Khaled

On Fri, Mar 11, 2016 at 1:25 PM, John Lilley <john.lil...@redpoint.net>
wrote:

> We have almost zero node info – just an identifying integer.
>
> *John Lilley*
>
>
>
> *From:* Alexis Roos [mailto:alexis.r...@gmail.com]
> *Sent:* Friday, March 11, 2016 11:24 AM
> *To:* Alexander Pivovarov <apivova...@gmail.com>
> *Cc:* John Lilley <john.lil...@redpoint.net>; Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr>; lihu <lihu...@gmail.com>; Andrew A <
> andrew.a...@gmail.com>; u...@spark.incubator.apache.org; Geoff Thompson <
> geoff.thomp...@redpoint.net>
> *Subject:* Re: Graphx
>
>
>
> Also we keep the Node info minimal as needed for connected components and
> rejoin later.
>
>
>
> Alexis
>
>
>
> On Fri, Mar 11, 2016 at 10:12 AM, Alexander Pivovarov <
> apivova...@gmail.com> wrote:
>
> we use it in prod
>
>
>
> 70 boxes, 61GB RAM each
>
>
>
> GraphX Connected Components works fine on 250M Vertices and 1B Edges
> (takes about 5-10 min)
>
>
>
> Spark likes memory, so use r3.2xlarge boxes (61GB)
>
> For example 10 x r3.2xlarge (61GB) work much faster than 20 x r3.xlarge
> (30.5 GB) (especially if you have skewed data)
>
>
>
> Also, use checkpoints before and after Connected Components to reduce DAG
> delays
>
>
>
> You can also try to enable Kryo and register classes used in RDD
>
>
>
>
>
> On Fri, Mar 11, 2016 at 8:07 AM, John Lilley <john.lil...@redpoint.net>
> wrote:
>
> I suppose for a 2.6bn case we’d need Long:
>
>
>
> public class GenCCInput {
>
>   public static void main(String[] args) {
>
> if (args.length != 2) {
>
>   System.err.println("Usage: \njava GenCCInput <edges> <groupSize>");
>
>   System.exit(-1);
>
> }
>
> long edges = Long.parseLong(args[0]);
>
> long groupSize = Long.parseLong(args[1]);
>
> long currentEdge = 1;
>
> long currentGroupSize = 0;
>
> for (long i = 0; i < edges; i++) {
>
>   System.out.println(currentEdge + " " + (currentEdge + 1));
>
>   if (currentGroupSize == 0) {
>
> currentGroupSize = 2;
>
>   } else {
>
> currentGroupSize++;
>
>   }
>
>   if (currentGroupSize >= groupSize) {
>
> currentGroupSize = 0;
>
> currentEdge += 2;
>
>   } else {
>
> currentEdge++;
>
>   }
>
> }
>
>   }
>
> }
>
>
>
> *John Lilley*
>
> Chief Architect, RedPoint Global Inc.
>
> T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
>
> Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net
>
>
>
> *From:* John Lilley [mailto:john.lil...@redpoint.net]
> *Sent:* Friday, March 11, 2016 8:46 AM
> *To:* Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>
> *Cc:* lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>;
> u...@spark.incubator.apache.org; Geoff Thompson <
> geoff.thomp...@redpoint.net>
> *Subject:* RE: Graphx
>
>
>
> Ovidiu,
>
>
>
> IMHO, this is one of the biggest issues facing GraphX and Spark.  There
> are a lot of knobs and levers to pull to affect performance, with very
> little guidance about which settings work in general.  We cannot ship
> software that requires end-user tuning; it just has to work.  Unfortunately
> GraphX seems very sensitive to working set size relative to available RAM
> and fails catastrophically as opposed to gracefully when working set is too
> large.  It is also very sensitive to the nature of the data.  For example,
> if we build a test file with input-edge representation like:
>
> 1 2
>
> 2 3
>
> 3 4
>
> 5 6
>
> 6 7
>
> 7 8
>
> …
>
> this represents a graph with connected components in groups of four.  We
> found experimentally that when this data in input in clustered order, the
> required memory is lower and runtime is much faster than when data is input
> in random order.  This makes intuitive sense because of the additional
> communication required for the random order.
>
>
>
> Our 1bn-edge test case was of this same form, input in clustered order, with
> groups of 10 vertices per component.

RE: Graphx

2016-03-11 Thread John Lilley
We have almost zero node info – just an identifying integer.
John Lilley

From: Alexis Roos [mailto:alexis.r...@gmail.com]
Sent: Friday, March 11, 2016 11:24 AM
To: Alexander Pivovarov <apivova...@gmail.com>
Cc: John Lilley <john.lil...@redpoint.net>; Ovidiu-Cristian MARCU 
<ovidiu-cristian.ma...@inria.fr>; lihu <lihu...@gmail.com>; Andrew A 
<andrew.a...@gmail.com>; u...@spark.incubator.apache.org; Geoff Thompson 
<geoff.thomp...@redpoint.net>
Subject: Re: Graphx

Also we keep the Node info minimal as needed for connected components and 
rejoin later.

Alexis

On Fri, Mar 11, 2016 at 10:12 AM, Alexander Pivovarov
<apivova...@gmail.com> wrote:
we use it in prod

70 boxes, 61GB RAM each

GraphX Connected Components works fine on 250M Vertices and 1B Edges (takes 
about 5-10 min)

Spark likes memory, so use r3.2xlarge boxes (61GB)
For example 10 x r3.2xlarge (61GB) work much faster than 20 x r3.xlarge (30.5 
GB) (especially if you have skewed data)

Also, use checkpoints before and after Connected Components to reduce DAG delays

You can also try to enable Kryo and register classes used in RDD
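A sketch of the Kryo suggestion; the registered classes here are examples, and
you would register whatever types actually end up inside your RDDs:

import org.apache.spark.SparkConf
import org.apache.spark.graphx.Edge

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[Edge[Long]],             // example: Long edge attributes
    classOf[Array[Edge[Long]]]
  ))
// GraphX also ships a helper that registers its internal types:
// org.apache.spark.graphx.GraphXUtils.registerKryoClasses(conf)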


On Fri, Mar 11, 2016 at 8:07 AM, John Lilley
<john.lil...@redpoint.net> wrote:
I suppose for a 2.6bn case we’d need Long:

public class GenCCInput {
  public static void main(String[] args) {
    if (args.length != 2) {
      System.err.println("Usage: \njava GenCCInput <edges> <groupSize>");
      System.exit(-1);
    }
    long edges = Long.parseLong(args[0]);
    long groupSize = Long.parseLong(args[1]);
    long currentEdge = 1;
    long currentGroupSize = 0;
    // Emits a chain of edges; after every groupSize vertices the chain is
    // broken (currentEdge skips ahead by 2), so the output is a sequence of
    // connected components of groupSize vertices each.
    for (long i = 0; i < edges; i++) {
      System.out.println(currentEdge + " " + (currentEdge + 1));
      if (currentGroupSize == 0) {
        currentGroupSize = 2;
      } else {
        currentGroupSize++;
      }
      if (currentGroupSize >= groupSize) {
        currentGroupSize = 0;
        currentEdge += 2;
      } else {
        currentEdge++;
      }
    }
  }
}

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: Friday, March 11, 2016 8:46 AM
To: Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>
Cc: lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>;
u...@spark.incubator.apache.org; Geoff Thompson <geoff.thomp...@redpoint.net>
Subject: RE: Graphx

Ovidiu,

IMHO, this is one of the biggest issues facing GraphX and Spark.  There are a 
lot of knobs and levers to pull to affect performance, with very little 
guidance about which settings work in general.  We cannot ship software that 
requires end-user tuning; it just has to work.  Unfortunately GraphX seems very 
sensitive to working set size relative to available RAM and fails 
catastrophically as opposed to gracefully when working set is too large.  It is 
also very sensitive to the nature of the data.  For example, if we build a test 
file with input-edge representation like:
1 2
2 3
3 4
5 6
6 7
7 8
…
this represents a graph with connected components in groups of four.  We found 
experimentally that when this data is input in clustered order, the required
memory is lower and runtime is much faster than when data is input in random 
order.  This makes intuitive sense because of the additional communication 
required for the random order.

Our 1bn-edge test case was of this same form, input in clustered order, with 
groups of 10 vertices per component.  It failed at 8 x 60GB.  This is the kind 
of data that our application processes, so it is a realistic test for us.  I’ve 
found that social media test data sets tend to follow power-law distributions, 
and that GraphX has much less problem with them.

A comparable test scaled to your cluster (16 x 80GB) would be 2.6bn edges in 
10-vertex components using the synthetic test input I describe above.  I would 
be curious to know if this works and what settings you use to succeed, and if 
it continues to succeed for random input order.

As for the C++ algorithm, it scales multi-core.  It exhibits O(N^2) behavior 
for large data sets, but it processes the 1bn-edge case on a single 60GB node 
in about 20 minutes.  It degrades gracefully along the O(N^2) curve and 
additional memory reduces time.

John Lilley

From: Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr]

Re: Graphx

2016-03-11 Thread Alexis Roos
Also we keep the Node info minimal as needed for connected components and
rejoin later.

Alexis

On Fri, Mar 11, 2016 at 10:12 AM, Alexander Pivovarov <apivova...@gmail.com>
wrote:

> we use it in prod
>
> 70 boxes, 61GB RAM each
>
> GraphX Connected Components works fine on 250M Vertices and 1B Edges
> (takes about 5-10 min)
>
> Spark likes memory, so use r3.2xlarge boxes (61GB)
> For example 10 x r3.2xlarge (61GB) work much faster than 20 x r3.xlarge
> (30.5 GB) (especially if you have skewed data)
>
> Also, use checkpoints before and after Connected Components to reduce DAG
> delays
>
> You can also try to enable Kryo and register classes used in RDD
>
>
> On Fri, Mar 11, 2016 at 8:07 AM, John Lilley <john.lil...@redpoint.net>
> wrote:
>
>> I suppose for a 2.6bn case we’d need Long:
>>
>>
>>
>> public class GenCCInput {
>>
>>   public static void main(String[] args) {
>>
>> if (args.length != 2) {
>>
>>   System.err.println("Usage: \njava GenCCInput <edges> <groupSize>");
>>
>>   System.exit(-1);
>>
>> }
>>
>> long edges = Long.parseLong(args[0]);
>>
>> long groupSize = Long.parseLong(args[1]);
>>
>> long currentEdge = 1;
>>
>> long currentGroupSize = 0;
>>
>> for (long i = 0; i < edges; i++) {
>>
>>   System.out.println(currentEdge + " " + (currentEdge + 1));
>>
>>   if (currentGroupSize == 0) {
>>
>> currentGroupSize = 2;
>>
>>   } else {
>>
>> currentGroupSize++;
>>
>>   }
>>
>>   if (currentGroupSize >= groupSize) {
>>
>> currentGroupSize = 0;
>>
>> currentEdge += 2;
>>
>>   } else {
>>
>> currentEdge++;
>>
>>   }
>>
>> }
>>
>>   }
>>
>> }
>>
>>
>>
>> *John Lilley*
>>
>> Chief Architect, RedPoint Global Inc.
>>
>> T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
>>
>> Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net
>>
>>
>>
>> *From:* John Lilley [mailto:john.lil...@redpoint.net]
>> *Sent:* Friday, March 11, 2016 8:46 AM
>> *To:* Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>
>> *Cc:* lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>;
>> u...@spark.incubator.apache.org; Geoff Thompson <
>> geoff.thomp...@redpoint.net>
>> *Subject:* RE: Graphx
>>
>>
>>
>> Ovidiu,
>>
>>
>>
>> IMHO, this is one of the biggest issues facing GraphX and Spark.  There
>> are a lot of knobs and levers to pull to affect performance, with very
>> little guidance about which settings work in general.  We cannot ship
>> software that requires end-user tuning; it just has to work.  Unfortunately
>> GraphX seems very sensitive to working set size relative to available RAM
>> and fails catastrophically as opposed to gracefully when working set is too
>> large.  It is also very sensitive to the nature of the data.  For example,
>> if we build a test file with input-edge representation like:
>>
>> 1 2
>>
>> 2 3
>>
>> 3 4
>>
>> 5 6
>>
>> 6 7
>>
>> 7 8
>>
>> …
>>
>> this represents a graph with connected components in groups of four.  We
>> found experimentally that when this data in input in clustered order, the
>> required memory is lower and runtime is much faster than when data is input
>> in random order.  This makes intuitive sense because of the additional
>> communication required for the random order.
>>
>>
>>
>> Our 1bn-edge test case was of this same form, input in clustered order,
>> with groups of 10 vertices per component.  It failed at 8 x 60GB.  This is
>> the kind of data that our application processes, so it is a realistic test
>> for us.  I’ve found that social media test data sets tend to follow
>> power-law distributions, and that GraphX has much less problem with them.
>>
>>
>>
>> A comparable test scaled to your cluster (16 x 80GB) would be 2.6bn edges
>> in 10-vertex components using the synthetic test input I describe above.  I
>> would be curious to know if this works and what settings you use to
>> succeed, and if it continues to succeed for random input order.
>>
>>
>>
>> As for the C++ algorithm, it scales multi-core.  It exhibits O(N^2)
>> behavior for large data sets, but it processes the 1bn-edge case on a
>> single 60GB node in about 20 minutes.

RE: Graphx

2016-03-11 Thread John Lilley
Thanks Alexander, this is really good information.  However it reinforces that 
we cannot use GraphX, because our customers typically have on-prem clusters in 
the 10-node range.  Very few have the kind of horsepower you are talking about. 
 We can’t just tell them to quadruple their cluster size to run our software on 
1bn edges.

John Lilley

From: Alexander Pivovarov [mailto:apivova...@gmail.com]
Sent: Friday, March 11, 2016 11:13 AM
To: John Lilley <john.lil...@redpoint.net>
Cc: Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>; lihu 
<lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>; 
u...@spark.incubator.apache.org; Geoff Thompson <geoff.thomp...@redpoint.net>
Subject: Re: Graphx

we use it in prod

70 boxes, 61GB RAM each

GraphX Connected Components works fine on 250M Vertices and 1B Edges (takes 
about 5-10 min)

Spark likes memory, so use r3.2xlarge boxes (61GB)
For example 10 x r3.2xlarge (61GB) work much faster than 20 x r3.xlarge (30.5 
GB) (especially if you have skewed data)

Also, use checkpoints before and after Connected Components to reduce DAG delays

You can also try to enable Kryo and register classes used in RDD


On Fri, Mar 11, 2016 at 8:07 AM, John Lilley
<john.lil...@redpoint.net> wrote:
I suppose for a 2.6bn case we’d need Long:

public class GenCCInput {
  public static void main(String[] args) {
    if (args.length != 2) {
      System.err.println("Usage: \njava GenCCInput <edges> <groupSize>");
      System.exit(-1);
    }
    long edges = Long.parseLong(args[0]);
    long groupSize = Long.parseLong(args[1]);
    long currentEdge = 1;
    long currentGroupSize = 0;
    for (long i = 0; i < edges; i++) {
      System.out.println(currentEdge + " " + (currentEdge + 1));
      if (currentGroupSize == 0) {
        currentGroupSize = 2;
      } else {
        currentGroupSize++;
      }
      if (currentGroupSize >= groupSize) {
        currentGroupSize = 0;
        currentEdge += 2;
      } else {
        currentEdge++;
      }
    }
  }
}

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: Friday, March 11, 2016 8:46 AM
To: Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>
Cc: lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>;
u...@spark.incubator.apache.org; Geoff Thompson <geoff.thomp...@redpoint.net>
Subject: RE: Graphx

Ovidiu,

IMHO, this is one of the biggest issues facing GraphX and Spark.  There are a 
lot of knobs and levers to pull to affect performance, with very little 
guidance about which settings work in general.  We cannot ship software that 
requires end-user tuning; it just has to work.  Unfortunately GraphX seems very 
sensitive to working set size relative to available RAM and fails 
catastrophically as opposed to gracefully when working set is too large.  It is 
also very sensitive to the nature of the data.  For example, if we build a test 
file with input-edge representation like:
1 2
2 3
3 4
5 6
6 7
7 8
…
this represents a graph with connected components in groups of four.  We found 
experimentally that when this data is input in clustered order, the required
memory is lower and runtime is much faster than when data is input in random 
order.  This makes intuitive sense because of the additional communication 
required for the random order.

Our 1bn-edge test case was of this same form, input in clustered order, with 
groups of 10 vertices per component.  It failed at 8 x 60GB.  This is the kind 
of data that our application processes, so it is a realistic test for us.  I’ve 
found that social media test data sets tend to follow power-law distributions, 
and that GraphX has much less problem with them.

A comparable test scaled to your cluster (16 x 80GB) would be 2.6bn edges in 
10-vertex components using the synthetic test input I describe above.  I would 
be curious to know if this works and what settings you use to succeed, and if 
it continues to succeed for random input order.

As for the C++ algorithm, it scales multi-core.  It exhibits O(N^2) behavior 
for large data sets, but it processes the 1bn-edge case on a single 60GB node 
in about 20 minutes.  It degrades gracefully along the O(N^2) curve and 
additional memory reduces time.

John Lilley

From: Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr]

Re: Graphx

2016-03-11 Thread Alexander Pivovarov
we use it in prod

70 boxes, 61GB RAM each

GraphX Connected Components works fine on 250M Vertices and 1B Edges (takes
about 5-10 min)

Spark likes memory, so use r3.2xlarge boxes (61GB)
For example 10 x r3.2xlarge (61GB) work much faster than 20 x r3.xlarge
(30.5 GB) (especially if you have skewed data)

Also, use checkpoints before and after Connected Components to reduce DAG
delays

You can also try to enable Kryo and register classes used in RDD
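A sketch of the checkpoint-before-and-after pattern around connected
components suggested above; the paths and input parsing are placeholders:

import org.apache.spark.graphx.Graph

sc.setCheckpointDir("hdfs:///tmp/cc-checkpoints")      // assumption: reliable storage

val edgeTuples = sc.textFile("hdfs:///path/to/edges")  // placeholder input
  .map { line => val a = line.split("\\s+"); (a(0).toLong, a(1).toLong) }

edgeTuples.checkpoint()   // cut lineage before the CC run
edgeTuples.count()        // force materialization

val graph = Graph.fromEdgeTuples(edgeTuples, defaultValue = 1).cache()
val cc = graph.connectedComponents()

cc.vertices.checkpoint()  // cut lineage again after CC
cc.vertices.count()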


On Fri, Mar 11, 2016 at 8:07 AM, John Lilley <john.lil...@redpoint.net>
wrote:

> I suppose for a 2.6bn case we’d need Long:
>
>
>
> public class GenCCInput {
>
>   public static void main(String[] args) {
>
> if (args.length != 2) {
>
>   System.err.println("Usage: \njava GenCCInput <edges> <groupSize>");
>
>   System.exit(-1);
>
> }
>
> long edges = Long.parseLong(args[0]);
>
> long groupSize = Long.parseLong(args[1]);
>
> long currentEdge = 1;
>
> long currentGroupSize = 0;
>
> for (long i = 0; i < edges; i++) {
>
>   System.out.println(currentEdge + " " + (currentEdge + 1));
>
>   if (currentGroupSize == 0) {
>
> currentGroupSize = 2;
>
>   } else {
>
> currentGroupSize++;
>
>   }
>
>   if (currentGroupSize >= groupSize) {
>
> currentGroupSize = 0;
>
> currentEdge += 2;
>
>   } else {
>
> currentEdge++;
>
>   }
>
> }
>
>   }
>
> }
>
>
>
> John Lilley
>
> Chief Architect, RedPoint Global Inc.
>
> T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
>
> Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net
>
>
>
> From: John Lilley [mailto:john.lil...@redpoint.net]
> Sent: Friday, March 11, 2016 8:46 AM
> To: Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>
> Cc: lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>;
> u...@spark.incubator.apache.org; Geoff Thompson <
> geoff.thomp...@redpoint.net>
> Subject: RE: Graphx
>
>
>
> Ovidiu,
>
>
>
> IMHO, this is one of the biggest issues facing GraphX and Spark.  There
> are a lot of knobs and levers to pull to affect performance, with very
> little guidance about which settings work in general.  We cannot ship
> software that requires end-user tuning; it just has to work.  Unfortunately
> GraphX seems very sensitive to working set size relative to available RAM
> and fails catastrophically as opposed to gracefully when working set is too
> large.  It is also very sensitive to the nature of the data.  For example,
> if we build a test file with input-edge representation like:
>
> 1 2
>
> 2 3
>
> 3 4
>
> 5 6
>
> 6 7
>
> 7 8
>
> …
>
> this represents a graph with connected components in groups of four.  We
> found experimentally that when this data is input in clustered order, the
> required memory is lower and runtime is much faster than when data is input
> in random order.  This makes intuitive sense because of the additional
> communication required for the random order.
>
>
>
> Our 1bn-edge test case was of this same form, input in clustered order,
> with groups of 10 vertices per component.  It failed at 8 x 60GB.  This is
> the kind of data that our application processes, so it is a realistic test
> for us.  I’ve found that social media test data sets tend to follow
> power-law distributions, and that GraphX has much less problem with them.
>
>
>
> A comparable test scaled to your cluster (16 x 80GB) would be 2.6bn edges
> in 10-vertex components using the synthetic test input I describe above.  I
> would be curious to know if this works and what settings you use to
> succeed, and if it continues to succeed for random input order.
>
>
>
> As for the C++ algorithm, it scales multi-core.  It exhibits O(N^2)
> behavior for large data sets, but it processes the 1bn-edge case on a
> single 60GB node in about 20 minutes.  It degrades gracefully along the
> O(N^2) curve and additional memory reduces time.
>
>
>
> John Lilley
>
>
>
> From: Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr]
> Sent: Friday, March 11, 2016 8:14 AM
> To: John Lilley <john.lil...@redpoint.net>
> Cc: lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>;
> u...@spark.incubator.apache.org
> Subject: Re: Graphx
>
>
>
> Hi,
>
>
>
> I wonder what version of Spark and what parameter configuration you
> used.
>
> I was able to run CC for 1.8bn edges in about 8 minutes (23 iterations)
> 

RE: Graphx

2016-03-11 Thread John Lilley
I suppose for a 2.6bn case we’d need Long:

public class GenCCInput {
  public static void main(String[] args) {
if (args.length != 2) {
  System.err.println("Usage: \njava GenCCInput <edges> <groupSize>");
  System.exit(-1);
}
long edges = Long.parseLong(args[0]);
long groupSize = Long.parseLong(args[1]);
long currentEdge = 1;
long currentGroupSize = 0;
for (long i = 0; i < edges; i++) {
  System.out.println(currentEdge + " " + (currentEdge + 1));
  if (currentGroupSize == 0) {
currentGroupSize = 2;
  } else {
currentGroupSize++;
  }
  if (currentGroupSize >= groupSize) {
currentGroupSize = 0;
currentEdge += 2;
  } else {
currentEdge++;
  }
}
  }
}

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516  | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: Friday, March 11, 2016 8:46 AM
To: Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>
Cc: lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>; 
u...@spark.incubator.apache.org; Geoff Thompson <geoff.thomp...@redpoint.net>
Subject: RE: Graphx

Ovidiu,

IMHO, this is one of the biggest issues facing GraphX and Spark.  There are a 
lot of knobs and levers to pull to affect performance, with very little 
guidance about which settings work in general.  We cannot ship software that 
requires end-user tuning; it just has to work.  Unfortunately GraphX seems very 
sensitive to working set size relative to available RAM and fails 
catastrophically as opposed to gracefully when working set is too large.  It is 
also very sensitive to the nature of the data.  For example, if we build a test 
file with input-edge representation like:
1 2
2 3
3 4
5 6
6 7
7 8
…
this represents a graph with connected components in groups of four.  We found 
experimentally that when this data is input in clustered order, the required 
memory is lower and runtime is much faster than when data is input in random 
order.  This makes intuitive sense because of the additional communication 
required for the random order.

Our 1bn-edge test case was of this same form, input in clustered order, with 
groups of 10 vertices per component.  It failed at 8 x 60GB.  This is the kind 
of data that our application processes, so it is a realistic test for us.  I’ve 
found that social media test data sets tend to follow power-law distributions, 
and that GraphX has much less problem with them.

A comparable test scaled to your cluster (16 x 80GB) would be 2.6bn edges in 
10-vertex components using the synthetic test input I describe above.  I would 
be curious to know if this works and what settings you use to succeed, and if 
it continues to succeed for random input order.

As for the C++ algorithm, it scales multi-core.  It exhibits O(N^2) behavior 
for large data sets, but it processes the 1bn-edge case on a single 60GB node 
in about 20 minutes.  It degrades gracefully along the O(N^2) curve and 
additional memory reduces time.

John Lilley

From: Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr]
Sent: Friday, March 11, 2016 8:14 AM
To: John Lilley <john.lil...@redpoint.net>
Cc: lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>; 
u...@spark.incubator.apache.org
Subject: Re: Graphx

Hi,

I wonder what version of Spark and what parameter configuration you used.
I was able to run CC for 1.8bn edges in about 8 minutes (23 iterations) using 
16 nodes with around 80GB RAM each (Spark 1.5, default parameters)
John: I suppose your C++ app (algorithm) does not scale if you used only one 
node.
I don’t understand how the RDD serialization is taking excessive time: compared 
to the total time, or to some other expected time?

For the different RDD timings you have the events and UI console, and a bunch of 
papers describing how to measure different things. lihu: did you use some 
incomplete tool, or what are you looking for?

Best,
Ovidiu

On 11 Mar 2016, at 16:02, John Lilley <john.lil...@redpoint.net> wrote:

A colleague did the experiments and I don’t know exactly how he observed that.  
I think it was indirect: from the Spark diagnostics indicating the amount of I/O, 
he deduced that this was RDD serialization.  Also, when he added light 
compression to RDD serialization, this improved matters.

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516  | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

RE: Graphx

2016-03-11 Thread John Lilley
PS: This is the code I use to generate clustered test data:

public class GenCCInput {
  public static void main(String[] args) {
if (args.length != 2) {
  System.err.println("Usage: \njava GenCCInput <edges> <groupSize>");
  System.exit(-1);
}
int edges = Integer.parseInt(args[0]);
int groupSize = Integer.parseInt(args[1]);
int currentEdge = 1;
int currentGroupSize = 0;
for (int i = 0; i < edges; i++) {
  System.out.println(currentEdge + " " + (currentEdge + 1));
  if (currentGroupSize == 0) {
currentGroupSize = 2;
  } else {
currentGroupSize++;
  }
  if (currentGroupSize >= groupSize) {
currentGroupSize = 0;
currentEdge += 2;
  } else {
currentEdge++;
  }
}
  }
}

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516  | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr]
Sent: Friday, March 11, 2016 8:14 AM
To: John Lilley <john.lil...@redpoint.net>
Cc: lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>; 
u...@spark.incubator.apache.org
Subject: Re: Graphx

Hi,

I wonder what version of Spark and what parameter configuration you used.
I was able to run CC for 1.8bn edges in about 8 minutes (23 iterations) using 
16 nodes with around 80GB RAM each (Spark 1.5, default parameters)
John: I suppose your C++ app (algorithm) does not scale if you used only one 
node.
I don’t understand how the RDD serialization is taking excessive time: compared 
to the total time, or to some other expected time?

For the different RDD timings you have the events and UI console, and a bunch of 
papers describing how to measure different things. lihu: did you use some 
incomplete tool, or what are you looking for?

Best,
Ovidiu

On 11 Mar 2016, at 16:02, John Lilley <john.lil...@redpoint.net> wrote:

A colleague did the experiments and I don’t know exactly how he observed that.  
I think it was indirect: from the Spark diagnostics indicating the amount of I/O, 
he deduced that this was RDD serialization.  Also, when he added light 
compression to RDD serialization, this improved matters.

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516  | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: lihu [mailto:lihu...@gmail.com]
Sent: Friday, March 11, 2016 7:58 AM
To: John Lilley <john.lil...@redpoint.net>
Cc: Andrew A <andrew.a...@gmail.com>; u...@spark.incubator.apache.org
Subject: Re: Graphx

Hi, John:
   I am very interested in your experiment. How did you find that RDD 
serialization cost lots of time: from the log or some other tools?

On Fri, Mar 11, 2016 at 8:46 PM, John Lilley <john.lil...@redpoint.net> wrote:
Andrew,

We conducted some tests for using Graphx to solve the connected-components 
problem and were disappointed.  On 8 nodes of 16GB each, we could not get above 
100M edges.  On 8 nodes of 60GB each, we could not process 1bn edges.  RDD 
serialization would take excessive time and then we would get failures.  By 
contrast, we have a C++ algorithm that solves 1bn edges using memory+disk on a 
single 16GB node in about an hour.  I think that a very large cluster will do 
better, but we did not explore that.

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: Andrew A [mailto:andrew.a...@gmail.com]
Sent: Thursday, March 10, 2016 2:44 PM
To: u...@spark.incubator.apache.org
Subject: Graphx

Hi, is there anyone who uses GraphX in production? What maximum size of graph 
did you process with Spark, and what cluster do you use for it?

I tried to calculate PageRank for a 1 GB LJ edge dataset (LiveJournalPageRank 
from the Spark examples) and I faced large-volume shuffles produced by Spark, 
which fail my Spark job.
Thank you,
Andrew



RE: Graphx

2016-03-11 Thread John Lilley
Ovidiu,

IMHO, this is one of the biggest issues facing GraphX and Spark.  There are a 
lot of knobs and levers to pull to affect performance, with very little 
guidance about which settings work in general.  We cannot ship software that 
requires end-user tuning; it just has to work.  Unfortunately GraphX seems very 
sensitive to working set size relative to available RAM and fails 
catastrophically as opposed to gracefully when working set is too large.  It is 
also very sensitive to the nature of the data.  For example, if we build a test 
file with input-edge representation like:
1 2
2 3
3 4
5 6
6 7
7 8
…
this represents a graph with connected components in groups of four.  We found 
experimentally that when this data is input in clustered order, the required 
memory is lower and runtime is much faster than when data is input in random 
order.  This makes intuitive sense because of the additional communication 
required for the random order.

Our 1bn-edge test case was of this same form, input in clustered order, with 
groups of 10 vertices per component.  It failed at 8 x 60GB.  This is the kind 
of data that our application processes, so it is a realistic test for us.  I’ve 
found that social media test data sets tend to follow power-law distributions, 
and that GraphX has much less problem with them.

A comparable test scaled to your cluster (16 x 80GB) would be 2.6bn edges in 
10-vertex components using the synthetic test input I describe above.  I would 
be curious to know if this works and what settings you use to succeed, and if 
it continues to succeed for random input order.

As for the C++ algorithm, it scales multi-core.  It exhibits O(N^2) behavior 
for large data sets, but it processes the 1bn-edge case on a single 60GB node 
in about 20 minutes.  It degrades gracefully along the O(N^2) curve and 
additional memory reduces time.

John Lilley

From: Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr]
Sent: Friday, March 11, 2016 8:14 AM
To: John Lilley <john.lil...@redpoint.net>
Cc: lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>; 
u...@spark.incubator.apache.org
Subject: Re: Graphx

Hi,

I wonder what version of Spark and what parameter configuration you used.
I was able to run CC for 1.8bn edges in about 8 minutes (23 iterations) using 
16 nodes with around 80GB RAM each (Spark 1.5, default parameters)
John: I suppose your C++ app (algorithm) does not scale if you used only one 
node.
I don’t understand how the RDD serialization is taking excessive time: compared 
to the total time, or to some other expected time?

For the different RDD timings you have the events and UI console, and a bunch of 
papers describing how to measure different things. lihu: did you use some 
incomplete tool, or what are you looking for?

Best,
Ovidiu

On 11 Mar 2016, at 16:02, John Lilley <john.lil...@redpoint.net> wrote:

A colleague did the experiments and I don’t know exactly how he observed that.  
I think it was indirect: from the Spark diagnostics indicating the amount of I/O, 
he deduced that this was RDD serialization.  Also, when he added light 
compression to RDD serialization, this improved matters.

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516  | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: lihu [mailto:lihu...@gmail.com]
Sent: Friday, March 11, 2016 7:58 AM
To: John Lilley <john.lil...@redpoint.net>
Cc: Andrew A <andrew.a...@gmail.com>; u...@spark.incubator.apache.org
Subject: Re: Graphx

Hi, John:
   I am very interested in your experiment. How did you find that RDD 
serialization cost lots of time: from the log or some other tools?

On Fri, Mar 11, 2016 at 8:46 PM, John Lilley <john.lil...@redpoint.net> wrote:
Andrew,

We conducted some tests for using Graphx to solve the connected-components 
problem and were disappointed.  On 8 nodes of 16GB each, we could not get above 
100M edges.  On 8 nodes of 60GB each, we could not process 1bn edges.  RDD 
serialization would take excessive time and then we would get failures.  By 
contrast, we have a C++ algorithm that solves 1bn edges using memory+disk on a 
single 16GB node in about an hour.  I think that a very large cluster will do 
better, but we did not explore that.

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: 

Re: Graphx

2016-03-11 Thread Ovidiu-Cristian MARCU
Hi,

I wonder what version of Spark and what parameter configuration you used.
I was able to run CC for 1.8bn edges in about 8 minutes (23 iterations) using 
16 nodes with around 80GB RAM each (Spark 1.5, default parameters)
John: I suppose your C++ app (algorithm) does not scale if you used only one 
node.
I don’t understand how the RDD serialization is taking excessive time: compared 
to the total time, or to some other expected time?

For the different RDD timings you have the events and UI console, and a bunch of 
papers describing how to measure different things. lihu: did you use some 
incomplete tool, or what are you looking for?

Best,
Ovidiu

> On 11 Mar 2016, at 16:02, John Lilley <john.lil...@redpoint.net> wrote:
> 
> A colleague did the experiments and I don’t know exactly how he observed 
> that.  I think it was indirect: from the Spark diagnostics indicating the 
> amount of I/O, he deduced that this was RDD serialization.  Also, when he added 
> light compression to RDD serialization, this improved matters.
>  
> John Lilley
> Chief Architect, RedPoint Global Inc.
> T: +1 303 541 1516  | M: +1 720 938 5761 | F: +1 781-705-2077
> Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net
>  
> From: lihu [mailto:lihu...@gmail.com] 
> Sent: Friday, March 11, 2016 7:58 AM
> To: John Lilley <john.lil...@redpoint.net>
> Cc: Andrew A <andrew.a...@gmail.com>; u...@spark.incubator.apache.org
> Subject: Re: Graphx
>  
> Hi, John:
>    I am very interested in your experiment. How did you find that RDD 
> serialization cost lots of time: from the log or some other tools?
>  
> On Fri, Mar 11, 2016 at 8:46 PM, John Lilley <john.lil...@redpoint.net> wrote:
> Andrew,
>  
> We conducted some tests for using Graphx to solve the connected-components 
> problem and were disappointed.  On 8 nodes of 16GB each, we could not get 
> above 100M edges.  On 8 nodes of 60GB each, we could not process 1bn edges.  
> RDD serialization would take excessive time and then we would get failures.  
> By contrast, we have a C++ algorithm that solves 1bn edges using memory+disk 
> on a single 16GB node in about an hour.  I think that a very large cluster 
> will do better, but we did not explore that.
>  
> John Lilley
> Chief Architect, RedPoint Global Inc.
> T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
> Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net
>  
> From: Andrew A [mailto:andrew.a...@gmail.com] 
> Sent: Thursday, March 10, 2016 2:44 PM
> To: u...@spark.incubator.apache.org
> Subject: Graphx
>  
> Hi, is there anyone who uses GraphX in production? What maximum size of graph 
> did you process with Spark, and what cluster do you use for it?
> 
> I tried to calculate PageRank for a 1 GB LJ edge dataset (LiveJournalPageRank 
> from the Spark examples) and I faced large-volume shuffles produced by Spark, 
> which fail my Spark job.
> 
> Thank you,
> Andrew



Re: Graphx

2016-03-11 Thread lihu
Hi, John:
   I am very interested in your experiment. How did you find that RDD
serialization cost lots of time: from the log or some other tools?

On Fri, Mar 11, 2016 at 8:46 PM, John Lilley 
wrote:

> Andrew,
>
>
>
> We conducted some tests for using Graphx to solve the connected-components
> problem and were disappointed.  On 8 nodes of 16GB each, we could not get
> above 100M edges.  On 8 nodes of 60GB each, we could not process 1bn
> edges.  RDD serialization would take excessive time and then we would get
> failures.  By contrast, we have a C++ algorithm that solves 1bn edges using
> memory+disk on a single 16GB node in about an hour.  I think that a very
> large cluster will do better, but we did not explore that.
>
>
>
> John Lilley
>
> Chief Architect, RedPoint Global Inc.
>
> T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
>
> Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net
>
>
>
> From: Andrew A [mailto:andrew.a...@gmail.com]
> Sent: Thursday, March 10, 2016 2:44 PM
> To: u...@spark.incubator.apache.org
> Subject: Graphx
>
>
>
> Hi, is there anyone who uses GraphX in production? What maximum size of
> graph did you process with Spark, and what cluster do you use for it?
>
> I tried to calculate PageRank for a 1 GB LJ edge dataset (LiveJournalPageRank
> from the Spark examples) and I faced large-volume shuffles produced by Spark,
> which fail my Spark job.
>
> Thank you,
>
> Andrew
>


RE: Graphx

2016-03-11 Thread John Lilley
A colleague did the experiments and I don’t know exactly how he observed that.  
I think it was indirect: from the Spark diagnostics indicating the amount of I/O, 
he deduced that this was RDD serialization.  Also, when he added light 
compression to RDD serialization, this improved matters.
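
A hedged sketch of the knobs involved; the poster's exact settings are not
stated, and these standard properties only help for serialized storage
levels (e.g. MEMORY_ONLY_SER) and shuffle/spill data:

import org.apache.spark.SparkConf

// Pass this conf to the SparkContext at startup.
val conf = new SparkConf()
  .set("spark.rdd.compress", "true")          // compress serialized RDD blocks
  .set("spark.io.compression.codec", "lz4")   // a cheap, fast codec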

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516  | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: lihu [mailto:lihu...@gmail.com]
Sent: Friday, March 11, 2016 7:58 AM
To: John Lilley <john.lil...@redpoint.net>
Cc: Andrew A <andrew.a...@gmail.com>; u...@spark.incubator.apache.org
Subject: Re: Graphx

Hi, John:
   I am very interested in your experiment. How did you find that RDD 
serialization cost lots of time: from the log or some other tools?

On Fri, Mar 11, 2016 at 8:46 PM, John Lilley <john.lil...@redpoint.net> wrote:
Andrew,

We conducted some tests for using Graphx to solve the connected-components 
problem and were disappointed.  On 8 nodes of 16GB each, we could not get above 
100M edges.  On 8 nodes of 60GB each, we could not process 1bn edges.  RDD 
serialization would take excessive time and then we would get failures.  By 
contrast, we have a C++ algorithm that solves 1bn edges using memory+disk on a 
single 16GB node in about an hour.  I think that a very large cluster will do 
better, but we did not explore that.

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

From: Andrew A [mailto:andrew.a...@gmail.com]
Sent: Thursday, March 10, 2016 2:44 PM
To: u...@spark.incubator.apache.org
Subject: Graphx

Hi, is there anyone who uses GraphX in production? What maximum size of graph 
did you process with Spark, and what cluster do you use for it?

I tried to calculate PageRank for a 1 GB LJ edge dataset (LiveJournalPageRank 
from the Spark examples) and I faced large-volume shuffles produced by Spark, 
which fail my Spark job.
Thank you,
Andrew



RE: Graphx

2016-03-11 Thread John Lilley
Andrew,

We conducted some tests for using Graphx to solve the connected-components 
problem and were disappointed.  On 8 nodes of 16GB each, we could not get above 
100M edges.  On 8 nodes of 60GB each, we could not process 1bn edges.  RDD 
serialization would take excessive time and then we would get failures.  By 
contrast, we have a C++ algorithm that solves 1bn edges using memory+disk on a 
single 16GB node in about an hour.  I think that a very large cluster will do 
better, but we did not explore that.

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516  | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | 
john.lil...@redpoint.net | 
www.redpoint.net

From: Andrew A [mailto:andrew.a...@gmail.com]
Sent: Thursday, March 10, 2016 2:44 PM
To: u...@spark.incubator.apache.org
Subject: Graphx

Hi, is there anyone who uses GraphX in production? What maximum size of graph 
did you process with Spark, and what cluster do you use for it?

I tried to calculate PageRank for a 1 GB LJ edge dataset (LiveJournalPageRank 
from the Spark examples) and I faced large-volume shuffles produced by Spark, 
which fail my Spark job.

Thank you,
Andrew


Re: GraphX can show graph?

2016-01-29 Thread Balachandar R.A.
Thanks... Will look into that

- Bala

On 28 January 2016 at 15:36, Sahil Sareen  wrote:

> Try Neo4j for visualization; GraphX does a pretty good job at distributed
> graph processing.
>
> On Thu, Jan 28, 2016 at 12:42 PM, Balachandar R.A. <
> balachandar...@gmail.com> wrote:
>
>> Hi
>>
>> I am new to GraphX. I have a simple CSV file which I could load and
>> compute a few graph statistics on. However, I am not sure whether it is possible
>> to create and show a graph (for visualization purposes) using GraphX. Any
>> pointer to a tutorial or information connected to this will be really helpful
>>
>> Thanks and regards
>> Bala
>>
>
>


Re: GraphX can show graph?

2016-01-29 Thread Russell Jurney
Maybe checkout Gephi. It is a program that does what you need out of the
box.

On Friday, January 29, 2016, Balachandar R.A. 
wrote:

> Thanks... Will look into that
>
> - Bala
>
> On 28 January 2016 at 15:36, Sahil Sareen  > wrote:
>
>> Try Neo4j for visualization; GraphX does a pretty good job at distributed
>> graph processing.
>>
>> On Thu, Jan 28, 2016 at 12:42 PM, Balachandar R.A. <
>> balachandar...@gmail.com
>> > wrote:
>>
>>> Hi
>>>
>>> I am new to GraphX. I have a simple CSV file which I could load and
>>> compute a few graph statistics on. However, I am not sure whether it is possible
>>> to create and show a graph (for visualization purposes) using GraphX. Any
>>> pointer to a tutorial or information connected to this will be really helpful
>>>
>>> Thanks and regards
>>> Bala
>>>
>>
>>
>

-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io


Re: GraphX can show graph?

2016-01-28 Thread Sahil Sareen
Try Neo4j for visualization; GraphX does a pretty good job at distributed
graph processing.

On Thu, Jan 28, 2016 at 12:42 PM, Balachandar R.A.  wrote:

> Hi
>
> I am new to GraphX. I have a simple CSV file which I could load and
> compute a few graph statistics on. However, I am not sure whether it is possible
> to create and show a graph (for visualization purposes) using GraphX. Any
> pointer to a tutorial or information connected to this will be really helpful
>
> Thanks and regards
> Bala
>


Re: GraphX - How to make a directed graph an undirected graph?

2015-11-26 Thread Robineast
1. GraphX doesn't have a concept of undirected graphs; edges are always
specified with a srcId and dstId. However there is nothing to stop you
adding in edges that point in the other direction, i.e. if you have an edge
with srcId -> dstId you can add an edge dstId -> srcId (see the sketch below)

2. In general APIs will return a single Graph object even if the resulting
graph is partitioned. You should read the API docs for the specifics though
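
A minimal sketch of point 1, assuming an existing g: Graph[Int, Int] whose
edge attributes can simply be copied onto the reverse edges:

import org.apache.spark.graphx.{Edge, Graph}

// Union each edge with its reverse so traversals behave as undirected.
val undirected = Graph(
  g.vertices,
  g.edges.union(g.edges.map(e => Edge(e.dstId, e.srcId, e.attr))))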



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-How-to-make-a-directed-graph-an-undirected-graph-tp25495p25499.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: graphx - mutable?

2015-10-14 Thread rohit13k
Hi

I am also working in the same area, where the graph evolves over time and the
current approach of rebuilding the graph again and again is very slow and
memory-consuming. Did you find any workaround?
What was your use case?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/graphx-mutable-tp15777p25057.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: GraphX: How can I tell if 2 nodes are connected?

2015-10-06 Thread Dino Fancellu
Ok, thanks, just wanted to make sure I wasn't missing something
obvious. I've worked with Neo4j cypher as well, where it was rather
more obvious.

e.g. http://neo4j.com/docs/milestone/query-match.html#_shortest_path
http://neo4j.com/docs/stable/cypher-refcard/

Dino.

On 6 October 2015 at 06:43, Robineast [via Apache Spark User List]
 wrote:
> GraphX doesn't implement Tinkerpop functionality but there is an external
> effort to provide an implementation. See
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4279
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
> 
> If you reply to this email, your message will be added to the discussion
> below:
> http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-How-can-I-tell-if-2-nodes-are-connected-tp24926p24941.html
> To unsubscribe from GraphX: How can I tell if 2 nodes are connected?, click
> here.
> NAML




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-How-can-I-tell-if-2-nodes-are-connected-tp24926p24944.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: Graphx hangs and crashes on EdgeRDD creation

2015-10-06 Thread William Saar
Hi, I get the same problem with both the CanonicalVertexCut and 
RandomVertexCut, with the graph code as follows

val graph = Graph.fromEdgeTuples(indexedEdges, 0, None, 
StorageLevel.MEMORY_AND_DISK_SER, StorageLevel.MEMORY_AND_DISK_SER);
graph.partitionBy(PartitionStrategy.RandomVertexCut);
graph.connectedComponents().vertices


From: Robin East [mailto:robin.e...@xense.co.uk]
Sent: den 5 oktober 2015 19:07
To: William Saar <william.s...@king.com>; user@spark.apache.org
Subject: Re: Graphx hangs and crashes on EdgeRDD creation

Have you tried using Graph.partitionBy? e.g. using 
PartitionStrategy.RandomVertexCut?
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action




On 5 Oct 2015, at 09:14, William Saar <william.s...@king.com> wrote:

Hi,
I am trying to run a GraphX job on 20 million edges with Spark 1.5.1, but the 
job seems to hang for 30 minutes on a single executor when creating the graph 
and eventually crashes with “IllegalArgumentException: Size exceeds 
Integer.MAX_VALUE”

I suspect this is because of a partitioning problem, but how can I control the 
partitioning of the creation of the EdgeRDD?

My graph code only does the following:
val graph = Graph.fromEdgeTuples(indexedEdges, 0, None, 
StorageLevel.MEMORY_AND_DISK_SER, StorageLevel.MEMORY_AND_DISK_SER);
graph.connectedComponents().vertices

The web UI shows the following while the job is hanging (I am running this 
inside a transform operation on spark streaming)
transform at MyJob.scala:62 (+details)
RDD: EdgeRDD

org.apache.spark.streaming.dstream.DStream.transform(DStream.scala:649)

com.example.MyJob$.main(MyJob.scala:62)

com.example.MyJob.main(MyJob.scala)

sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

The executor thread dump while the job is hanging is the following
Thread 66: Executor task launch worker-1 (RUNNABLE)
java.lang.System.identityHashCode(Native Method)
com.esotericsoftware.kryo.util.IdentityObjectIntMap.get(IdentityObjectIntMap.java:241)
com.esotericsoftware.kryo.util.MapReferenceResolver.getWrittenId(MapReferenceResolver.java:28)
com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:588)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:566)
org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:158)
org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190)
org.apache.spark.storage.DiskStore$$anonfun$putIterator$1.apply$mcV$sp(DiskStore.scala:81)
org.apache.spark.storage.DiskStore$$anonfun$putIterator$1.apply(DiskStore.scala:81)
org.apache.spark.storage.DiskStore$$anonfun$putIterator$1.apply(DiskStore.scala:81)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1206)
org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:82)
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:791)
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:88)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

The failure stack trace is as follows:
15/10/02 17:09:54 ERROR JobSched

Re: GraphX: How can I tell if 2 nodes are connected?

2015-10-05 Thread Dino Fancellu
Ah thanks, got it working with that.

e.g.

val (_, smap) = shortest.vertices.filter(_._1 == src).first
smap.contains(dest)

Is there anything a little less eager?

i.e. something that doesn't compute all the distances from all source nodes, where I
can supply the source vertex id and the dest vertex id, and just get an int back.

Thanks 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-How-can-I-tell-if-2-nodes-are-connected-tp24926p24935.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: GraphX: How can I tell if 2 nodes are connected?

2015-10-05 Thread Anwar Rizal
Maybe connected components is what you need?
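
A minimal sketch of that idea, assuming an existing graph and two made-up
vertex ids (1L and 2L) to test:

// Two vertices are connected iff they get the same component id.
val cc = graph.connectedComponents().vertices
val ids = cc.filter { case (id, _) => id == 1L || id == 2L }
  .map(_._2).collect()
val connected = ids.length == 2 && ids(0) == ids(1)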
On Oct 5, 2015 19:02, "Robineast"  wrote:

> GraphX has a Shortest Paths algorithm implementation which will tell you,
> for all vertices in the graph, the shortest distance to a specific ('landmark')
> vertex. The returned value is 'a graph where each vertex attribute is a map
> containing the shortest-path distance to each reachable landmark vertex'.
> If there is no path to the landmark vertex then the map for the source
> vertex is empty
>
>
>
> -
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-How-can-I-tell-if-2-nodes-are-connected-tp24926p24930.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: GraphX: How can I tell if 2 nodes are connected?

2015-10-05 Thread Robineast
GraphX doesn't implement Tinkerpop functionality but there is an external
effort to provide an implementation. See
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4279



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-How-can-I-tell-if-2-nodes-are-connected-tp24926p24941.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: GraphX: How can I tell if 2 nodes are connected?

2015-10-05 Thread Robineast
GraphX has a Shortest Paths algorithm implementation which will tell you, for
all vertices in the graph, the shortest distance to a specific ('landmark')
vertex. The returned value is 'a graph where each vertex attribute is a map 
containing the shortest-path distance to each reachable landmark vertex'.
If there is no path to the landmark vertex then the map for the source
vertex is empty
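
A minimal sketch of that usage; the graph name and the vertex ids (1L as the
source, 42L as the landmark) are made-up examples:

import org.apache.spark.graphx.lib.ShortestPaths

val result = ShortestPaths.run(graph, Seq(42L))    // 42L is the landmark
val reachable = result.vertices
  .filter { case (id, _) => id == 1L }             // look at the source vertex
  .map { case (_, spMap) => spMap.contains(42L) }  // non-empty entry => path
  .first()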



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-How-can-I-tell-if-2-nodes-are-connected-tp24926p24930.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: GraphX create graph with multiple node attributes

2015-09-26 Thread JJ
Robineast wrote
> 2) let GraphX supply a null instead
>  val graph = Graph(vertices, edges)  // vertices found in 'edges' but
> not in 'vertices' will be set to null 

Thank you! This method works.

As a follow up (sorry I'm new to this, don't know if I should start a new
thread?): if I have vertices that are in 'vertices' but not in 'edges' (the
opposite of what you mention), will they be counted as part of the graph
but with 0 edges, or will they be dropped from the graph? When I count the
number of vertices with vertices.count, I get 13,628 nodes. When I count
graph vertices with graph.vertices.count, I get 12,274 nodes. When I count
vertices with 1+ degrees with graph.degrees.count I get 10,091 vertices...
What am I dropping each time?

Thanks again!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-create-graph-with-multiple-node-attributes-tp24827p24830.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: GraphX create graph with multiple node attributes

2015-09-26 Thread JJ
Here is all of my code. My first post had a simplified version. As I post
this, I realize one issue may be that when I convert my Ids to long (I
define a pageHash function to convert string Ids to long), the nodeIds are
no longer the same between the 'vertices' object and the 'edges' object. Do
you think this is what is causing the issue?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-create-graph-with-multiple-node-attributes-tp24827p24832.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: GraphX create graph with multiple node attributes

2015-09-26 Thread Robineast
Vertices that aren't connected to anything are perfectly valid e.g.

import org.apache.spark.graphx._

val vertices = sc.makeRDD(Seq((1L,1),(2L,1),(3L,1)))
val edges = sc.makeRDD(Seq(Edge(1L,2L,1)))

val g = Graph(vertices, edges)
g.vertices.count

gives 3

Not sure why vertices appear to be dropping off. Could you show your full
code.

g.degrees.count gives 2 - as the scaladocs mention 'The degree of each
vertex in the graph. @note Vertices with no edges are not returned in the
resulting RDD'






-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-create-graph-with-multiple-node-attributes-tp24827p24831.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: GraphX create graph with multiple node attributes

2015-09-26 Thread Nick Peterson
Have you checked to make sure that your hashing function doesn't have any
collisions?  Node ids have to be unique; so, if you're getting repeated ids
out of your hasher, it could certainly lead to dropping of duplicate ids,
and therefore loss of vertices.

On Sat, Sep 26, 2015 at 10:37 AM JJ  wrote:

> Here is all of my code. My first post had a simplified version. As I post
> this, I realize one issue may be that when I convert my Ids to long (I
> define a pageHash function to convert string Ids to long), the nodeIds are
> no longer the same between the 'vertices' object and the 'edges' object. Do
> you think this is what is causing the issue?
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-create-graph-with-multiple-node-attributes-tp24827p24832.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Graphx CompactBuffer help

2015-08-28 Thread Robineast
my previous reply got mangled
This should work:

coon.filter(x => x.exists(el => Seq(1,15).contains(el)))

CompactBuffer is a specialised form of a Scala Iterator

---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/malak/



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-CompactBuffer-help-tp24481p24490.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: graphx class not found error

2015-08-13 Thread Ted Yu
The code and error didn't go through.

Mind sending again ?

Which Spark release are you using ?

On Thu, Aug 13, 2015 at 6:17 PM, dizzy5112 dave.zee...@gmail.com wrote:

 the code below works perfectly on both cluster and local modes



 but when i try to create a graph in cluster mode (it works in local mode)


 I get the following error:



 any help appreciated



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/graphx-class-not-found-error-tp24253.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: graphx class not found error

2015-08-13 Thread dizzy5112
Oh forgot to note using the Scala REPL for this.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/graphx-class-not-found-error-tp24253p24254.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: GraphX Synth Benchmark

2015-07-09 Thread Khaled Ammar
Hi,

I am not a spark expert but I found that passing a small partitions value
might help. Try to use this option --numEPart=$partitions where
partitions=3 (number of workers) or at most 3*40 (total number of cores).
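
For example, mirroring the command quoted below (the master URL is a
placeholder, and 120 = 3*40 is just the upper bound mentioned above):

MASTER=spark://master:7077 bin/run-example graphx.SynthBenchmark \
  -app=pagerank -niters=100 -nverts=4847571 -numEPart=120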

Thanks,
-Khaled

On Thu, Jul 9, 2015 at 11:37 AM, AshutoshRaghuvanshi 
ashutosh.raghuvans...@gmail.com wrote:

 I am running a Spark cluster over SSH in standalone mode.

 I have run the PageRank LiveJournal example:

 MASTER=spark://172.17.27.12:7077 bin/run-example graphx.SynthBenchmark
 -app=pagerank -niters=100 -nverts=4847571  Output/soc-liveJounral.txt

 it's been running for more than 2 hours. I guess this is not normal; what am
 I doing wrong?

 system details:
 4 nodes (1+3)
 40 cores each, 64G memory, out of which I have given spark.executor 50G

 One more thing I notice: one of the servers is used more than the others.

 Please help ASAP.

 Thank you
 http://apache-spark-user-list.1001560.n3.nabble.com/file/n23747/13.png



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Synth-Benchmark-tp23747.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Thanks,
-Khaled


Re: GraphX - ConnectedComponents (Pregel) - longer and longer interval between jobs

2015-06-29 Thread Thomas Gerber
It seems the root cause of the delay was the sheer size of the DAG for
those jobs, which are towards the end of a long series of jobs.

To reduce it, you can probably try to checkpoint (rdd.checkpoint) some
previous RDDs. That will:
1. save the RDD on disk
2. remove all references to the parents of this RDD

Which means that when a job uses that RDD, the DAG stops at that RDD and
does not look at its parents, as it doesn't have them anymore. It is very
similar to saving your RDD and re-loading it as a fresh RDD.
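
A minimal sketch of that suggestion in an iterative loop; the checkpoint
directory, the dummy step() body, and the interval of 10 are illustrative
assumptions:

import org.apache.spark.rdd.RDD

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

def step(r: RDD[Long]): RDD[Long] = r.map(_ + 1)  // stand-in for one real iteration

var rdd: RDD[Long] = sc.parallelize(1L to 1000L)
for (i <- 1 to 100) {
  rdd = step(rdd).cache()
  if (i % 10 == 0) {     // every 10th iteration:
    rdd.checkpoint()     // truncate the lineage on disk...
    rdd.count()          // ...the checkpoint runs on the next action
  }
}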

On Fri, Jun 26, 2015 at 9:14 AM, Thomas Gerber thomas.ger...@radius.com
wrote:

 Note that this problem is probably NOT caused directly by GraphX, but
 GraphX reveals it because as you go further down the iterations, you get
 further and further away of a shuffle you can rely on.

 On Thu, Jun 25, 2015 at 7:43 PM, Thomas Gerber thomas.ger...@radius.com
 wrote:

 Hello,

 We run GraphX ConnectedComponents, and we notice that there is a time gap
 that becomes larger and larger during Jobs, that is not accounted for.

 In the screenshot attached, you will notice that each job only takes
 around 2 1/2min. At first, the next job/iteration starts immediately after
 the previous one. But as we go through iterations, there is a gap (time
 where job N+1 starts - time where job N finishes) that grows, reaching
 ultimately 6 minutes around the 30th iteration.

 I suspect it has to do with DAG computation on the driver, as evidenced
 by the very large (and getting larger at every iteration) number of pending stages
 that are ultimately skipped.

 So,
 1. is there anything obvious we can do to make that gap between
 iterations shorter?
 2. would dividing the number of partitions in the input RDD per 2 divide
 the gap by 2 as well?

 I ask because 3 min gap on average for a job length of 2 1/2 min = we
 are wasting 50% of CPU time on the Executors.

 Thanks!
 Thomas





Re: GraphX - ConnectedComponents (Pregel) - longer and longer interval between jobs

2015-06-26 Thread Thomas Gerber
Note that this problem is probably NOT caused directly by GraphX, but
GraphX reveals it because as you go further down the iterations, you get
further and further away of a shuffle you can rely on.

On Thu, Jun 25, 2015 at 7:43 PM, Thomas Gerber thomas.ger...@radius.com
wrote:

 Hello,

 We run GraphX ConnectedComponents, and we notice that there is a time gap
 that becomes larger and larger during Jobs, that is not accounted for.

 In the screenshot attached, you will notice that each job only takes
 around 2 1/2min. At first, the next job/iteration starts immediately after
 the previous one. But as we go through iterations, there is a gap (time
 where job N+1 starts - time where job N finishes) that grows, reaching
 ultimately 6 minutes around the 30th iteration.

 I suspect it has to do with DAG computation on the driver, as evidenced by
 the very large (and getting larger at every iteration) number of pending stages
 that are ultimately skipped.

 So,
 1. is there anything obvious we can do to make that gap between
 iterations shorter?
 2. would dividing the number of partitions in the input RDD per 2 divide
 the gap by 2 as well?

 I ask because 3 min gap on average for a job length of 2 1/2 min = we are
 wasting 50% of CPU time on the Executors.

 Thanks!
 Thomas



Re: GraphX: unbalanced computation and slow runtime on livejournal network

2015-04-19 Thread hnahak
Hi Steve

I did Spark 1.3.0 PageRank benchmarking on soc-LiveJournal1 in a 4-node
cluster with 16, 16, 8, 8 GB RAM respectively. The cluster has 4 workers,
including the master, with 4, 4, 2, 2 CPUs.
I set executor memory to 3g and driver to 5g.

No. of Iterations   -- GraphX(mins)
1   -- 1
2   -- 1.2
3   -- 1.3
5   -- 1.6
10  -- 2.6
20  -- 3.9
   



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-unbalanced-computation-and-slow-runtime-on-livejournal-network-tp22565p22566.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [GraphX] aggregateMessages with active set

2015-04-13 Thread James
Hello,

Great thanks for your reply. From the code I found that the reason why my
program will scan all the edges is because the EdgeDirection I passed
in is EdgeDirection.Either.

However, I still have the problem that the time consumed by each iteration does
not decrease over time. Thus I have two questions:

1. What is the meaning of activeFraction in [1]?
2. As my edge RDD is too large to cache in memory, I used
StorageLevel.MEMORY_AND_DISK_SER as the persist level; if the program uses
aggregateMessagesIndexScan, will the program still have to load the whole edge
list into memory?

[1]
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L237-266

Alcaid


2015-04-10 2:47 GMT+08:00 Ankur Dave ankurd...@gmail.com:

 Actually, GraphX doesn't need to scan all the edges, because it
 maintains a clustered index on the source vertex id (that is, it sorts
 the edges by source vertex id and stores the offsets in a hash table).
 If the activeDirection is appropriately set, it can then jump only to
 the clusters with active source vertices.

 See the EdgePartition#index field [1], which stores the offsets, and
 the logic in GraphImpl#aggregateMessagesWithActiveSet [2], which
 decides whether to do a full scan or use the index.

 [1]
 https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartition.scala#L60
 [2].
 https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L237-266

 Ankur


 On Thu, Apr 9, 2015 at 3:21 AM, James alcaid1...@gmail.com wrote:
  In aggregateMessagesWithActiveSet, Spark still has to read all edges. It
  means that a fixed cost which scales with graph size is unavoidable in a
  pregel-like iteration.
 
  But what if I have to run nearly 100 iterations, but at the last 50
  iterations only 0.1% of the nodes need to be updated? The fixed cost
  makes the program finish with an unacceptable time consumption.



Re: [GraphX] aggregateMessages with active set

2015-04-09 Thread James
In aggregateMessagesWithActiveSet, Spark still has to read all edges. It
means that a fixed cost which scales with graph size is unavoidable in a
pregel-like iteration.

But what if I have to run nearly 100 iterations, but at the last 50
iterations only 0.1% of the nodes need to be updated? The fixed cost
makes the program finish with an unacceptable time consumption.

Alcaid

2015-04-08 1:41 GMT+08:00 Ankur Dave ankurd...@gmail.com:

 We thought it would be better to simplify the interface, since the
 active set is a performance optimization but the result is identical
 to calling subgraph before aggregateMessages.

 The active set option is still there in the package-private method
 aggregateMessagesWithActiveSet. You can actually access it publicly
 via GraphImpl, though the API isn't guaranteed to be stable:
 graph.asInstanceOf[GraphImpl[VD,ED]].aggregateMessagesWithActiveSet(...)
 Ankur


 On Tue, Apr 7, 2015 at 2:56 AM, James alcaid1...@gmail.com wrote:
  Hello,
 
  The old api of GraphX mapReduceTriplets has an optional parameter
  activeSetOpt: Option[(VertexRDD[_], EdgeDirection)] that limits the input of sendMessage.
 
  However, in the new api aggregateMessages I could not find this option;
  why is it not offered any more?
 
  Alcaid



Re: [GraphX] aggregateMessages with active set

2015-04-09 Thread Ankur Dave
Actually, GraphX doesn't need to scan all the edges, because it
maintains a clustered index on the source vertex id (that is, it sorts
the edges by source vertex id and stores the offsets in a hash table).
If the activeDirection is appropriately set, it can then jump only to
the clusters with active source vertices.

See the EdgePartition#index field [1], which stores the offsets, and
the logic in GraphImpl#aggregateMessagesWithActiveSet [2], which
decides whether to do a full scan or use the index.

[1] 
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartition.scala#L60
[2]. 
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L237-266
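
For context, a minimal use of the public aggregateMessages API that this
thread contrasts with (no active-set parameter); an out-degree count,
assuming an existing graph: Graph[Int, Int]:

import org.apache.spark.graphx.TripletFields

val outDeg = graph.aggregateMessages[Int](
  ctx => ctx.sendToSrc(1),   // one message per out-edge
  _ + _,                     // sum the messages at each vertex
  TripletFields.None)        // no vertex attributes are read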

Ankur


On Thu, Apr 9, 2015 at 3:21 AM, James alcaid1...@gmail.com wrote:
 In aggregateMessagesWithActiveSet, Spark still has to read all edges. It
 means that a fixed cost which scales with graph size is unavoidable in a
 pregel-like iteration.

 But what if I have to run nearly 100 iterations, but at the last 50
 iterations only 0.1% of the nodes need to be updated? The fixed cost
 makes the program finish with an unacceptable time consumption.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [GraphX] aggregateMessages with active set

2015-04-07 Thread Ankur Dave
We thought it would be better to simplify the interface, since the
active set is a performance optimization but the result is identical
to calling subgraph before aggregateMessages.

The active set option is still there in the package-private method
aggregateMessagesWithActiveSet. You can actually access it publicly
via GraphImpl, though the API isn't guaranteed to be stable:
graph.asInstanceOf[GraphImpl[VD,ED]].aggregateMessagesWithActiveSet(...)
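
To make the equivalence concrete, a minimal sketch (the Boolean "active"
vertex attribute and the function name are made up for illustration):
restricting to the active subgraph first gives the same result the
active-set option computes internally.

import org.apache.spark.graphx._

// Keep only edges whose source vertex is still active, then aggregate
// as usual; sendToSrc(1) plus a sum counts each active vertex's out-degree.
def activeOutDegrees(g: Graph[Boolean, Int]): VertexRDD[Int] =
  g.subgraph(epred = t => t.srcAttr)
   .aggregateMessages[Int](ctx => ctx.sendToSrc(1), _ + _)
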
Ankur


On Tue, Apr 7, 2015 at 2:56 AM, James alcaid1...@gmail.com wrote:
 Hello,

 The old API of GraphX, mapReduceTriplets, has an optional parameter
 activeSetOpt: Option[(VertexRDD[_], EdgeDirection)] that limits the input
 of the send-message function.

 However, in the new API aggregateMessages I could not find this option.
 Why is it not offered any more?

 Alcaid




Re: Graphx gets slower as the iteration number increases

2015-03-24 Thread Ankur Dave
This might be because partitions are getting dropped from memory and
needing to be recomputed. How much memory is in the cluster, and how large
are the partitions? This information should be in the Executors and Storage
pages in the web UI.
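
If the UI confirms that partitions are being evicted, one mitigation
(a sketch only; the path is made up, and sc is the usual SparkContext) is
to load the graph with storage levels that spill to disk, so dropped
partitions are read back from disk instead of recomputed:

import org.apache.spark.graphx.GraphLoader
import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK keeps partitions that do not fit in memory on local
// disk rather than dropping them from the cache entirely.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt",
  edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
  vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)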

Ankur http://www.ankurdave.com/

On Tue, Mar 24, 2015 at 7:12 PM, orangepri...@foxmail.com 
orangepri...@foxmail.com wrote:

 I'm working with GraphX to calculate the PageRank of an extremely large
 social network with billions of vertices.
 As the iteration number increases, each iteration becomes slower and
 slower, to the point of being unacceptable. Is there any reason for it?



Re: GraphX: Get edges for a vertex

2015-03-18 Thread Jeffrey Jedele
Hi Mas,
I never actually worked with GraphX, but one idea:

As far as I know, you can directly access the vertex and edge RDDs of your
Graph object. Why not simply run a .filter() on the edge RDD to get all
edges that originate from or end at your vertex?
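
A minimal sketch of that idea (the function name is made up; the types are
left generic):

import org.apache.spark.graphx._

// Keep only the edges that touch one particular vertex.
def edgesOf[VD, ED](graph: Graph[VD, ED], target: VertexId) =
  graph.edges.filter(e => e.srcId == target || e.dstId == target)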

Regards,
Jeff

2015-03-18 10:52 GMT+01:00 mas mas.ha...@gmail.com:

 Hi,

 Just to continue with the question:
 I need to find the edges of one particular vertex. However,
 collectNeighbors/collectNeighborIds provides the neighbors/neighbor IDs
 for all the vertices of the graph.
 Any help in this regard will be highly appreciated.
 Thanks,







Re: GraphX Snapshot Partitioning

2015-03-14 Thread Takeshi Yamamuro
Large edge partitions could cause java.lang.OutOfMemoryError, and then
Spark tasks fail.

FWIW, each edge partition can have at most 2^32 edges, because the 64-bit
vertex IDs are mapped into 32-bit ones within each partition.
If the number of edges goes over that limit, GraphX could throw an
ArrayIndexOutOfBoundsException or similar. So watch out: a partition can
end up holding more edges than you expect.
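
One way to check how close each partition is to that ceiling (a sketch,
assuming a graph value is already loaded):

// Count edges per partition to spot skew or partitions approaching
// the 2^32-edge limit mentioned above.
graph.edges
  .mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
  .collect()
  .foreach { case (i, n) => println(s"partition $i: $n edges") }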





On Wed, Mar 11, 2015 at 11:42 PM, Matthew Bucci mrbucci...@gmail.com
wrote:

 Hi,

 Thanks for the response! That answered some questions I had, but the last
 thing I was wondering is what happens if you run a partition strategy and
 one of the partitions ends up being too large. For example, let's say
 partitions can hold 64MB (actually, knowing the maximum possible size of a
 partition would probably also be helpful to me). You try to partition the
 edges of a graph into 3 separate partitions, but the edges in the first
 partition end up being 80MB worth of edges, so they cannot all fit in the
 first partition. Would the extra 16MB flood over into a new 4th partition,
 would the system try to split it so that the 1st and 4th partitions are
 both at 40MB, or would the partition strategy just fail with a memory
 error?

 Thank You,
 Matthew Bucci

 On Mon, Mar 9, 2015 at 11:07 PM, Takeshi Yamamuro linguin@gmail.com
 wrote:

 Hi,

 Vertices are simply hash-partitioned by their 64-bit IDs, so
 they are evenly spread over partitions.

 As for edges, GraphLoader#edgeList builds edge partitions
 through hadoopFile(), so the initial partitions depend
 on the InputFormat#getSplits implementation
 (e.g., partitions mostly equal to 64MB blocks for HDFS).

 Edges can be re-partitioned by PartitionStrategy;
 a graph is partitioned considering graph structure, and
 the source ID and destination ID are used as partition keys.
 The partitions might suffer from skewness depending
 on graph properties (hub nodes, or similar).

 Thanks,
 takeshi


 On Tue, Mar 10, 2015 at 2:21 AM, Matthew Bucci mrbucci...@gmail.com
 wrote:

 Hello,

 I am working on a project where we want to split graphs of data into
 snapshots across partitions, and I was wondering what would happen if one
 of the snapshots was too large to fit into a single partition. Would the
 snapshot be split over two partitions equally, for example? And how is a
 single snapshot spread over multiple partitions?

 Thank You,
 Matthew Bucci







 --
 ---
 Takeshi Yamamuro





-- 
---
Takeshi Yamamuro


Re: [GRAPHX] could not process graph with 230M edges

2015-03-14 Thread Takeshi Yamamuro
Hi,

If you have heap problems in Spark/GraphX, it'd be better to split the
partitions into smaller ones so that each partition fits in memory.
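
For example, a sketch of requesting more, smaller edge partitions at load
time (the path and partition count are made up for illustration;
minEdgePartitions is the parameter name used in the code later in this
thread):

import org.apache.spark.graphx.GraphLoader

// More partitions means fewer edges per partition, so each one is more
// likely to fit in executor memory.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt",
  minEdgePartitions = 480)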

On Sat, Mar 14, 2015 at 12:09 AM, Hlib Mykhailenko 
hlib.mykhaile...@inria.fr wrote:

 Hello,

 I cannot process a graph with 230M edges.
 I cloned apache/spark, built it, and then tried it on a cluster.

 I used a Spark standalone cluster:
 - 5 machines (each has 12 cores / 32GB RAM)
 - 'spark.executor.memory' == 25g
 - 'spark.driver.memory' == 3g

 The graph has 231359027 edges, and its file weighs 4,524,716,369 bytes.
 The graph is represented in text format:
 <source vertex id> <destination vertex id>

 My code:

 object Canonical {

   def main(args: Array[String]) {

     val numberOfArguments = 3
     require(args.length == numberOfArguments,
       s"""Wrong argument number. Should be $numberOfArguments.
          |Usage: <path_to_graph> <partitioner_name> <minEdgePartitions>""".stripMargin)

     var graph: Graph[Int, Int] = null
     val nameOfGraph = args(0).substring(args(0).lastIndexOf("/") + 1)
     val partitionerName = args(1)
     val minEdgePartitions = args(2).toInt

     val sc = new SparkContext(new SparkConf()
       .setSparkHome(System.getenv("SPARK_HOME"))
       .setAppName(s"partitioning | $nameOfGraph | $partitionerName | $minEdgePartitions parts")
       .setJars(SparkContext.jarOfClass(this.getClass).toList))

     graph = GraphLoader.edgeListFile(sc, args(0), false,
       edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
       vertexStorageLevel = StorageLevel.MEMORY_AND_DISK,
       minEdgePartitions = minEdgePartitions)
     graph = graph.partitionBy(PartitionStrategy.fromString(partitionerName))
     println(graph.edges.collect.length)
     println(graph.vertices.collect.length)
   }
 }

 After I ran it, I encountered a number of java.lang.OutOfMemoryError: Java
 heap space errors, and of course I did not get a result.

 Do I have a problem in the code, or in the cluster configuration?

 It works fine for relatively small graphs, but for this graph it never
 worked. (And I do not think that 230M edges is too big a dataset.)

 Thank you for any advice!



 --
 Regards,
 *Hlib Mykhailenko*
 PhD student at INRIA Sophia-Antipolis Méditerranée
 http://www.inria.fr/centre/sophia/
 2004 Route des Lucioles BP93
 06902 SOPHIA ANTIPOLIS cedex




-- 
---
Takeshi Yamamuro


Re: GraphX Snapshot Partitioning

2015-03-11 Thread Matthew Bucci
Hi,

Thanks for the response! That answered some questions I had, but the last
thing I was wondering is what happens if you run a partition strategy and
one of the partitions ends up being too large. For example, let's say
partitions can hold 64MB (actually, knowing the maximum possible size of a
partition would probably also be helpful to me). You try to partition the
edges of a graph into 3 separate partitions, but the edges in the first
partition end up being 80MB worth of edges, so they cannot all fit in the
first partition. Would the extra 16MB flood over into a new 4th partition,
would the system try to split it so that the 1st and 4th partitions are
both at 40MB, or would the partition strategy just fail with a memory
error?

Thank You,
Matthew Bucci

On Mon, Mar 9, 2015 at 11:07 PM, Takeshi Yamamuro linguin@gmail.com
wrote:

 Hi,

 Vertices are simply hash-partitioned by their 64-bit IDs, so
 they are evenly spread over partitions.

 As for edges, GraphLoader#edgeList builds edge partitions
 through hadoopFile(), so the initial partitions depend
 on the InputFormat#getSplits implementation
 (e.g., partitions mostly equal to 64MB blocks for HDFS).

 Edges can be re-partitioned by PartitionStrategy;
 a graph is partitioned considering graph structure, and
 the source ID and destination ID are used as partition keys.
 The partitions might suffer from skewness depending
 on graph properties (hub nodes, or similar).

 Thanks,
 takeshi


 On Tue, Mar 10, 2015 at 2:21 AM, Matthew Bucci mrbucci...@gmail.com
 wrote:

 Hello,

 I am working on a project where we want to split graphs of data into
 snapshots across partitions, and I was wondering what would happen if one
 of the snapshots was too large to fit into a single partition. Would the
 snapshot be split over two partitions equally, for example? And how is a
 single snapshot spread over multiple partitions?

 Thank You,
 Matthew Bucci







 --
 ---
 Takeshi Yamamuro



Re: GraphX Snapshot Partitioning

2015-03-09 Thread Takeshi Yamamuro
Hi,

Vertices are simply hash-partitioned by their 64-bit IDs, so
they are evenly spread over partitions.

As for edges, GraphLoader#edgeList builds edge partitions
through hadoopFile(), so the initial partitions depend
on the InputFormat#getSplits implementation
(e.g., partitions mostly equal to 64MB blocks for HDFS).

Edges can be re-partitioned by PartitionStrategy;
a graph is partitioned considering graph structure, and
the source ID and destination ID are used as partition keys.
The partitions might suffer from skewness depending
on graph properties (hub nodes, or similar).
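
For illustration, a sketch of re-partitioning after load (the path is made
up; EdgePartition2D is one of the built-in strategies):

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// 2D partitioning bounds each vertex's replication to about
// 2 * sqrt(numParts) partitions, which softens the impact of hub nodes.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
  .partitionBy(PartitionStrategy.EdgePartition2D)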

Thanks,
takeshi


On Tue, Mar 10, 2015 at 2:21 AM, Matthew Bucci mrbucci...@gmail.com wrote:

 Hello,

 I am working on a project where we want to split graphs of data into
 snapshots across partitions, and I was wondering what would happen if one
 of the snapshots was too large to fit into a single partition. Would the
 snapshot be split over two partitions equally, for example? And how is a
 single snapshot spread over multiple partitions?

 Thank You,
 Matthew Bucci







-- 
---
Takeshi Yamamuro


Re: GraphX path traversal

2015-03-04 Thread Robin East
Actually your Pregel code works for me:

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val vertexlist = Array((1L, "One"), (2L, "Two"), (3L, "Three"),
  (4L, "Four"), (5L, "Five"), (6L, "Six"))
val edgelist = Array(Edge(6L, 5L, "6 to 5"), Edge(5L, 4L, "5 to 4"),
  Edge(4L, 3L, "4 to 3"), Edge(3L, 2L, "3 to 2"), Edge(2L, 1L, "2 to 1"))
val vertices: RDD[(VertexId, String)] =  sc.parallelize(vertexlist)
val edges = sc.parallelize(edgelist)
val graph = Graph(vertices, edges)


val parentGraph = Pregel(
  graph.mapVertices((id, attr) => Set[VertexId]()),
  Set[VertexId](),
  Int.MaxValue,
  EdgeDirection.Out)(
  (id, attr, msg) => msg ++ attr,
  edge => {
    if (edge.srcId != edge.dstId)
      Iterator((edge.dstId, edge.srcAttr + edge.srcId))
    else
      Iterator.empty
  },
  (a, b) => a ++ b)
parentGraph.vertices.collect.foreach(println(_))

Output:

(4,Set(6, 5))
(1,Set(5, 6, 2, 3, 4))
(5,Set(6))
(6,Set())
(2,Set(6, 5, 4, 3))
(3,Set(6, 5, 4))

Maybe your data.csv has the edges the wrong way round.
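
If so, one possible fix (untested against the original data) is to flip
every edge before running the same Pregel program:

// Graph.reverse swaps srcId and dstId on every edge, so the messages in
// the program above would then flow in the opposite direction.
val flipped = graph.reverse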

Robin

 On 3 Mar 2015, at 16:32, Madabhattula Rajesh Kumar mrajaf...@gmail.com 
 wrote:
 
 Hi,
 
 I have tried the below program using the Pregel API, but I'm not able to get my
 required output. I'm getting exactly the reverse of the output I'm expecting.
 
 // Creating the graph using the edge file mentioned in the earlier mail
 val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc,
   "/home/rajesh/Downloads/graphdata/data.csv").cache()

 val parentGraph = Pregel(
   graph.mapVertices((id, attr) => Set[VertexId]()),
   Set[VertexId](),
   Int.MaxValue,
   EdgeDirection.Out)(
   (id, attr, msg) => msg ++ attr,
   edge => {
     if (edge.srcId != edge.dstId)
       Iterator((edge.dstId, edge.srcAttr + edge.srcId))
     else
       Iterator.empty
   },
   (a, b) => a ++ b)
 parentGraph.vertices.collect.foreach(println(_))
 
 Output:
 
 (4,Set(1, 2, 3))
 (1,Set())
 (6,Set(5, 1, 2, 3, 4))
 (3,Set(1, 2))
 (5,Set(1, 2, 3, 4))
 (2,Set(1))
 
 But I'm looking for the below output:
 
 (4,Set(5, 6))
 (1,Set(2, 3, 4, 5, 6))
 (6,Set())
 (3,Set(4, 5, 6))
 (5,Set(6))
 (2,Set(3, 4, 5, 6))
 
 Could you please tell me where I'm going wrong?
 
 Regards,
 Rajesh
  
 
 On Tue, Mar 3, 2015 at 8:42 PM, Madabhattula Rajesh Kumar 
 mrajaf...@gmail.com wrote:
 Hi Robin,
 
 Thank you for your response. Please find my question below. I have the below
 edge file:
 
 Source Vertex Destination Vertex
 1 2
 2 3
 3 4
 4 5
 5 6
 6 6
 
 In this graph the 1st vertex is connected to the 2nd vertex, the 2nd vertex is
 connected to the 3rd vertex, ..., and the 6th vertex is connected to itself. So
 the 6th vertex is the root node. Please find the graph below.
 
 In this graph, how can I compute the 1st vertex's parents, i.e. 2,3,4,5,6?
 Similarly, the 2nd vertex's parents are 3,4,5,6, and the 6th vertex's parent
 is 6 because it is the root node.
 
 I'm planning to use the Pregel API, but I'm not able to define the messages
 and the vertex program in that API. Could you please help me with this?
 
 Please let me know if you need more information.
 
 Regards,
 Rajesh
 
 
 On Tue, Mar 3, 2015 at 8:15 PM, Robin East robin.e...@xense.co.uk wrote:
 Rajesh
 
 I'm not sure if I can help you; in fact I don't even understand the question.
 Could you restate what you are trying to do?
 
 Sent from my iPhone
 
 On 2 Mar 2015, at 11:17, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote:
 
 Hi,
 
 I have the below edge list. How do I find the parent path for every vertex?
 
 Example :
 
 Vertex 1 path : 2, 3, 4, 5, 6
 Vertex 2 path : 3, 4, 5, 6
 Vertex 3 path : 4, 5, 6
 Vertex 4 path : 5, 6
 Vertex 5 path : 6
 
 Could you please let me know how to do this, or make any suggestion?
 
 Source Vertex   Destination Vertex
 1   2
 2   3
 3   4
 4   5
 5   6
 
 Regards,
 Rajesh
 
 



Re: GraphX path traversal

2015-03-03 Thread Madabhattula Rajesh Kumar
Hi,

Could you please let me know how to do this, or make any suggestion?

Regards,
Rajesh

On Mon, Mar 2, 2015 at 4:47 PM, Madabhattula Rajesh Kumar 
mrajaf...@gmail.com wrote:

 Hi,

 I have the below edge list. How do I find the parent path for every vertex?

 Example :

 Vertex 1 path : 2, 3, 4, 5, 6
 Vertex 2 path : 3, 4, 5, 6
 Vertex 3 path : 4, 5, 6
 Vertex 4 path : 5, 6
 Vertex 5 path : 6

 Could you please let me know how to do this, or make any suggestion?

   Source Vertex   Destination Vertex
   1   2
   2   3
   3   4
   4   5
   5   6
 Regards,
 Rajesh



Re: GraphX path traversal

2015-03-03 Thread Madabhattula Rajesh Kumar
Hi Robin,

Thank you for your response. Please find my question below. I have the below
edge file:

  Source Vertex   Destination Vertex
  1   2
  2   3
  3   4
  4   5
  5   6
  6   6
In this graph the 1st vertex is connected to the 2nd vertex, the 2nd vertex is
connected to the 3rd vertex, ..., and the 6th vertex is connected to itself. So
the 6th vertex is the root node. Please find the graph below.

In this graph, how can I compute the 1st vertex's parents, i.e. 2,3,4,5,6?
Similarly, the 2nd vertex's parents are 3,4,5,6, and the 6th vertex's parent
is 6 because it is the root node.

I'm planning to use the Pregel API, but I'm not able to define the messages
and the vertex program in that API. Could you please help me with this?

Please let me know if you need more information.

Regards,
Rajesh


On Tue, Mar 3, 2015 at 8:15 PM, Robin East robin.e...@xense.co.uk wrote:

 Rajesh

 I'm not sure if I can help you; in fact I don't even understand the
 question. Could you restate what you are trying to do?

 Sent from my iPhone

 On 2 Mar 2015, at 11:17, Madabhattula Rajesh Kumar mrajaf...@gmail.com
 wrote:

 Hi,

 I have the below edge list. How do I find the parent path for every vertex?

 Example :

 Vertex 1 path : 2, 3, 4, 5, 6
 Vertex 2 path : 3, 4, 5, 6
 Vertex 3 path : 4, 5, 6
 Vertex 4 path : 5, 6
 Vertex 5 path : 6

 Could you please let me know how to do this, or make any suggestion?

   Source Vertex   Destination Vertex
   1   2
   2   3
   3   4
   4   5
   5   6
 Regards,
 Rajesh




Re: GraphX path traversal

2015-03-03 Thread Madabhattula Rajesh Kumar
Hi,

I have tried the below program using the Pregel API, but I'm not able to get my
required output. I'm getting exactly the reverse of the output I'm expecting.

// Creating the graph using the edge file mentioned in the earlier mail
val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc,
  "/home/rajesh/Downloads/graphdata/data.csv").cache()

val parentGraph = Pregel(
  graph.mapVertices((id, attr) => Set[VertexId]()),
  Set[VertexId](),
  Int.MaxValue,
  EdgeDirection.Out)(
  (id, attr, msg) => msg ++ attr,
  edge => {
    if (edge.srcId != edge.dstId)
      Iterator((edge.dstId, edge.srcAttr + edge.srcId))
    else
      Iterator.empty
  },
  (a, b) => a ++ b)
parentGraph.vertices.collect.foreach(println(_))

*Output:*

(4,Set(1, 2, 3))
(1,Set())
(6,Set(5, 1, 2, 3, 4))
(3,Set(1, 2))
(5,Set(1, 2, 3, 4))
(2,Set(1))

*But I'm looking for the below output:*

(4,Set(5, 6))
(1,Set(2, 3, 4, 5, 6))
(6,Set())
(3,Set(4, 5, 6))
(5,Set(6))
(2,Set(3, 4, 5, 6))

Could you please tell me where I'm going wrong?

Regards,
Rajesh


On Tue, Mar 3, 2015 at 8:42 PM, Madabhattula Rajesh Kumar 
mrajaf...@gmail.com wrote:

 Hi Robin,

 Thank you for your response. Please find my question below. I have the below
 edge file:

   Source Vertex   Destination Vertex
   1   2
   2   3
   3   4
   4   5
   5   6
   6   6
 In this graph the 1st vertex is connected to the 2nd vertex, the 2nd vertex is
 connected to the 3rd vertex, ..., and the 6th vertex is connected to itself. So
 the 6th vertex is the root node. Please find the graph below.

 In this graph, how can I compute the 1st vertex's parents, i.e. 2,3,4,5,6?
 Similarly, the 2nd vertex's parents are 3,4,5,6, and the 6th vertex's parent
 is 6 because it is the root node.

 I'm planning to use the Pregel API, but I'm not able to define the messages
 and the vertex program in that API. Could you please help me with this?

 Please let me know if you need more information.

 Regards,
 Rajesh


 On Tue, Mar 3, 2015 at 8:15 PM, Robin East robin.e...@xense.co.uk wrote:

 Rajesh

 I'm not sure if I can help you; in fact I don't even understand the
 question. Could you restate what you are trying to do?

 Sent from my iPhone

 On 2 Mar 2015, at 11:17, Madabhattula Rajesh Kumar mrajaf...@gmail.com
 wrote:

 Hi,

 I have the below edge list. How do I find the parent path for every vertex?

 Example :

 Vertex 1 path : 2, 3, 4, 5, 6
 Vertex 2 path : 3, 4, 5, 6
 Vertex 3 path : 4, 5, 6
 Vertex 4 path : 5, 6
 Vertex 5 path : 6

 Could you please let me know how to do this, or make any suggestion?

   Source Vertex   Destination Vertex
   1   2
   2   3
   3   4
   4   5
   5   6
 Regards,
 Rajesh





Re: GraphX path traversal

2015-03-03 Thread Robin East
Have you tried EdgeDirection.In?
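
A sketch of that suggestion (untested against the original data): flip both
the activeDirection and the direction the messages travel, so each vertex
accumulates its ancestors instead of its descendants:

val parentGraph = Pregel(
  graph.mapVertices((id, attr) => Set[VertexId]()),
  Set[VertexId](),
  Int.MaxValue,
  EdgeDirection.In)(
  (id, attr, msg) => msg ++ attr,
  edge => {
    if (edge.srcId != edge.dstId)
      // send the destination's set, plus its id, back to the source
      Iterator((edge.srcId, edge.dstAttr + edge.dstId))
    else
      Iterator.empty
  },
  (a, b) => a ++ b)
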
 On 3 Mar 2015, at 16:32, Robin East robin.e...@xense.co.uk wrote:
 
 What about the following which can be run in spark shell:
 
 import org.apache.spark._
 import org.apache.spark.graphx._
 import org.apache.spark.rdd.RDD
 
 val vertexlist = Array((1L, "One"), (2L, "Two"), (3L, "Three"),
 (4L, "Four"), (5L, "Five"), (6L, "Six"))
 val edgelist = Array(Edge(6L, 5L, "6 to 5"), Edge(5L, 4L, "5 to 4"),
 Edge(4L, 3L, "4 to 3"), Edge(3L, 2L, "3 to 2"), Edge(2L, 1L, "2 to 1"))
 val vertices: RDD[(VertexId, String)] =  sc.parallelize(vertexlist)
 val edges = sc.parallelize(edgelist)
 val graph = Graph(vertices, edges)
 
 val triplets = graph.triplets
 
 triplets.foreach(t => println(s"parent for ${t.dstId} is ${t.srcId}"))
 
 It doesn’t set vertex 6 to have parent 6 but you get the idea.
 
 It doesn’t use Pregel but that sounds like overkill for what you are trying 
 to achieve.
 
 Does that answer your question or were you after something different?
 
 
 
 On 3 Mar 2015, at 15:12, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote:
 
 Hi Robin,
 
 Thank you for your response. Please find my question below. I have the below
 edge file:
 
 Source Vertex   Destination Vertex
 1   2
 2   3
 3   4
 4   5
 5   6
 6   6
 
 In this graph the 1st vertex is connected to the 2nd vertex, the 2nd vertex is
 connected to the 3rd vertex, ..., and the 6th vertex is connected to itself. So
 the 6th vertex is the root node. Please find the graph below.
 
 In this graph, how can I compute the 1st vertex's parents, i.e. 2,3,4,5,6?
 Similarly, the 2nd vertex's parents are 3,4,5,6, and the 6th vertex's parent
 is 6 because it is the root node.
 
 I'm planning to use the Pregel API, but I'm not able to define the messages
 and the vertex program in that API. Could you please help me with this?
 
 Please let me know if you need more information.
 
 Regards,
 Rajesh
 
 
 On Tue, Mar 3, 2015 at 8:15 PM, Robin East robin.e...@xense.co.uk wrote:
 Rajesh
 
 I'm not sure if I can help you; in fact I don't even understand the
 question. Could you restate what you are trying to do?
 
 Sent from my iPhone
 
 On 2 Mar 2015, at 11:17, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote:
 
 Hi,
 
 I have the below edge list. How do I find the parent path for every vertex?
 
 Example :
 
 Vertex 1 path : 2, 3, 4, 5, 6
 Vertex 2 path : 3, 4, 5, 6
 Vertex 3 path : 4, 5, 6
 Vertex 4 path : 5, 6
 Vertex 5 path : 6
 
 Could you please let me know how to do this, or make any suggestion?
 
 Source Vertex   Destination Vertex
 1   2
 2   3
 3   4
 4   5
 5   6
 
 Regards,
 Rajesh



Re: GraphX path traversal

2015-03-03 Thread Robin East
What about the following which can be run in spark shell:

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val vertexlist = Array((1L, "One"), (2L, "Two"), (3L, "Three"),
  (4L, "Four"), (5L, "Five"), (6L, "Six"))
val edgelist = Array(Edge(6L, 5L, "6 to 5"), Edge(5L, 4L, "5 to 4"),
  Edge(4L, 3L, "4 to 3"), Edge(3L, 2L, "3 to 2"), Edge(2L, 1L, "2 to 1"))
val vertices: RDD[(VertexId, String)] = sc.parallelize(vertexlist)
val edges = sc.parallelize(edgelist)
val graph = Graph(vertices, edges)

val triplets = graph.triplets

triplets.foreach(t => println(s"parent for ${t.dstId} is ${t.srcId}"))

It doesn’t set vertex 6 to have parent 6 but you get the idea.

It doesn’t use Pregel but that sounds like overkill for what you are trying to 
achieve.

Does that answer your question or were you after something different?



 On 3 Mar 2015, at 15:12, Madabhattula Rajesh Kumar mrajaf...@gmail.com 
 wrote:
 
 Hi Robin,
 
 Thank you for your response. Please find my question below. I have the below
 edge file:
 
 Source Vertex Destination Vertex
 1 2
 2 3
 3 4
 4 5
 5 6
 6 6
 
 In this graph the 1st vertex is connected to the 2nd vertex, the 2nd vertex is
 connected to the 3rd vertex, ..., and the 6th vertex is connected to itself. So
 the 6th vertex is the root node. Please find the graph below.
 
 In this graph, how can I compute the 1st vertex's parents, i.e. 2,3,4,5,6?
 Similarly, the 2nd vertex's parents are 3,4,5,6, and the 6th vertex's parent
 is 6 because it is the root node.
 
 I'm planning to use the Pregel API, but I'm not able to define the messages
 and the vertex program in that API. Could you please help me with this?
 
 Please let me know if you need more information.
 
 Regards,
 Rajesh
 
 
 On Tue, Mar 3, 2015 at 8:15 PM, Robin East robin.e...@xense.co.uk wrote:
 Rajesh
 
 I'm not sure if I can help you; in fact I don't even understand the question.
 Could you restate what you are trying to do?
 
 Sent from my iPhone
 
 On 2 Mar 2015, at 11:17, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote:
 
 Hi,
 
 I have the below edge list. How do I find the parent path for every vertex?
 
 Example :
 
 Vertex 1 path : 2, 3, 4, 5, 6
 Vertex 2 path : 3, 4, 5, 6
 Vertex 3 path : 4, 5, 6
 Vertex 4 path : 5, 6
 Vertex 5 path : 6
 
 Could you please let me know how to do this, or make any suggestion?
 
 Source Vertex   Destination Vertex
 1   2
 2   3
 3   4
 4   5
 5   6
 
 Regards,
 Rajesh
 



Re: GraphX path traversal

2015-03-03 Thread Robin East
Rajesh

I'm not sure if I can help you; in fact I don't even understand the question.
Could you restate what you are trying to do?

Sent from my iPhone

 On 2 Mar 2015, at 11:17, Madabhattula Rajesh Kumar mrajaf...@gmail.com 
 wrote:
 
 Hi,
 
 I have the below edge list. How do I find the parent path for every vertex?
 
 Example :
 
 Vertex 1 path : 2, 3, 4, 5, 6
 Vertex 2 path : 3, 4, 5, 6
 Vertex 3 path : 4, 5, 6
 Vertex 4 path : 5, 6
 Vertex 5 path : 6
 
 Could you please let me know how to do this, or make any suggestion?
 
 Source Vertex Destination Vertex
 1 2
 2 3
 3 4
 4 5
 5 6
 
 Regards,
 Rajesh


Re: [GraphX] Excessive value recalculations during aggregateMessages cycles

2015-02-15 Thread Takeshi Yamamuro
Hi,

I tried some quick and simple tests, and it seems to me the vertices below
were correctly cached.
Could you tell me the differences between my code and yours?

import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._

object Prog {
  def processInt(d: Int) = d * 2
}

val g = GraphLoader.edgeListFile(sc, "../temp/graph.txt")
  .cache

val g2 = g.outerJoinVertices(g.degrees)(
    (vid, old, msg) => Prog.processInt(msg.getOrElse(0)))
  .cache

g2.vertices.count

val g3 = g.outerJoinVertices(g.degrees)(
    (vid, old, msg) => msg.getOrElse(0))
  .mapVertices((vid, d) => Prog.processInt(d))
  .cache

g3.vertices.count

'g2.vertices.toDebugString' outputs;

(2) VertexRDDImpl[16] at RDD at VertexRDD.scala:57 []
 |  VertexRDD ZippedPartitionsRDD2[15] at zipPartitions at
VertexRDDImpl.scala:121 []
 |  CachedPartitions: 2; MemorySize: 3.3 KB; TachyonSize: 0.0 B;
DiskSize: 0.0 B
 |  VertexRDD, VertexRDD MapPartitionsRDD[8] at mapPartitions at
VertexRDD.scala:319 []
 |  CachedPartitions: 2; MemorySize: 3.3 KB; TachyonSize: 0.0 B;
DiskSize: 0.0 B
 |  MapPartitionsRDD[7] at mapPartitions at VertexRDD.scala:335 []
 |  ShuffledRDD[6] at partitionBy at VertexRDD.scala:335 []
 +-(2) VertexRDD.createRoutingTables - vid2pid (aggregation)
MapPartitionsRDD[5] at mapPartitions at VertexRDD.scala:330 []
|  GraphLoader.edgeListFile - edges (../temp/graph.txt), EdgeRDD,
EdgeRDD MapPartitionsRDD[2] at mapPartitionsWithIndex at Graph...


'g3.vertices.toDebugString' outputs;

(2) VertexRDDImpl[33] at RDD at VertexRDD.scala:57 []
 |  VertexRDD MapPartitionsRDD[32] at mapPartitions at
VertexRDDImpl.scala:96 []
 |  CachedPartitions: 2; MemorySize: 3.3 KB; TachyonSize: 0.0 B;
DiskSize: 0.0 B
 |  VertexRDD ZippedPartitionsRDD2[24] at zipPartitions at
VertexRDDImpl.scala:121 []
 |  CachedPartitions: 2; MemorySize: 3.3 KB; TachyonSize: 0.0 B;
DiskSize: 0.0 B
 |  VertexRDD, VertexRDD MapPartitionsRDD[8] at mapPartitions at
VertexRDD.scala:319 []
 |  CachedPartitions: 2; MemorySize: 3.3 KB; TachyonSize: 0.0 B;
DiskSize: 0.0 B
 |  MapPartitionsRDD[7] at mapPartitions at VertexRDD.scala:335 []
 |  ShuffledRDD[6] at partitionBy at VertexRDD.scala:335 []
 +-(2) VertexRDD.createRoutingTables - vid2pid (aggregation)
MapPartitionsRDD[5] at mapPar...

-- maropu

On Mon, Feb 9, 2015 at 5:47 AM, Kyle Ellrott kellr...@soe.ucsc.edu wrote:

 I changed the

 curGraph = curGraph.outerJoinVertices(curMessages)(
   (vid, vertex, message) =>
     vertex.process(message.getOrElse(List[Message]()), ti)
 ).cache()

 to

 curGraph = curGraph.outerJoinVertices(curMessages)(
   (vid, vertex, message) => (vertex,
     message.getOrElse(List[Message]()))
 ).mapVertices((x, y) => y._1.process(y._2, ti)).cache()

 So the call to the 'process' method was moved out of the outerJoinVertices
 and into a separate mapVertices call, and the problem went away. Now,
 'process' is only called once during the correct cycle.
 So it would appear that outerJoinVertices caches the closure to be
 recalculated if needed again while mapVertices actually caches the
 derived values.

 Is this a bug or a feature?

 Kyle



 On Sat, Feb 7, 2015 at 11:44 PM, Kyle Ellrott kellr...@soe.ucsc.edu
 wrote:

 I'm trying to setup a simple iterative message/update problem in GraphX
 (spark 1.2.0), but I'm running into issues with the caching and
 re-calculation of data. I'm trying to follow the example found in the
 Pregel implementation of materializing and caching messages and graphs and
 then unpersisting them after the next cycle has been done.
 It doesn't seem to be working, because every cycle gets progressively
 slower and it seems as if more and more of the values are being
 re-calculated despite my attempts to cache them.

 The code:
 ```
   var oldMessages : VertexRDD[List[Message]] = null
   var oldGraph : Graph[MyVertex, MyEdge] = null
   curGraph = curGraph.mapVertices((x, y) => y.init())
   for (i <- 0 to cycle_count) {
     val curMessages = curGraph.aggregateMessages[List[Message]](x => {
       // send messages
       ...
     },
     (x, y) => {
       // collect messages into lists
       val out = x ++ y
       out
     }).cache()
     curMessages.count()
     val ti = i
     oldGraph = curGraph
     curGraph = curGraph.outerJoinVertices(curMessages)(
       (vid, vertex, message) =>
         vertex.process(message.getOrElse(List[Message]()), ti)
     ).cache()
     curGraph.vertices.count()
     oldGraph.unpersistVertices(blocking = false)
     oldGraph.edges.unpersist(blocking = false)
     oldGraph = curGraph
     if (oldMessages != null) {
       oldMessages.unpersist(blocking = false)
     }
     oldMessages = curMessages
   }
 ```

 The MyVertex.process method takes the list of incoming messages, averages
 them and returns a new MyVertex object. I've also set it up to append the
 cycle number (the second argument) to a log file named after the vertex.

Re: [GraphX] Excessive value recalculations during aggregateMessages cycles

2015-02-08 Thread Kyle Ellrott
I changed the

curGraph = curGraph.outerJoinVertices(curMessages)(
  (vid, vertex, message) =>
    vertex.process(message.getOrElse(List[Message]()), ti)
).cache()

to

curGraph = curGraph.outerJoinVertices(curMessages)(
  (vid, vertex, message) => (vertex,
    message.getOrElse(List[Message]()))
).mapVertices((x, y) => y._1.process(y._2, ti)).cache()

So the call to the 'process' method was moved out of the outerJoinVertices
and into a separate mapVertices call, and the problem went away. Now,
'process' is only called once during the correct cycle.
So it would appear that outerJoinVertices caches the closure to be
recalculated if needed again while mapVertices actually caches the derived
values.

Is this a bug or a feature?

Kyle



On Sat, Feb 7, 2015 at 11:44 PM, Kyle Ellrott kellr...@soe.ucsc.edu wrote:

 I'm trying to setup a simple iterative message/update problem in GraphX
 (spark 1.2.0), but I'm running into issues with the caching and
 re-calculation of data. I'm trying to follow the example found in the
 Pregel implementation of materializing and caching messages and graphs and
 then unpersisting them after the next cycle has been done.
 It doesn't seem to be working, because every cycle gets progressively
 slower and it seems as if more and more of the values are being
 re-calculated despite my attempts to cache them.

 The code:
 ```
   var oldMessages : VertexRDD[List[Message]] = null
   var oldGraph : Graph[MyVertex, MyEdge ] = null
   curGraph = curGraph.mapVertices((x, y) = y.init())
   for (i - 0 to cycle_count) {
 val curMessages = curGraph.aggregateMessages[List[Message]](x = {
   //send messages
   .
 },
 (x, y) = {
//collect messages into lists
 val out = x ++ y
 out
   }
 ).cache()
 curMessages.count()
 val ti = i
 oldGraph = curGraph
 curGraph = curGraph.outerJoinVertices(curMessages)(
   (vid, vertex, message) =
 vertex.process(message.getOrElse(List[Message]()), ti)
 ).cache()
 curGraph.vertices.count()
 oldGraph.unpersistVertices(blocking = false)
 oldGraph.edges.unpersist(blocking = false)
 oldGraph = curGraph
 if (oldMessages != null ) {
   oldMessages.unpersist(blocking=false)
 }
 oldMessages = curMessages
   }
 ```

 The MyVertex.process method takes the list of incoming messages, averages
 them and returns a new MyVertex object. I've also set it up to append the
 cycle number (the second argument) into a log file named after the vertex.
 What ends up getting dumped into the log file for every vertex (in the
 exact same pattern) is
 ```
 Cycle: 0
 Cycle: 1
 Cycle: 0
 Cycle: 2
 Cycle: 0
 Cycle: 0
 Cycle: 1
 Cycle: 3
 Cycle: 0
 Cycle: 0
 Cycle: 1
 Cycle: 0
 Cycle: 0
 Cycle: 1
 Cycle: 2
 Cycle: 4
 Cycle: 0
 Cycle: 0
 Cycle: 1
 Cycle: 0
 Cycle: 0
 Cycle: 1
 Cycle: 2
 Cycle: 0
 Cycle: 0
 Cycle: 1
 Cycle: 0
 Cycle: 0
 Cycle: 1
 Cycle: 2
 Cycle: 3
 Cycle: 5
 ```

 Any ideas about what I might be doing wrong for the caching? And how I can
 avoid re-calculating so many of the values.


 Kyle




