Re: GraphX: How can I tell if 2 nodes are connected?

2015-10-05 Thread Anwar Rizal
Maybe connected components is what you need?
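For example, a minimal sketch of that idea (the toy edge list and a SparkContext named sc are assumptions): two vertices are connected, ignoring edge direction, exactly when connectedComponents gives them the same component id.

import org.apache.spark.graphx._

// Toy graph: 1-2-3 form one component, 4-5 another.
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(4L, 5L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// connectedComponents labels each vertex with the lowest vertex id in its
// component, so equal labels mean the two vertices are connected.
val componentOf = graph.connectedComponents().vertices.collectAsMap()
componentOf(2L) == componentOf(3L)  // true: same component
componentOf(2L) == componentOf(5L)  // false: different components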
On Oct 5, 2015 19:02, "Robineast"  wrote:

> GraphX has a Shortest Paths algorithm implementation which will tell you,
> for all vertices in the graph, the shortest distance to a specific
> ('landmark') vertex. The returned value is 'a graph where each vertex
> attribute is a map containing the shortest-path distance to each reachable
> landmark vertex'. If there is no path to the landmark vertex then the map
> for the source vertex is empty.
>
>
>
> -
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
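A minimal sketch of the ShortestPaths approach Robin describes, assuming the sc, edges and graph values from the sketch above; the landmark vertex 3L is illustrative. A vertex is connected to the landmark exactly when its result map contains an entry for it.

import org.apache.spark.graphx.lib.ShortestPaths

// Each vertex attribute becomes a Map[VertexId, Int] of shortest-path distances
// to the reachable landmarks; an empty map means there is no path to the landmark.
val landmark = 3L
val result = ShortestPaths.run(graph, Seq(landmark))
val reachesLandmark = result.vertices.mapValues(_.contains(landmark))
reachesLandmark.collect().foreach(println)  // (vertexId, reachable or not)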


Re: Spark streaming alerting

2015-03-24 Thread Anwar Rizal
Helena,

The CassandraInputDStream sounds interesting. I don't find many things in
the JIRA though. Do you have more details on what it tries to achieve?

Thanks,
Anwar.

On Tue, Mar 24, 2015 at 2:39 PM, Helena Edelson helena.edel...@datastax.com
 wrote:

 Streaming _from_ cassandra, CassandraInputDStream, is coming BTW
 https://issues.apache.org/jira/browse/SPARK-6283
 I am working on it now.

 Helena
 @helenaedelson

 On Mar 23, 2015, at 5:22 AM, Khanderao Kand Gmail 
 khanderao.k...@gmail.com wrote:

 Akhil

 You are right in your answer to what Mohit wrote. However, what Mohit seems
 to be alluding to, but did not write clearly, might be different.

 Mohit

 You are wrong in saying that streaming generally works with HDFS and
 Cassandra. Streaming typically works with a streaming or queueing source like
 Kafka, Kinesis, Twitter, Flume, ZeroMQ, etc. (but it can also read from HDFS
 and S3). The streaming context (a receiver within the streaming context) gets
 events/messages/records and forms a time-window-based batch (an RDD).

 So there is a maximum gap of one window length between when an alert message
 was available to Spark and when the processing happens. I think this is what
 you meant.

 As per the Spark programming model, an RDD is the right way to deal with
 data. If you are fine with a minimum delay of, say, a second (based on the
 minimum batch window that a DStream can support), then what Rohit gave is the
 right model.

 Khanderao
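As a small illustration of that gap (the application name below is a made-up assumption): the batch interval passed to StreamingContext is what bounds how long an event can wait before its micro-batch is processed.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// With a 1-second batch interval, an alert-worthy event waits at most about
// one second before it lands in an RDD and the batch processing runs on it.
val conf = new SparkConf().setAppName("streaming-alerting-example")
val ssc = new StreamingContext(conf, Seconds(1))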

 On Mar 22, 2015, at 11:39 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 What do you mean you can't send it directly from the Spark workers? Here's a
 simple approach you could take:

 val data = ssc.textFileStream("sigmoid/")
 data.filter(_.contains("ERROR")).foreachRDD(rdd =>
   alert("Errors : " + rdd.count()))

 And the alert() function could be anything triggering an email or sending
 an SMS alert.

 Thanks
 Best Regards
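For illustration, a minimal sketch of such an alert() function, posting the message to a hypothetical HTTP webhook (the URL is an assumption; an email or SMS gateway call would slot in the same way). Note that the function passed to foreachRDD runs on the driver, so this does not send anything from the workers.

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Hypothetical alerting endpoint; replace with a real webhook/email/SMS gateway.
val webhookUrl = "https://alerts.example.com/notify"

def alert(message: String): Unit = {
  val conn = new URL(webhookUrl).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setDoOutput(true)
  val out = conn.getOutputStream
  try out.write(message.getBytes(StandardCharsets.UTF_8)) finally out.close()
  conn.getResponseCode  // trigger the request; the response body is not needed
  conn.disconnect()
}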

 On Sun, Mar 22, 2015 at 1:52 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

 Is there a module in Spark Streaming that lets you listen to
 alerts/conditions as they happen in the streaming module? Generally, Spark
 Streaming components will execute on a large set of clusters, like HDFS or
 Cassandra; however, when it comes to alerting you generally can't send it
 directly from the Spark workers, which means you need a way to listen to the
 alerts.






Re: propagating edges

2015-01-11 Thread Anwar Rizal
It looks similar to (a simpler version of) the connected components
implementation in GraphX. Have you checked that?

I have a question though: in your example, the graph is a tree. What is the
expected behavior if it is a more general graph?

Cheers,
Anwar Rizal.

On Mon, Jan 12, 2015 at 1:02 AM, dizzy5112 dave.zee...@gmail.com wrote:

 Hi all, I'm looking for some help in propagating some values on edges. What I
 want to achieve (see diagram) is, for each connected part of the graph, to
 assign an incrementing value to each of the out-links from the root node.
 This value restarts for the next part of the graph. I.e. node 1 has out-links
 to nodes 2, 3 and 4; the edge attributes for these will be 1, 2 and 3
 respectively. Each of the out-links from those nodes keeps this value right
 through to the final node on its path. For node 9, with an out-link to 10,
 this has an edge attribute of 1, etc. Thanks in advance for any help :) dave

 
 http://apache-spark-user-list.1001560.n3.nabble.com/file/n21086/Drawing2.jpg
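For what it's worth, a sketch of one way to do this with GraphX's Pregel operator, under the assumption that each connected part is a rooted tree as in the diagram (the behaviour on a more general graph is exactly the open question above). The function name and the choice of ordering a root's out-links by destination id are assumptions: number the root's out-links, seed each child vertex with that number, let Pregel push it down the paths, then copy it back onto the edges.

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

def labelBranches(graph: Graph[Int, Int]): Graph[Int, Int] = {
  // Roots are vertices with no incoming edges (one per connected part of a forest).
  val rootIds: Set[VertexId] = graph.vertices
    .leftOuterJoin(graph.inDegrees)
    .filter { case (_, (_, inDeg)) => inDeg.isEmpty }
    .keys.collect().toSet

  // Number each root's out-links 1, 2, 3, ... (ordered here by destination id).
  val childLabels: RDD[(VertexId, Int)] = graph.edges
    .filter(e => rootIds.contains(e.srcId))
    .map(e => (e.srcId, e.dstId))
    .groupByKey()
    .flatMap { case (_, dsts) =>
      dsts.toSeq.sorted.zipWithIndex.map { case (dst, i) => (dst, i + 1) }
    }

  // Seed: a root's direct children carry their branch number, everything else 0.
  val seeded: Graph[Int, Int] = graph.outerJoinVertices(childLabels) {
    (_, _, label) => label.getOrElse(0)
  }

  // Push the branch number down the out-links until every reachable vertex is labelled.
  val labelled = seeded.pregel(0)(
    (_, attr, msg) => math.max(attr, msg),
    triplet =>
      if (triplet.srcAttr > 0 && triplet.dstAttr == 0)
        Iterator((triplet.dstId, triplet.srcAttr))
      else Iterator.empty,
    (a, b) => math.max(a, b)
  )

  // Edge attribute = branch number: the child's label on a root's own out-links,
  // the source's label everywhere else.
  labelled.mapTriplets { t =>
    if (rootIds.contains(t.srcId)) t.dstAttr else t.srcAttr
  }
}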
 








Re: Find the file info of when load the data into RDD

2014-12-21 Thread Anwar Rizal
Yeah..., but apparently the catch with mapPartitionsWithInputSplit is that it
is tagged as DeveloperApi. Because of that, I'm not sure it's a good idea to
use the function.

For this problem, I had to create a subclass of HadoopRDD and use
mapPartitions instead.

Is there any reason why mapPartitionsWithInputSplit has the DeveloperApi
annotation? Is it possible to remove it?

Best regards,
Anwar Rizal.
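For reference, a minimal sketch of that DeveloperApi route (the input path is an assumption and sc is a SparkContext; the cast works because hadoopFile builds a HadoopRDD under the hood):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input")

// mapPartitionsWithInputSplit exposes the split, and hence the file, per partition.
val linesWithFile = rdd.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit { (split, iter) =>
    val file = split.asInstanceOf[FileSplit].getPath.getName
    iter.map { case (_, line) => (file, line.toString) }
  }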

On Sun, Dec 21, 2014 at 10:47 PM, Shuai Zheng szheng.c...@gmail.com wrote:

 I just found a possible answer:


 http://themodernlife.github.io/scala/spark/hadoop/hdfs/2014/09/28/spark-input-filename/

 I will give it a try. Although it is a bit troublesome, if it works it will
 give me what I want.

 Sorry to bother everyone here.

 Regards,

 Shuai

 On Sun, Dec 21, 2014 at 4:43 PM, Shuai Zheng szheng.c...@gmail.com
 wrote:

 Hi All,

 When I load a folder into an RDD, is there any way for me to find the input
 file name of a particular partition, so I can track which file each partition
 came from?

 In Hadoop, I can find this information through the following code:

 FileSplit fileSplit = (FileSplit) context.getInputSplit();
 String strFilename = fileSplit.getPath().getName();

 But how can I do this in spark?

 Regards,

 Shuai





Re: sc.textFileGroupByPath(*/*.txt)

2014-06-01 Thread Anwar Rizal
I presume that you need to have access to the path of each file you are
reading.

I don't know whether there is a good way to do that for HDFS; I had to list
and read the files myself, something like:

import java.net.URI
import scala.collection.mutable.ListBuffer
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext

def openWithPath(inputPath: String, sc: SparkContext) = {
  val path    = new Path(inputPath)
  val fs      = path.getFileSystem(sc.hadoopConfiguration)
  val filesIt = fs.listFiles(path, false)
  val paths   = new ListBuffer[URI]
  while (filesIt.hasNext) {
    paths += filesIt.next.getPath.toUri
  }
  // One RDD per file, tagged with that file's URI, then unioned together.
  val withPaths = paths.toList.map { p =>
    sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p.toString)
      .map { case (_, s) => (p, s.toString) }
  }
  withPaths.reduce { _ ++ _ }
}
...

I would be interested if there is a better way to do the same thing ...

Cheers,
a:
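One alternative worth mentioning, at least for modest file sizes: SparkContext.wholeTextFiles already yields (path, fileContent) pairs, which is close to the RDD[(PATH, Seq[TEXT])] shape asked about below. The path pattern here is an assumption.

// Each record is (fully qualified file path, entire file content); splitting
// the content gives per-line text. Each file is read as a single record.
val byPath = sc.wholeTextFiles("hdfs:///data/*.txt")
  .mapValues(content => content.split("\n").toSeq)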


On Sun, Jun 1, 2014 at 6:00 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Could you provide an example of what you mean?

 I know it's possible to create an RDD from a path with wildcards, like in
 the subject.

 For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
 provide a comma delimited list of paths.

 Nick

 On Sunday, 1 June 2014, Oleg Proudnikov oleg.proudni...@gmail.com wrote:

 Hi All,

 Is it possible to create an RDD from a directory tree of the following
 form?

 RDD[(PATH, Seq[TEXT])]

 Thank you,
 Oleg




Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Anwar Rizal
Can you clarify what you're trying to achieve here?

If you want to take only the top 10 of each RDD, why not sort followed by
take(10) on every RDD?

Or do you want to take the top 10 over the whole five minutes?

Cheers,



On Thu, May 29, 2014 at 2:04 PM, nilmish nilmish@gmail.com wrote:

 I have a DStream which consists of RDDs, one every 2 seconds. I have sorted
 each RDD and want to retain only the top 10 values and discard the rest. How
 can I retain only the top 10 values?

 I am trying to get the top 10 hashtags. Instead of sorting the entire
 5-minute counts (thereby incurring the cost of a data shuffle), I am trying
 to get the top 10 hashtags in each partition. I am stuck at how to retain the
 top 10 hashtags in each partition.
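A minimal sketch of the per-batch approach suggested above, assuming a DStream of (hashtag, count) pairs named counts (the name and element type are assumptions). RDD.top keeps a bounded heap in each partition and then merges the per-partition results, so it avoids a full sort and shuffle of the batch:

import org.apache.spark.streaming.dstream.DStream

def printTopTen(counts: DStream[(String, Int)]): Unit = {
  counts.foreachRDD { rdd =>
    // Order by the count; top(10) returns the ten largest elements to the driver.
    val top10 = rdd.top(10)(Ordering.by[(String, Int), Int](_._2))
    top10.foreach(println)
  }
}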


