output the datas(txt)

2016-02-27 Thread Bonsen
I get results from RDDs, like:
Array(Array(1,2,3),Array(2,3,4),Array(3,4,6))
How can I output them to 1.txt, like:
1 2 3
2 3 4
3 4 6
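
A minimal sketch of one way to do this (assuming the rows are still in an RDD, here called rdd; note that saveAsTextFile writes a directory of part files rather than a single file, so coalesce(1) is used only to keep everything in one part):

// Hedged sketch: rdd is assumed to be an RDD[Array[Int]], e.g.
// val rdd = sc.parallelize(Seq(Array(1, 2, 3), Array(2, 3, 4), Array(3, 4, 6)))
val lines = rdd.map(_.mkString(" "))        // "1 2 3", "2 3 4", "3 4 6"
lines.coalesce(1).saveAsTextFile("1.txt")   // creates a directory 1.txt/ containing part-00000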




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/output-the-datas-txt-tp26350.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Ashok Kumar
No particular reason. I just wanted to know if there was another way as well.
thanks 

On Saturday, 27 February 2016, 22:12, Yin Yang  wrote:
 

 Is there a particular reason you cannot use a temporary table?
Thanks
On Sat, Feb 27, 2016 at 10:59 AM, Ashok Kumar  wrote:

Thank you sir.
Can one do this sorting without using a temporary table, if possible?
Best 

On Saturday, 27 February 2016, 18:50, Yin Yang  wrote:
 

 scala> Seq((1, "b", "test"), (2, "a", "foo")).toDF("id", "a", "b").registerTempTable("test")

scala> val df = sql("SELECT struct(id, b, a) from test order by b")
df: org.apache.spark.sql.DataFrame = [struct(id, b, a): struct<id:int,b:string ... 1 more field>]

scala> df.show
+----------------+
|struct(id, b, a)|
+----------------+
|       [2,foo,a]|
|      [1,test,b]|
+----------------+
On Sat, Feb 27, 2016 at 10:25 AM, Ashok Kumar  
wrote:

 Hello,
I would like to be able to solve this using arrays.
I have a two-dimensional array of (String, Int) with 5 entries, say arr("A",20),
arr("B",13), arr("C",18), arr("D",10), arr("E",19).
I would like to write a small piece of code to order these by the highest Int column,
so I will have arr("A",20), arr("E",19), arr("C",18), ...
What is the best way of doing this using arrays only?
Thanks



   



  

Re: Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Yin Yang
Is there a particular reason you cannot use a temporary table?

Thanks

On Sat, Feb 27, 2016 at 10:59 AM, Ashok Kumar  wrote:

> Thank you sir.
>
> Can one do this sorting without using temporary table if possible?
>
> Best
>
>
> On Saturday, 27 February 2016, 18:50, Yin Yang  wrote:
>
>
> scala>  Seq((1, "b", "test"), (2, "a", "foo")).toDF("id", "a",
> "b").registerTempTable("test")
>
> scala> val df = sql("SELECT struct(id, b, a) from test order by b")
> df: org.apache.spark.sql.DataFrame = [struct(id, b, a): struct<id:int,b:string ... 1 more field>]
>
> scala> df.show
> +----------------+
> |struct(id, b, a)|
> +----------------+
> |       [2,foo,a]|
> |      [1,test,b]|
> +----------------+
>
> On Sat, Feb 27, 2016 at 10:25 AM, Ashok Kumar <
> ashok34...@yahoo.com.invalid> wrote:
>
> Hello,
>
> I like to be able to solve this using arrays.
>
> I have two dimensional array of (String,Int) with 5  entries say
> arr("A",20), arr("B",13), arr("C", 18), arr("D",10), arr("E",19)
>
> I like to write a small code to order these in the order of highest Int
> column so I will have arr("A",20), arr("E",19), arr("C",18) 
>
> What is the best way of doing this using arrays only?
>
> Thanks
>
>
>
>
>


Re: Spark streaming not remembering previous state

2016-02-27 Thread Vinti Maheshwari
Thanks much Amit, Sebastian. It worked.

Regards,
~Vinti

On Sat, Feb 27, 2016 at 12:44 PM, Amit Assudani 
wrote:

> Your context is not being created using checkpoints, use get or create,
>
> From: Vinti Maheshwari 
> Date: Saturday, February 27, 2016 at 3:28 PM
> To: user 
> Subject: Spark streaming not remembering previous state
>
> Hi All,
>
> I wrote spark streaming program with stateful transformation.
> It seems like my spark streaming application is doing computation
> correctly with check pointing.
> But i terminate my program and i start it again, it's not reading the
> previous checkpointing data and staring from the beginning. Is it the
> expected behaviour?
>
> Do i need to change anything in my program so that it will remember the
> previous data and start computation from there?
>
> Thanks in advance.
>
> For reference my program:
>
>
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("HBaseStream")
> val sc = new SparkContext(conf)
> val ssc = new StreamingContext(sc, Seconds(5))
> val inputStream = ssc.socketTextStream("ttsv-vccp-01.juniper.net", )
> 
> ssc.checkpoint("hdfs://ttsv-lab-vmdb-01.englab.juniper.net:8020/user/spark/checkpoints_dir")
> inputStream.print(1)
> val parsedStream = inputStream
>   .map(line => {
> val splitLines = line.split(",")
> (splitLines(1), splitLines.slice(2, 
> splitLines.length).map((_.trim.toLong)))
>   })
> import breeze.linalg.{DenseVector => BDV}
> import scala.util.Try
>
> val state: DStream[(String, Array[Long])] = parsedStream.updateStateByKey(
>   (current: Seq[Array[Long]], prev: Option[Array[Long]]) =>  {
> prev.map(_ +: current).orElse(Some(current))
>   .flatMap(as => Try(as.map(BDV(_)).reduce(_ + _).toArray).toOption)
>   })
> state.checkpoint(Duration(1))
> state.foreachRDD(rdd => rdd.foreach(Blaher.blah))
>
> // Start the computation
> ssc.start()
> // Wait for the computation to terminate
> ssc.awaitTermination()
>
>   }
> }
>
>
> Regards,
>
> ~Vinti
>
>
> --
>


Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-27 Thread Abhishek Anand
Hi Ryan,

I am using mapWithState after doing reduceByKey.

I am right now using mapWithState as you suggested and triggering the count
manually.

But I am still unable to see any checkpointing taking place. In the DAG I can
see that the reduceByKey operations for the previous batches are also being
computed.


Thanks
Abhi


On Tue, Feb 23, 2016 at 2:36 AM, Shixiong(Ryan) Zhu  wrote:

> Hey Abhi,
>
> Using reducebykeyandwindow and mapWithState will trigger the bug
> in SPARK-6847. Here is a workaround to trigger checkpoint manually:
>
> JavaMapWithStateDStream<...> stateDStream =
> myPairDstream.mapWithState(StateSpec.function(mappingFunc));
> stateDStream.foreachRDD(new Function1<...>() {
>   @Override
>   public Void call(JavaRDD<...> rdd) throws Exception {
> rdd.count();
>   }
> });
> return stateDStream.stateSnapshots();
>
>
> On Mon, Feb 22, 2016 at 12:25 PM, Abhishek Anand 
> wrote:
>
>> Hi Ryan,
>>
>> Reposting the code.
>>
>> Basically my use case is something like - I am receiving the web
>> impression logs and may get the notify (listening from kafka) for those
>> impressions in the same interval (for me its 1 min) or any next interval
>> (upto 2 hours). Now, when I receive notify for a particular impression I
>> need to swap the date field in impression with the date field in notify
>> logs. The notify for an impression has the same key as impression.
>>
>> static Function3<String, Optional<MyClass>, State<MyClass>, Tuple2<String, MyClass>> mappingFunc =
>> new Function3<String, Optional<MyClass>, State<MyClass>, Tuple2<String, MyClass>>() {
>> @Override
>> public Tuple2<String, MyClass> call(String key, Optional<MyClass> one,
>> State<MyClass> state) {
>> MyClass nullObj = new MyClass();
>> nullObj.setImprLog(null);
>> nullObj.setNotifyLog(null);
>> MyClass current = one.or(nullObj);
>>
>> if(current!= null && current.getImprLog() != null &&
>> current.getMyClassType() == 1 /*this is impression*/){
>> return new Tuple2<>(key, null);
>> }
>> else if (current.getNotifyLog() != null  && current.getMyClassType() == 3
>> /*notify for the impression received*/){
>> MyClass oldState = (state.exists() ? state.get() : nullObj);
>> if(oldState!= null && oldState.getNotifyLog() != null){
>> oldState.getImprLog().setCrtd(current.getNotifyLog().getCrtd());
>>  //swappping the dates
>> return new Tuple2<>(key, oldState);
>> }
>> else{
>> return new Tuple2<>(key, null);
>> }
>> }
>> else{
>> return new Tuple2<>(key, null);
>> }
>>
>> }
>> };
>>
>>
>> return
>> myPairDstream.mapWithState(StateSpec.function(mappingFunc)).stateSnapshots();
>>
>>
>> Currently I am using reducebykeyandwindow without the inverse function
>> and I am able to get the correct data. But, issue the might arise is when I
>> have to restart my application from checkpoint and it repartitions and
>> computes the previous 120 partitions, which delays the incoming batches.
>>
>>
>> Thanks !!
>> Abhi
>>
>> On Tue, Feb 23, 2016 at 1:25 AM, Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> Hey Abhi,
>>>
>>> Could you post how you use mapWithState? By default, it should do
>>> checkpointing every 10 batches.
>>> However, there is a known issue that prevents mapWithState from
>>> checkpointing in some special cases:
>>> https://issues.apache.org/jira/browse/SPARK-6847
>>>
>>> On Mon, Feb 22, 2016 at 5:47 AM, Abhishek Anand >> > wrote:
>>>
 Any Insights on this one ?


 Thanks !!!
 Abhi

 On Mon, Feb 15, 2016 at 11:08 PM, Abhishek Anand <
 abhis.anan...@gmail.com> wrote:

> I am now trying to use mapWithState in the following way using some
> example codes. But, by looking at the DAG it does not seem to checkpoint
> the state and when restarting the application from checkpoint, it
> re-partitions all the previous batches data from kafka.
>
> static Function3<String, Optional<MyClass>, State<MyClass>, Tuple2<String, MyClass>> mappingFunc =
> new Function3<String, Optional<MyClass>, State<MyClass>, Tuple2<String, MyClass>>() {
> @Override
> public Tuple2<String, MyClass> call(String key, Optional<MyClass> one,
> State<MyClass> state) {
> MyClass nullObj = new MyClass();
> nullObj.setImprLog(null);
> nullObj.setNotifyLog(null);
> MyClass current = one.or(nullObj);
>
> if(current!= null && current.getImprLog() != null &&
> current.getMyClassType() == 1){
> return new Tuple2<>(key, null);
> }
> else if (current.getNotifyLog() != null  && current.getMyClassType()
> == 3){
> MyClass oldState = (state.exists() ? state.get() : nullObj);
> if(oldState!= null && oldState.getNotifyLog() != null){
> oldState.getImprLog().setCrtd(current.getNotifyLog().getCrtd());
> return new Tuple2<>(key, oldState);
> }
> else{
> return new Tuple2<>(key, null);
> }
> }
> else{
> return new Tuple2<>(key, null);
> }
>
> }
> };
>
>
> Please suggest if this is 

Re: Spark streaming not remembering previous state

2016-02-27 Thread Sebastian Piu
Here:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala
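
For reference, a minimal sketch of the pattern that example uses (the checkpoint path below is just a placeholder; the important part is that the whole pipeline is defined inside the factory function passed to StreamingContext.getOrCreate, so it can be rebuilt from the checkpoint on restart):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///path/to/checkpoints_dir"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("HBaseStream")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir)
  // define the input stream and the stateful transformations here, inside the
  // factory, so they are recovered together with the checkpointed state
  ssc
}

// Reuses the checkpointed context if one exists, otherwise builds a fresh one
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()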

On Sat, 27 Feb 2016, 20:42 Sebastian Piu,  wrote:

> You need to create the streaming context using an existing checkpoint for
> it to work
>
> See sample here
>
> On Sat, 27 Feb 2016, 20:28 Vinti Maheshwari,  wrote:
>
>> Hi All,
>>
>> I wrote spark streaming program with stateful transformation.
>> It seems like my spark streaming application is doing computation
>> correctly with check pointing.
>> But i terminate my program and i start it again, it's not reading the
>> previous checkpointing data and staring from the beginning. Is it the
>> expected behaviour?
>>
>> Do i need to change anything in my program so that it will remember the
>> previous data and start computation from there?
>>
>> Thanks in advance.
>>
>> For reference my program:
>>
>>
>>   def main(args: Array[String]): Unit = {
>> val conf = new SparkConf().setAppName("HBaseStream")
>> val sc = new SparkContext(conf)
>> val ssc = new StreamingContext(sc, Seconds(5))
>> val inputStream = ssc.socketTextStream("ttsv-vccp-01.juniper.net", )
>> 
>> ssc.checkpoint("hdfs://ttsv-lab-vmdb-01.englab.juniper.net:8020/user/spark/checkpoints_dir")
>> inputStream.print(1)
>> val parsedStream = inputStream
>>   .map(line => {
>> val splitLines = line.split(",")
>> (splitLines(1), splitLines.slice(2, 
>> splitLines.length).map((_.trim.toLong)))
>>   })
>> import breeze.linalg.{DenseVector => BDV}
>> import scala.util.Try
>>
>> val state: DStream[(String, Array[Long])] = 
>> parsedStream.updateStateByKey(
>>   (current: Seq[Array[Long]], prev: Option[Array[Long]]) =>  {
>> prev.map(_ +: current).orElse(Some(current))
>>   .flatMap(as => Try(as.map(BDV(_)).reduce(_ + _).toArray).toOption)
>>   })
>> state.checkpoint(Duration(1))
>> state.foreachRDD(rdd => rdd.foreach(Blaher.blah))
>>
>> // Start the computation
>> ssc.start()
>> // Wait for the computation to terminate
>> ssc.awaitTermination()
>>
>>   }
>> }
>>
>>
>> Regards,
>>
>> ~Vinti
>>
>>


Re: Spark streaming not remembering previous state

2016-02-27 Thread Sebastian Piu
You need to create the streaming context using an existing checkpoint for
it to work

See sample here

On Sat, 27 Feb 2016, 20:28 Vinti Maheshwari,  wrote:

> Hi All,
>
> I wrote spark streaming program with stateful transformation.
> It seems like my spark streaming application is doing computation
> correctly with check pointing.
> But i terminate my program and i start it again, it's not reading the
> previous checkpointing data and staring from the beginning. Is it the
> expected behaviour?
>
> Do i need to change anything in my program so that it will remember the
> previous data and start computation from there?
>
> Thanks in advance.
>
> For reference my program:
>
>
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("HBaseStream")
> val sc = new SparkContext(conf)
> val ssc = new StreamingContext(sc, Seconds(5))
> val inputStream = ssc.socketTextStream("ttsv-vccp-01.juniper.net", )
> 
> ssc.checkpoint("hdfs://ttsv-lab-vmdb-01.englab.juniper.net:8020/user/spark/checkpoints_dir")
> inputStream.print(1)
> val parsedStream = inputStream
>   .map(line => {
> val splitLines = line.split(",")
> (splitLines(1), splitLines.slice(2, 
> splitLines.length).map((_.trim.toLong)))
>   })
> import breeze.linalg.{DenseVector => BDV}
> import scala.util.Try
>
> val state: DStream[(String, Array[Long])] = parsedStream.updateStateByKey(
>   (current: Seq[Array[Long]], prev: Option[Array[Long]]) =>  {
> prev.map(_ +: current).orElse(Some(current))
>   .flatMap(as => Try(as.map(BDV(_)).reduce(_ + _).toArray).toOption)
>   })
> state.checkpoint(Duration(1))
> state.foreachRDD(rdd => rdd.foreach(Blaher.blah))
>
> // Start the computation
> ssc.start()
> // Wait for the computation to terminate
> ssc.awaitTermination()
>
>   }
> }
>
>
> Regards,
>
> ~Vinti
>
>


Re: Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Ashok Kumar
Thank you sir.
Can one do this sorting without using a temporary table, if possible?
Best 

On Saturday, 27 February 2016, 18:50, Yin Yang  wrote:
 

 scala> Seq((1, "b", "test"), (2, "a", "foo")).toDF("id", "a", "b").registerTempTable("test")

scala> val df = sql("SELECT struct(id, b, a) from test order by b")
df: org.apache.spark.sql.DataFrame = [struct(id, b, a): struct<id:int,b:string ... 1 more field>]

scala> df.show
+----------------+
|struct(id, b, a)|
+----------------+
|       [2,foo,a]|
|      [1,test,b]|
+----------------+
On Sat, Feb 27, 2016 at 10:25 AM, Ashok Kumar  
wrote:

 Hello,
I would like to be able to solve this using arrays.
I have a two-dimensional array of (String, Int) with 5 entries, say arr("A",20),
arr("B",13), arr("C",18), arr("D",10), arr("E",19).
I would like to write a small piece of code to order these by the highest Int column,
so I will have arr("A",20), arr("E",19), arr("C",18), ...
What is the best way of doing this using arrays only?
Thanks



  

Re: Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Yin Yang
scala>  Seq((1, "b", "test"), (2, "a", "foo")).toDF("id", "a",
"b").registerTempTable("test")

scala> val df = sql("SELECT struct(id, b, a) from test order by b")
df: org.apache.spark.sql.DataFrame = [struct(id, b, a): struct<id:int,b:string ... 1 more field>]

scala> df.show
+----------------+
|struct(id, b, a)|
+----------------+
|       [2,foo,a]|
|      [1,test,b]|
+----------------+

On Sat, Feb 27, 2016 at 10:25 AM, Ashok Kumar 
wrote:

> Hello,
>
> I like to be able to solve this using arrays.
>
> I have two dimensional array of (String,Int) with 5  entries say
> arr("A",20), arr("B",13), arr("C", 18), arr("D",10), arr("E",19)
>
> I like to write a small code to order these in the order of highest Int
> column so I will have arr("A",20), arr("E",19), arr("C",18) 
>
> What is the best way of doing this using arrays only?
>
> Thanks
>


Re: Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Yin Yang
Is this what you are looking for?

scala> Seq((2, "a", "test"), (2, "b", "foo")).toDF("id", "a",
"b").registerTempTable("test")

scala> val df = sql("SELECT struct(id, b, a) from test")
df: org.apache.spark.sql.DataFrame = [struct(id, b, a): struct<id:int,b:string ... 1 more field>]

scala> df.show
+----------------+
|struct(id, b, a)|
+----------------+
|      [2,test,a]|
|       [2,foo,b]|
+----------------+

You can adjust the order of the columns in struct() .

FYI

On Sat, Feb 27, 2016 at 10:25 AM, Ashok Kumar 
wrote:

> Hello,
>
> I like to be able to solve this using arrays.
>
> I have two dimensional array of (String,Int) with 5  entries say
> arr("A",20), arr("B",13), arr("C", 18), arr("D",10), arr("E",19)
>
> I like to write a small code to order these in the order of highest Int
> column so I will have arr("A",20), arr("E",19), arr("C",18) 
>
> What is the best way of doing this using arrays only?
>
> Thanks
>


Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Ashok Kumar
 Hello,
I would like to be able to solve this using arrays.
I have a two-dimensional array of (String, Int) with 5 entries, say arr("A",20),
arr("B",13), arr("C",18), arr("D",10), arr("E",19).
I would like to write a small piece of code to order these by the highest Int column,
so I will have arr("A",20), arr("E",19), arr("C",18), ...
What is the best way of doing this using arrays only?
Thanks
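
A minimal sketch of the array-only approach, assuming a plain Scala Array[(String, Int)] (sortBy with a negated key gives descending order):

val arr = Array(("A", 20), ("B", 13), ("C", 18), ("D", 10), ("E", 19))

// Sort by the Int element, highest first
val sorted = arr.sortBy(-_._2)
// sorted: Array((A,20), (E,19), (C,18), (B,13), (D,10))

// Or, keeping only the top 3 as in the example above:
val top3 = arr.sortBy(-_._2).take(3)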

RE: Get all vertexes with outDegree equals to 0 with GraphX

2016-02-27 Thread Mohammed Guller
Perhaps the documentation of the filter method would help. Here is the method
signature (copied from the API doc):

def filter[VD2, ED2](
    preprocess: (Graph[VD, ED]) => Graph[VD2, ED2],
    epred: (EdgeTriplet[VD2, ED2]) => Boolean = (x: EdgeTriplet[VD2, ED2]) => true,
    vpred: (VertexId, VD2) => Boolean = (v: VertexId, d: VD2) => true)

This method returns a subgraph of the original graph. The data in the original
graph remains unchanged. Brief description of the arguments:

VD2:          vertex type the vpred operates on
ED2:          edge type the epred operates on
preprocess:   a function to compute new vertex and edge data before filtering
epred:        edge predicate to filter on after preprocess
vpred:        vertex predicate to filter on after preprocess

In the solution below, the first function literal is the preprocess argument.
The vpred argument is passed as a named argument since we are using the default
value for epred.

HTH.

Mohammed
Author: Big Data Analytics with 
Spark

From: Guillermo Ortiz [mailto:konstt2...@gmail.com]
Sent: Saturday, February 27, 2016 6:17 AM
To: Mohammed Guller
Cc: Robin East; user
Subject: Re: Get all vertexes with outDegree equals to 0 with GraphX

Thank you, I have to think about what the code does, because I am a bit of a noob in
Scala and it's hard for me to understand.

2016-02-27 3:53 GMT+01:00 Mohammed Guller 
>:
Here is another solution (minGraph is the graph from your code. I assume that 
is your original graph):

val graphWithNoOutEdges = minGraph.filter(
  graph => graph.outerJoinVertices(graph.outDegrees) {(vId, vData, 
outDegreesOpt) => outDegreesOpt.getOrElse(0)},
  vpred = (vId: VertexId, vOutDegrees: Int) => vOutDegrees == 0
)

val verticesWithNoOutEdges = graphWithNoOutEdges.vertices

Mohammed
Author: Big Data Analytics with 
Spark

From: Guillermo Ortiz [mailto:konstt2...@gmail.com]
Sent: Friday, February 26, 2016 5:46 AM
To: Robin East
Cc: user
Subject: Re: Get all vertexes with outDegree equals to 0 with GraphX

Yes, I am not really happy with that "collect".
I was taking a look at using the subgraph method and other options and didn't figure
out anything easy or direct.

I'm going to try your idea.

2016-02-26 14:16 GMT+01:00 Robin East 
>:
Whilst I can think of other ways to do it I don’t think they would be 
conceptually or syntactically any simpler. GraphX doesn’t have the concept of 
built-in vertex properties which would make this simpler - a vertex in GraphX 
is a Vertex ID (Long) and a bunch of custom attributes that you assign. This 
means you have to find a way of ‘pushing’ the vertex degree into the graph so 
you can do comparisons (cf a join in relational databases) or as you have done 
create a list and filter against that (cf filtering against a sub-query in 
relational database).

One thing I would point out is that you probably want to avoid 
finalVertexes.collect() for a large-scale system - this will pull all the 
vertices into the driver and then push them out to the executors again as part 
of the filter operation. A better strategy for large graphs would be:

1. build a graph based on the existing graph where the vertex attribute is the 
vertex degree - the GraphX documentation shows how to do this
2. filter this “degrees” graph to just give you 0 degree vertices
3. use graph.mask, passing in the 0-degree graph, to get the original graph with
just 0-degree vertices

Just one variation on several possibilities, the key point is that everything 
is just a graph transformation until you call an action on the resulting graph
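
A hedged sketch of those three steps (graph below stands for the original graph, minGraph in the code further down; the vertex attribute types are left generic):

// 1. push the out-degree into the vertex attribute; vertices missing from
//    graph.outDegrees have no outgoing edges, so they default to 0
val degreeGraph = graph.outerJoinVertices(graph.outDegrees) {
  (vid, attr, outDegOpt) => outDegOpt.getOrElse(0)
}

// 2. keep only the vertices whose out-degree is 0
val zeroOutDegreeGraph = degreeGraph.subgraph(vpred = (vid, deg) => deg == 0)

// 3. mask the original graph with the 0-degree graph to get back the
//    original vertex attributes for just those vertices
val result = graph.mask(zeroOutDegreeGraph)
val verticesWithNoOutEdges = result.vertices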
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action




On 26 Feb 2016, at 11:59, Guillermo Ortiz 
> wrote:

I'm new to GraphX. I need to get the vertices without out edges.
I guess it's pretty easy, but I did it in a pretty complicated and inefficient way:


val vertices: RDD[(VertexId, (List[String], List[String]))] =
  sc.parallelize(Array((1L, (List("a"), List[String]())),
    (2L, (List("b"), List[String]())),
    (3L, (List("c"), List[String]())),
    (4L, (List("d"), List[String]())),
    (5L, (List("e"), List[String]())),
    (6L, (List("f"), List[String]()))))

// Create an RDD for edges
val relationships: RDD[Edge[Boolean]] =
  sc.parallelize(Array(Edge(1L, 2L, true), Edge(2L, 3L, true),
    Edge(3L, 4L, true), Edge(5L, 2L, true)))

// Build the graph (this line was implied by the code below, which uses minGraph)
val minGraph = Graph(vertices, relationships)

val out = minGraph.outDegrees.map(vertex => vertex._1)

val finalVertexes = minGraph.vertices.keys.subtract(out)

// It must be something better than this way..
val nodes = finalVertexes.collect()
val result = minGraph.vertices.filter(v => nodes.contains(v._1))

What's the good way to do this operation? It seems that it should be pretty easy.

Re: .cache() changes contents of RDD

2016-02-27 Thread Sabarish Sasidharan
This is because Hadoop writables are being reused. Just map it to some
custom type and then do further operations including cache() on it.

Regards
Sab
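
A minimal sketch of that fix, assuming the input is read as Text writables (the paths are placeholders); copying each value out to a plain String before cache() means the cached rows no longer all point at the same reused Writable instance:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val input = sc.newAPIHadoopFile("/placeholder/input", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])

// Materialize an immutable copy per record; Hadoop reuses the same
// Writable objects across records, which is what corrupts the cache.
val rdd = input.map { case (_, value) => value.toString }
rdd.cache()
rdd.saveAsTextFile("/placeholder/output")
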
On 27-Feb-2016 9:11 am, "Yan Yang"  wrote:

> Hi
>
> I am pretty new to Spark, and after experimentation on our pipelines. I
> ran into this weird issue.
>
> The Scala code is as below:
>
> val input = sc.newAPIHadoopRDD(...)
> val rdd = input.map(...)
> rdd.cache()
> rdd.saveAsTextFile(...)
>
> I found rdd to consist of 80+K identical rows. To be more precise, the
> number of rows is right, but all are identical.
>
> The truly weird part is if I remove rdd.cache(), everything works just
> fine. I have encountered this issue on a few occasions.
>
> Thanks
> Yan
>
>
>
>
>


Re: Get all vertexes with outDegree equals to 0 with GraphX

2016-02-27 Thread Guillermo Ortiz
Thank you, I have to think about what the code does, because I am a bit of a noob
in Scala and it's hard for me to understand.

2016-02-27 3:53 GMT+01:00 Mohammed Guller :

> Here is another solution (minGraph is the graph from your code. I assume
> that is your original graph):
>
>
>
> val graphWithNoOutEdges = minGraph.filter(
>
>   graph => graph.outerJoinVertices(graph.outDegrees) {(vId, vData,
> outDegreesOpt) => outDegreesOpt.getOrElse(0)},
>
>   vpred = (vId: VertexId, vOutDegrees: Int) => vOutDegrees == 0
>
> )
>
>
>
> val verticesWithNoOutEdges = graphWithNoOutEdges.vertices
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> 
>
>
>
> *From:* Guillermo Ortiz [mailto:konstt2...@gmail.com]
> *Sent:* Friday, February 26, 2016 5:46 AM
> *To:* Robin East
> *Cc:* user
> *Subject:* Re: Get all vertexes with outDegree equals to 0 with GraphX
>
>
>
> Yes, I am not really happy with that "collect".
>
> I was taking a look to use subgraph method and others options and didn't
> figure out anything easy or direct..
>
>
>
> I'm going to try your idea.
>
>
>
> 2016-02-26 14:16 GMT+01:00 Robin East :
>
> Whilst I can think of other ways to do it I don’t think they would be
> conceptually or syntactically any simpler. GraphX doesn’t have the concept
> of built-in vertex properties which would make this simpler - a vertex in
> GraphX is a Vertex ID (Long) and a bunch of custom attributes that you
> assign. This means you have to find a way of ‘pushing’ the vertex degree
> into the graph so you can do comparisons (cf a join in relational
> databases) or as you have done create a list and filter against that (cf
> filtering against a sub-query in relational database).
>
>
>
> One thing I would point out is that you probably want to avoid
> finalVerexes.collect() for a large-scale system - this will pull all the
> vertices into the driver and then push them out to the executors again as
> part of the filter operation. A better strategy for large graphs would be:
>
>
>
> 1. build a graph based on the existing graph where the vertex attribute is
> the vertex degree - the GraphX documentation shows how to do this
>
> 2. filter this “degrees” graph to just give you 0 degree vertices
>
> 3 use graph.mask passing in the 0-degree graph to get the original graph
> with just 0 degree vertices
>
>
>
> Just one variation on several possibilities, the key point is that
> everything is just a graph transformation until you call an action on the
> resulting graph
>
>
> ---
>
> Robin East
>
> *Spark GraphX in Action *Michael Malak and Robin East
>
> Manning Publications Co.
>
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
>
>
>
>
> On 26 Feb 2016, at 11:59, Guillermo Ortiz  wrote:
>
>
>
> I'm new with graphX. I need to get the vertex without out edges..
>
> I guess that it's pretty easy but I did it pretty complicated.. and
> inefficienct
>
>
>
> val vertices: RDD[(VertexId, (List[String], List[String]))] =
>   sc.parallelize(Array((1L, (List("a"), List[String]())),
>     (2L, (List("b"), List[String]())),
>     (3L, (List("c"), List[String]())),
>     (4L, (List("d"), List[String]())),
>     (5L, (List("e"), List[String]())),
>     (6L, (List("f"), List[String]()))))
>
>
> // Create an RDD for edges
> val relationships: RDD[Edge[Boolean]] =
>   sc.parallelize(Array(Edge(1L, 2L, true), Edge(2L, 3L, true),
>     Edge(3L, 4L, true), Edge(5L, 2L, true)))
>
> val out = minGraph.outDegrees.map(vertex => vertex._1)
>
> val finalVertexes = minGraph.vertices.keys.subtract(out)
>
> //It must be something better than this way..
> val nodes = finalVertexes.collect()
> val result = minGraph.vertices.filter(v => nodes.contains(v._1))
>
>
>
> What's the good way to do this operation? It seems that it should be pretty 
> easy.
>
>
>
>
>


Re: Restrictions on SQL operations on Spark temporary tables

2016-02-27 Thread 刘虓
Hi,
For now, Spark SQL does not support IN subqueries in the WHERE clause; I guess
that's the reason your query fails.
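
One hedged way to express the same count without the IN subquery on that version is to compute the max separately and join against it (a sketch only; HiveContext here is the same HiveContext instance as in the original example):

// Compute the max separately and join against it instead of using IN (...)
val maxId = HiveContext.sql("SELECT max(id) AS maxid FROM tmp")
maxId.registerTempTable("tmp_max")

HiveContext.sql(
  "SELECT count(1) FROM tmp t JOIN tmp_max m ON t.ID = m.maxid").show()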

2016-02-27 20:01 GMT+08:00 Mich Talebzadeh :

> It appeas that certain SQL on Spark temporary tables do not support Hive
> SQL even when they are using HiveContext
>
> example
>
> scala> HiveContext.sql("select count(1) from tmp  where ID in (select
> max(id) from tmp)")
> org.apache.spark.sql.AnalysisException:
> Unsupported language features in query: select count(1) from tmp  where ID
> in (select max(id) from tmp)
>
> that tmp table is from d.registerTempTable("tmp")
> The same Query in Hive on the same underlying table works fine.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


deal with datas' structure

2016-02-27 Thread Bonsen
Now I have a map:
val ji =
scala.collection.mutable.Map[String,scala.collection.mutable.ArrayBuffer[String]]()
There are many entries, like:
ji =
Map("a" -> ArrayBuffer("1","2","3"), "b" -> ArrayBuffer("1","2","3"), "c" -> ArrayBuffer("2","3"))

If "a" chooses "1", then "b" and "c" can't choose "1";
for example,
ji = Map("b" -> ArrayBuffer("2","3"), "c" -> ArrayBuffer("2","3"))
If "b" then chooses "2", "c" can't choose "2";
for example,
ji = Map("c" -> ArrayBuffer("3"))

And I need to get all the possibilities, sort them from small to big, and
output them to result.txt.
Results like:
a:1  b:2  c:3
a:1  c:2  b:3
b:1  a:2  c:3
b:1  c:2  a:3

Finally, we can get result.txt:
 a b c
 a c b
 b a c
 b c a

What should I do? I think it is difficult for me.
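
A hedged sketch of one reading of the problem: give every key a distinct value drawn from its own ArrayBuffer, enumerate every valid assignment, and write the sorted lines out. The helper below is plain Scala under that assumption; for very large inputs you would want something smarter:

import scala.collection.mutable

val ji = mutable.Map(
  "a" -> mutable.ArrayBuffer("1", "2", "3"),
  "b" -> mutable.ArrayBuffer("1", "2", "3"),
  "c" -> mutable.ArrayBuffer("2", "3"))

// Recursively assign a value to each key, never reusing a value
def assignments(keys: List[String], used: Set[String]): List[List[(String, String)]] =
  keys match {
    case Nil => List(Nil)
    case k :: rest =>
      for {
        v    <- ji(k).toList if !used.contains(v)
        tail <- assignments(rest, used + v)
      } yield (k, v) :: tail
  }

// Order the pairs inside each line by value, then sort the lines themselves
val results = assignments(ji.keys.toList.sorted, Set.empty)
  .map(_.sortBy(_._2).map { case (k, v) => s"$k:$v" }.mkString("  "))
  .sorted
// results: List(a:1  b:2  c:3, a:1  c:2  b:3, b:1  a:2  c:3, b:1  c:2  a:3)

val pw = new java.io.PrintWriter("result.txt")
results.foreach(line => pw.println(line))
pw.close()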



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/deal-with-datas-structure-tp26349.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Restrictions on SQL operations on Spark temporary tables

2016-02-27 Thread Mich Talebzadeh
It appears that certain Hive SQL constructs are not supported on Spark temporary
tables, even when using HiveContext.

example

scala> HiveContext.sql("select count(1) from tmp  where ID in (select
max(id) from tmp)")
org.apache.spark.sql.AnalysisException:
Unsupported language features in query: select count(1) from tmp  where ID
in (select max(id) from tmp)

That tmp table comes from d.registerTempTable("tmp").
The same query in Hive on the same underlying table works fine.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


Re: .cache() changes contents of RDD

2016-02-27 Thread Igor Berman
Are you using the Avro format by any chance?
There are some formats that need to be "deep"-copied before caching or
aggregating.
Try something like:
val input = sc.newAPIHadoopRDD(...)
val rdd = input.map(deepCopyTransformation).map(...)
rdd.cache()
rdd.saveAsTextFile(...)

where deepCopyTransformation is function that deep copies every object

On 26 February 2016 at 19:41, Yan Yang  wrote:

> Hi
>
> I am pretty new to Spark, and after experimentation on our pipelines. I
> ran into this weird issue.
>
> The Scala code is as below:
>
> val input = sc.newAPIHadoopRDD(...)
> val rdd = input.map(...)
> rdd.cache()
> rdd.saveAsTextFile(...)
>
> I found rdd to consist of 80+K identical rows. To be more precise, the
> number of rows is right, but all are identical.
>
> The truly weird part is if I remove rdd.cache(), everything works just
> fine. I have encountered this issue on a few occasions.
>
> Thanks
> Yan
>
>
>
>
>


Re: DirectFileOutputCommiter

2016-02-27 Thread Igor Berman
Hi Reynold,
thanks for the response
Yes, speculation mode needs some coordination.
Regarding job failure:
correct me if I am wrong - if one of the jobs fails, the client code will be sort of
"notified" by an exception or something similar, so the client can decide to
re-submit the action (job), i.e. it won't be a "silent" failure.


On 26 February 2016 at 11:50, Reynold Xin  wrote:

> It could lose data in speculation mode, or if any job fails.
>
> On Fri, Feb 26, 2016 at 3:45 AM, Igor Berman 
> wrote:
>
>> Takeshi, do you know the reason why they wanted to remove this commiter
>> in SPARK-10063?
>> the jira has no info inside
>> as far as I understand the direct committer can't be used when either of
>> two is true
>> 1. speculation mode
>> 2. append mode(ie. not creating new version of data but appending to
>> existing data)
>>
>> On 26 February 2016 at 08:24, Takeshi Yamamuro 
>> wrote:
>>
>>> Hi,
>>>
>>> Great work!
>>> What is the concrete performance gain of the committer on s3?
>>> I'd like to know.
>>>
>>> I think there is no direct committer for files because these kinds of
>>> committer has risks
>>> to loss data (See: SPARK-10063).
>>> Until this resolved, ISTM files cannot support direct commits.
>>>
>>> thanks,
>>>
>>>
>>>
>>> On Fri, Feb 26, 2016 at 8:39 AM, Teng Qiu  wrote:
>>>
 yes, should be this one
 https://gist.github.com/aarondav/c513916e72101bbe14ec

 then need to set it in spark-defaults.conf :
 https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13
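
(For reference, a sketch of the kind of setting involved; the committer class name here is taken from the fork's file path mentioned later in this thread, so treat it as an example rather than stock Spark configuration:)

# spark-defaults.conf (sketch; class from the fork's
# core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala)
spark.hadoop.mapred.output.committer.class  org.apache.spark.mapred.DirectOutputCommitter
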

 Am Freitag, 26. Februar 2016 schrieb Yin Yang :
 > The header of DirectOutputCommitter.scala says Databricks.
 > Did you get it from Databricks ?
 > On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu  wrote:
 >>
 >> interesting in this topic as well, why the DirectFileOutputCommitter
 not included?
 >> we added it in our fork,
 under 
 core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
 >> moreover, this DirectFileOutputCommitter is not working for the
 insert operations in HiveContext, since the Committer is called by hive
 (means uses dependencies in hive package)
 >> we made some hack to fix this, you can take a look:
 >>
 https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
 >>
 >> may bring some ideas to other spark contributors to find a better
 way to use s3.
 >>
 >> 2016-02-22 23:18 GMT+01:00 igor.berman :
 >>>
 >>> Hi,
 >>> Wanted to understand if anybody uses DirectFileOutputCommitter or
 alikes
 >>> especially when working with s3?
 >>> I know that there is one impl in spark distro for parquet format,
 but not
 >>> for files -  why?
 >>>
 >>> Imho, it can bring huge performance boost.
 >>> Using default FileOutputCommiter with s3 has big overhead at commit
 stage
 >>> when all parts are copied one-by-one to destination dir from
 _temporary,
 >>> which is bottleneck when number of partitions is high.
 >>>
 >>> Also, wanted to know if there are some problems when using
 >>> DirectFileOutputCommitter?
 >>> If writing one partition directly will fail in the middle is spark
 will
 >>> notice this and will fail job(say after all retries)?
 >>>
 >>> thanks in advance
 >>>
 >>>
 >>>
 >>>
 >>> --
 >>> View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/DirectFileOutputCommiter-tp26296.html
 >>> Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 >>>
 >>>
 -
 >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 >>> For additional commands, e-mail: user-h...@spark.apache.org
 >>>
 >>
 >
 >

>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>


2 tables join happens at Hive but not in spark

2016-02-27 Thread Sandeep Khurana
Hello

We have 2 tables (tab1, tab2) exposed using Hive. The data is in different
HDFS folders. We are trying to join these 2 tables on a certain single column
using a sparkR join. But in spite of the join columns having the same values, it
returns zero rows.

But when I run the same join SQL in Hive, from the Hive console, to get the
count(*), I do get millions of records meeting the join criteria.

The join columns are of 'int' type. Also, when I separately join 'tab1' (one of
the 2 tables for which the join is not working) with another, third table 'tab3',
that join works.

To debug, we selected just 1 row from tab1 in the sparkR script and also 1
row having the same value of the join column from tab2. We used the
'select' sparkR function for this. Now, our DataFrames for tab1 and tab2
have a single row each and the join columns have the same value in both, but
still, joining these 2 single-row DataFrames on that same join
column returned zero rows.


We are running the script from RStudio. It does not give any error. It runs
fine. But it gives zero join results whereas in Hive I do get many rows for the
same join. Any idea what might be the cause of this?



-- 
Architect
Infoworks.io
http://Infoworks.io


Re: Is spark.driver.maxResultSize used correctly ?

2016-02-27 Thread Reynold Xin
But sometimes you might have skew and almost all of the result data ends up in one
or a few tasks.

On Friday, February 26, 2016, Jeff Zhang  wrote:

>
> My job get this exception very easily even when I set large value of
> spark.driver.maxResultSize. After checking the spark code, I found
> spark.driver.maxResultSize is also used in Executor side to decide whether
> DirectTaskResult/InDirectTaskResult sent. This doesn't make sense to me.
> Using  spark.driver.maxResultSize / taskNum might be more proper. Because
> if  spark.driver.maxResultSize is 1g and we have 10 tasks each has 200m
> output. Then even the output of each task is less than
>  spark.driver.maxResultSize so DirectTaskResult will be sent to driver, but
> the total result size is 2g which will cause exception in driver side.
>
>
> 16/02/26 10:10:49 INFO DAGScheduler: Job 4 failed: treeAggregate at
> LogisticRegression.scala:283, took 33.796379 s
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Total size of serialized results of 1 tasks (1085.0
> MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: Clarification on RDD

2016-02-27 Thread Mich Talebzadeh
Hi,

The data (in this case the example README.md) is kept in the Hadoop Distributed
File System (HDFS) among all the datanodes in the Hadoop cluster. The metadata that
is used to get info about the storage of this file is kept in the namenode.
Your data is always stored in HDFS.

Spark is an application that can access this data and do something
useful with it. An RDD is a Spark construct (construct being used in a general sense
here). It has pointers to the partitions of that file that are distributed
throughout HDFS. In rough and ready language, it is an interface between
that file of yours and your Spark application.

One of the most important concepts of RDDs is that they are immutable. This
means that given the same RDD, we will always get the same answer. This
also allows for Spark to make some optimizations under the hood. If a job
fails, it just has to perform the operation again. There is no state
(beyond the current step it is performing) that Spark needs to keep track
of.

You can go from an RDD to a DataFrame (if the RDD is in a tabular format)
via the toDF method. In general it is recommended to use a DataFrame where
possible due to the built in query optimization.

A DataFrame is equivalent to a table in RDBMS and can also be manipulated
in similar ways to the "native" distributed collections in RDDs. Unlike
RDDs, DataFrames keep track of the schema and support various relational
operations that lead to more optimized execution. Each DataFrame object
represents a logical plan but because of their "lazy" nature no execution
occurs until the user calls a specific "output operation".
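
A minimal sketch of that path (assuming a Spark 1.x SQLContext named sqlContext, with its implicits imported):

import sqlContext.implicits._

val rdd = sc.parallelize(Seq((1, "alice"), (2, "bob")))   // an RDD of tuples
val df  = rdd.toDF("id", "name")                          // RDD -> DataFrame with named columns
df.filter(df("id") > 1).show()                            // relational operations, optimized by Catalyst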

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 27 February 2016 at 01:37, Mohammed Guller 
wrote:

> HDFS, as the name implies, is a distributed file system. A file stored on
> HDFS is already distributed. So if you create an RDD from a HDFS file, the
> created RDD just points to the file partitions on different nodes.
>
>
>
> You can read more about HDFS here.
>
>
>
>
> http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> 
>
>
>
> *From:* Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID]
> *Sent:* Friday, February 26, 2016 9:41 AM
> *To:* User
> *Subject:* Clarification on RDD
>
>
>
> Hi,
>
>
>
> Spark doco says
>
>
>
> Spark’s primary abstraction is a distributed collection of items called a
> Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop
> InputFormats (such as HDFS files) or by transforming other RDDs
>
>
>
> example:
>
>
>
> val textFile = sc.textFile("README.md")
>
>
>
>
>
> my question is when RDD is created like above from a file stored on HDFS,
> does that mean that data is distributed among all the nodes in the cluster
> or data from the md file is copied to each node of the cluster so each node
> has complete copy of data? Has the data is actually moved around or data is
> not copied over until an action like COUNT() is performed on RDD?
>
>
>
> Thanks
>
>
>