Fwd: spark graphx storage RDD memory leak

2016-04-11 Thread zhang juntao
Yes, I use version 1.6. Thanks, Ted.

> Begin forwarded message:
> 
> From: Robin East 
> Subject: Re: spark graphx storage RDD memory leak
> Date: April 12, 2016 at 2:13:10 AM GMT+8
> To: zhang juntao 
> Cc: Ted Yu , dev@spark.apache.org
> 
> this looks like https://issues.apache.org/jira/browse/SPARK-12655, fixed in 2.0
> ---
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action 
> 
>> On 11 Apr 2016, at 07:23, zhang juntao wrote:
>> 
>> Thanks, Ted, for replying.
>> Those three lines can't release the cache of the graph parameter; they only release g
>> ( graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache() ).
>> In ConnectedComponents.scala the graph parameter becomes ccGraph, which gets cached in
>> Pregel and won't be released:
>>   def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Graph[VertexId, ED] = {
>>     val ccGraph = graph.mapVertices { case (vid, _) => vid }
>>     def sendMessage(edge: EdgeTriplet[VertexId, ED]): Iterator[(VertexId, VertexId)] = {
>>       if (edge.srcAttr < edge.dstAttr) {
>>         Iterator((edge.dstId, edge.srcAttr))
>>       } else if (edge.srcAttr > edge.dstAttr) {
>>         Iterator((edge.srcId, edge.dstAttr))
>>       } else {
>>         Iterator.empty
>>       }
>>     }
>>     val initialMessage = Long.MaxValue
>>     Pregel(ccGraph, initialMessage, activeDirection = EdgeDirection.Either)(
>>       vprog = (id, attr, msg) => math.min(attr, msg),
>>       sendMsg = sendMessage,
>>       mergeMsg = (a, b) => math.min(a, b))
>>   } // end of connectedComponents
>> }
>> thanks
>> juntao
>> 
>> 
>>> Begin forwarded message:
>>> 
>>> From: Ted Yu
>>> Subject: Re: spark graphx storage RDD memory leak
>>> Date: April 11, 2016 at 1:15:23 AM GMT+8
>>> To: zhang juntao
>>> Cc: dev@spark.apache.org
>>> 
>>> I see the following code toward the end of the method:
>>> 
>>>   // Unpersist the RDDs hidden by newly-materialized RDDs
>>>   oldMessages.unpersist(blocking = false)
>>>   prevG.unpersistVertices(blocking = false)
>>>   prevG.edges.unpersist(blocking = false)
>>> 
>>> Wouldn't the above achieve the same effect?
>>> 
>>> On Sun, Apr 10, 2016 at 9:08 AM, zhang juntao wrote:
>>> hi experts,
>>> 
>>> I'm reporting a problem with Spark GraphX. I use Zeppelin to submit Spark jobs;
>>> note that the Scala environment shares the same SparkContext and SQLContext instance.
>>> I call the connected components algorithm for some business logic, and I found that
>>> every time a job finished, some graph storage RDDs weren't being released;
>>> after several runs there would be a lot of storage RDDs still present even though
>>> all the jobs had finished.
>>> 
>>> So I checked the code of connectedComponents and found what may be a problem
>>> in Pregel.scala.
>>> Once the graph parameter has been cached, there isn't any way to unpersist it,
>>> so I added the two unpersist calls on the input graph (the graph.unpersistVertices
>>> and graph.edges.unpersist lines below, which were marked in red in the original
>>> message) to solve the problem:
>>> def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
>>>     (graph: Graph[VD, ED],
>>>      initialMsg: A,
>>>      maxIterations: Int = Int.MaxValue,
>>>      activeDirection: EdgeDirection = EdgeDirection.Either)
>>>     (vprog: (VertexId, VD, A) => VD,
>>>      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
>>>      mergeMsg: (A, A) => A)
>>>   : Graph[VD, ED] =
>>> {
>>>   ...
>>>   var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
>>>   graph.unpersistVertices(blocking = false)
>>>   graph.edges.unpersist(blocking = false)
>>>   ...
>>> } // end of apply
>>> 
>>> I'm not sure if this is a bug, 
>>> and thank you for your time,
>>> juntao
>>> 
>>> 
>>> 
>> 
> 



Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-11 Thread Mark Hamstra
Yes, some organizations do lag behind the current release, sometimes by a
significant amount.  That is a bug, not a feature -- and one that increases
pressure toward fragmentation of the Spark community.  To date, that hasn't
been a significant problem, and I think that is mainly because the factors
motivating a decision not to upgrade in a timely fashion are almost
entirely internal to a lagging organization -- Spark itself has tried to
present minimal impediments to upgrading as soon as a new release is
available.

Changing the supported Java and Scala versions within the same quarter in
which the next version is scheduled for release would represent more than a
minimal impediment, and would increase fragmentation pressure to a degree
with which I am not entirely comfortable.

On Mon, Apr 11, 2016 at 12:10 PM, Daniel Siegmann <
daniel.siegm...@teamaol.com> wrote:

> On Wed, Apr 6, 2016 at 2:57 PM, Mark Hamstra 
> wrote:
>
> ... My concern is that either of those options will take more resources
>> than some Spark users will have available in the ~3 months remaining before
>> Spark 2.0.0, which will cause fragmentation into Spark 1.x and Spark 2.x
>> user communities. ...
>>
>
> It's not as if everyone is going to switch over to Spark 2.0.0 on release
> day anyway. It's not that unusual to see posts on the user list from people
> who are a version or two behind. I think a few extra months lag time will
> be OK for a major version.
>
> Besides, in my experience if you give people more time to upgrade, they're
> just going to kick the can down the road a ways and you'll eventually end
> up with the same problem. I don't see a good reason to *not* drop Java 7
> and Scala 2.10 support with Spark 2.0.0. Time to bite the bullet. If
> companies stick with Spark 1.x and find themselves missing the new features
> in the 2.x line, that will be a good motivation for them to upgrade.
>
> ~Daniel Siegmann
>


Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-11 Thread Daniel Siegmann
On Wed, Apr 6, 2016 at 2:57 PM, Mark Hamstra 
wrote:

... My concern is that either of those options will take more resources
> than some Spark users will have available in the ~3 months remaining before
> Spark 2.0.0, which will cause fragmentation into Spark 1.x and Spark 2.x
> user communities. ...
>

It's not as if everyone is going to switch over to Spark 2.0.0 on release
day anyway. It's not that unusual to see posts on the user list from people
who are a version or two behind. I think a few extra months lag time will
be OK for a major version.

Besides, in my experience if you give people more time to upgrade, they're
just going to kick the can down the road a ways and you'll eventually end
up with the same problem. I don't see a good reason to *not* drop Java 7
and Scala 2.10 support with Spark 2.0.0. Time to bite the bullet. If
companies stick with Spark 1.x and find themselves missing the new features
in the 2.x line, that will be a good motivation for them to upgrade.

~Daniel Siegmann


Re: spark graphx storage RDD memory leak

2016-04-11 Thread Robin East
this looks like https://issues.apache.org/jira/browse/SPARK-12655, fixed in 2.0
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 






> On 11 Apr 2016, at 07:23, zhang juntao  wrote:
> 
> Thanks, Ted, for replying.
> Those three lines can't release the cache of the graph parameter; they only release g
> ( graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache() ).
> In ConnectedComponents.scala the graph parameter becomes ccGraph, which gets cached in
> Pregel and won't be released:
>   def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Graph[VertexId, ED] = {
>     val ccGraph = graph.mapVertices { case (vid, _) => vid }
>     def sendMessage(edge: EdgeTriplet[VertexId, ED]): Iterator[(VertexId, VertexId)] = {
>       if (edge.srcAttr < edge.dstAttr) {
>         Iterator((edge.dstId, edge.srcAttr))
>       } else if (edge.srcAttr > edge.dstAttr) {
>         Iterator((edge.srcId, edge.dstAttr))
>       } else {
>         Iterator.empty
>       }
>     }
>     val initialMessage = Long.MaxValue
>     Pregel(ccGraph, initialMessage, activeDirection = EdgeDirection.Either)(
>       vprog = (id, attr, msg) => math.min(attr, msg),
>       sendMsg = sendMessage,
>       mergeMsg = (a, b) => math.min(a, b))
>   } // end of connectedComponents
> }
> thanks
> juntao
> 
> 
>> Begin forwarded message:
>> 
>> From: Ted Yu
>> Subject: Re: spark graphx storage RDD memory leak
>> Date: April 11, 2016 at 1:15:23 AM GMT+8
>> To: zhang juntao
>> Cc: dev@spark.apache.org
>> 
>> I see the following code toward the end of the method:
>> 
>>   // Unpersist the RDDs hidden by newly-materialized RDDs
>>   oldMessages.unpersist(blocking = false)
>>   prevG.unpersistVertices(blocking = false)
>>   prevG.edges.unpersist(blocking = false)
>> 
>> Wouldn't the above achieve the same effect?
>> 
>> On Sun, Apr 10, 2016 at 9:08 AM, zhang juntao wrote:
>> hi experts,
>> 
>> I'm reporting a problem with Spark GraphX. I use Zeppelin to submit Spark jobs;
>> note that the Scala environment shares the same SparkContext and SQLContext instance.
>> I call the connected components algorithm for some business logic, and I found that
>> every time a job finished, some graph storage RDDs weren't being released;
>> after several runs there would be a lot of storage RDDs still present even though
>> all the jobs had finished.
>> 
>> So I checked the code of connectedComponents and found what may be a problem
>> in Pregel.scala.
>> Once the graph parameter has been cached, there isn't any way to unpersist it,
>> so I added the two unpersist calls on the input graph (the graph.unpersistVertices
>> and graph.edges.unpersist lines below, which were marked in red in the original
>> message) to solve the problem:
>> def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
>>     (graph: Graph[VD, ED],
>>      initialMsg: A,
>>      maxIterations: Int = Int.MaxValue,
>>      activeDirection: EdgeDirection = EdgeDirection.Either)
>>     (vprog: (VertexId, VD, A) => VD,
>>      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
>>      mergeMsg: (A, A) => A)
>>   : Graph[VD, ED] =
>> {
>>   ...
>>   var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
>>   graph.unpersistVertices(blocking = false)
>>   graph.edges.unpersist(blocking = false)
>>   ...
>> } // end of apply
>> 
>> I'm not sure if this is a bug, 
>> and thank you for your time,
>> juntao
>> 
>> 
>> 
> 



Re: [Streaming] textFileStream has no events shown in web UI

2016-04-11 Thread Yogesh Mahajan
Yes, this has been observed in my case also. The Input Rate is 0 even in the case of
rawSocketStream.
Is there a way we can enable the Input Rate for these types of streams?

Thanks,
http://www.snappydata.io/blog 

On Wed, Mar 16, 2016 at 4:21 PM, Hao Ren  wrote:

> Just a quick question,
>
> When using textFileStream, I did not see any events via web UI.
> Actually, I am uploading files to s3 every 5 seconds,
> And the mini-batch duration is 30 seconds.
> On the web UI:
>
>  *Input Rate*
> Avg: 0.00 events/sec
>
> But the scheduling time and processing time are correct, and the output of
> the stream is also correct. Not sure why the web UI has not detected any events.
>
> Thank you.
>
> --
> Hao Ren
>
> Data Engineer @ leboncoin
>
> Paris, France
>
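For reference, a minimal sketch of the kind of job described above (the S3 path, app
name, and the count() output action are placeholders, not the poster's actual code);
with a file-based input stream like this, the output is produced correctly but the
Input Rate graph in the Streaming UI stays at 0 events/sec, as reported above.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TextFileStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("textFileStream-input-rate")
    val ssc = new StreamingContext(conf, Seconds(30))   // 30-second mini-batches

    // Files are uploaded to this prefix every ~5 seconds; the path is a placeholder.
    val lines = ssc.textFileStream("s3n://some-bucket/some-prefix/")

    // The computation and output work, yet the UI reports Avg: 0.00 events/sec.
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}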


Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-11 Thread Ted Yu
Gentle ping: spark-1.6.1-bin-hadoop2.4.tgz from S3 is still corrupt.

On Wed, Apr 6, 2016 at 12:55 PM, Josh Rosen 
wrote:

> Sure, I'll take a look. Planning to do full verification in a bit.
>
> On Wed, Apr 6, 2016 at 12:54 PM Ted Yu  wrote:
>
>> Josh:
>> Can you check spark-1.6.1-bin-hadoop2.4.tgz ?
>>
>> $ tar zxf spark-1.6.1-bin-hadoop2.4.tgz
>>
>> gzip: stdin: not in gzip format
>> tar: Child returned status 1
>> tar: Error is not recoverable: exiting now
>>
>> $ ls -l !$
>> ls -l spark-1.6.1-bin-hadoop2.4.tgz
>> -rw-r--r--. 1 hbase hadoop 323614720 Apr  5 19:25
>> spark-1.6.1-bin-hadoop2.4.tgz
>>
>> Thanks
>>
>> On Wed, Apr 6, 2016 at 12:19 PM, Josh Rosen 
>> wrote:
>>
>>> I downloaded the Spark 1.6.1 artifacts from the Apache mirror network
>>> and re-uploaded them to the spark-related-packages S3 bucket, so hopefully
>>> these packages should be fixed now.
>>>
>>> On Mon, Apr 4, 2016 at 3:37 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 Thanks, that was the command. :thumbsup:

 On Mon, Apr 4, 2016 at 6:28 PM Jakob Odersky  wrote:

> I just found out how the hash is calculated:
>
> gpg --print-md sha512 <archive>.tgz
>
> you can use that to check if the resulting output matches the contents
> of <archive>.tgz.sha
>
> On Mon, Apr 4, 2016 at 3:19 PM, Jakob Odersky 
> wrote:
> > The published hash is a SHA512.
> >
> > You can verify the integrity of the packages by running `sha512sum`
> on
> > the archive and comparing the computed hash with the published one.
> > Unfortunately however, I don't know what tool is used to generate the
> > hash and I can't reproduce the format, so I ended up manually
> > comparing the hashes.
> >
> > On Mon, Apr 4, 2016 at 2:39 PM, Nicholas Chammas
> >  wrote:
> >> An additional note: The Spark packages being served off of
> CloudFront (i.e.
> >> the “direct download” option on spark.apache.org) are also corrupt.
> >>
> >> Btw what’s the correct way to verify the SHA of a Spark package?
> I’ve tried
> >> a few commands on working packages downloaded from Apache mirrors,
> but I
> >> can’t seem to reproduce the published SHA for
> spark-1.6.1-bin-hadoop2.6.tgz.
> >>
> >>
> >> On Mon, Apr 4, 2016 at 11:45 AM Ted Yu  wrote:
> >>>
> >>> Maybe temporarily take out the artifacts on S3 before the root
> cause is
> >>> found.
> >>>
> >>> On Thu, Mar 24, 2016 at 7:25 AM, Nicholas Chammas
> >>>  wrote:
> 
>  Just checking in on this again as the builds on S3 are still
> broken. :/
> 
>  Could it have something to do with us moving release-build.sh?
> 
> 
>  On Mon, Mar 21, 2016 at 1:43 PM Nicholas Chammas
>   wrote:
> >
> > Is someone going to retry fixing these packages? It's still a
> problem.
> >
> > Also, it would be good to understand why this is happening.
> >
> > On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky 
> wrote:
> >>
> >> I just realized you're using a different download site. Sorry
> for the
> >> confusion, the link I get for a direct download of Spark 1.6.1 /
> >> Hadoop 2.6 is
> >>
> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
> >>
> >> On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas
> >>  wrote:
> >> > I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a
> >> > corrupt ZIP
> >> > file.
> >> >
> >> > Jakob, are you sure the ZIP unpacks correctly for you? Is it
> the same
> >> > Spark
> >> > 1.6.1/Hadoop 2.6 package you had a success with?
> >> >
> >> > On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersky <
> ja...@odersky.com>
> >> > wrote:
> >> >>
> >> >> I just experienced the issue, however retrying the download
> a second
> >> >> time worked. Could it be that there is some load
> balancer/cache in
> >> >> front of the archive and some nodes still serve the corrupt
> >> >> packages?
> >> >>
> >> >> On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas
> >> >>  wrote:
> >> >> > I'm seeing the same. :(
> >> >> >
> >> >> > On Fri, Mar 18, 2016 at 10:57 AM Ted Yu <
> yuzhih...@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> I tried again this morning :
> >> >> >>
> >> >> >> $ wget
> >> >> >>
> >> >> >>
> >> >> >>
> 
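On the hash-verification question raised above (what the correct way is to check a
downloaded package against the published SHA), here is a rough, tool-agnostic Scala
sketch. File paths are placeholders, and it only checks the digest, not the GPG
signature; it sidesteps the formatting differences by normalizing the published file
before comparing.

import java.nio.file.{Files, Paths}
import java.security.MessageDigest

object Sha512Check {
  // Returns true if the published .sha file contains the archive's SHA-512 digest,
  // ignoring case and the whitespace/grouping differences between `sha512sum`
  // (lowercase, ungrouped) and `gpg --print-md SHA512` (uppercase, grouped) output.
  def matches(archivePath: String, shaFilePath: String): Boolean = {
    val bytes = Files.readAllBytes(Paths.get(archivePath))   // a ~300 MB tgz fits in memory
    val digest = MessageDigest.getInstance("SHA-512").digest(bytes)
    val computedHex = digest.map("%02x".format(_)).mkString

    val published = new String(Files.readAllBytes(Paths.get(shaFilePath)), "UTF-8")
      .toLowerCase
      .replaceAll("\\s", "")   // drop spaces, line breaks, and the 4-character grouping

    published.contains(computedHex)
  }

  def main(args: Array[String]): Unit = {
    // Paths are placeholders for wherever the files were downloaded.
    println(matches("spark-1.6.1-bin-hadoop2.6.tgz", "spark-1.6.1-bin-hadoop2.6.tgz.sha"))
  }
}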

Different maxBins value for categorical and continuous features in RandomForest implementation.

2016-04-11 Thread Rahul Tanwani
Hi,

Currently the RandomForest algorithm takes a single maxBins value to decide the
number of splits. This sometimes causes training time to become very high when
there is a single categorical column with a sufficiently large number of unique
values. This single column impacts all the numeric (continuous) columns even
though such a high number of splits is not required for them.

Encoding the categorical column into features makes the data very wide; this
requires us to increase maxMemoryInMB and puts more pressure on the GC as well.

Keeping separate maxBins values for categorical and continuous features would be
useful in this regard.
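For illustration, a sketch of how this looks with the current MLlib 1.6 API (the
feature index, cardinality, and the other parameter values are made-up placeholders):
a single maxBins has to cover the largest categorical cardinality, so the continuous
features inherit the same large bin count.

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

object MaxBinsSketch {
  def train(data: RDD[LabeledPoint]): RandomForestModel = {
    // Feature 0 is a categorical column with ~10,000 distinct values (placeholder).
    val categoricalFeaturesInfo = Map(0 -> 10000)

    RandomForest.trainClassifier(
      data,
      2,                       // numClasses
      categoricalFeaturesInfo,
      100,                     // numTrees
      "auto",                  // featureSubsetStrategy
      "gini",                  // impurity
      10,                      // maxDepth
      10000)                   // maxBins: must be >= the largest categorical arity,
                               // so every continuous feature is also binned this finely
  }
}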




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Different-maxBins-value-for-categorical-and-continuous-features-in-RandomForest-implementation-tp17099.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Fwd: spark graphx storage RDD memory leak

2016-04-11 Thread zhang juntao
Thanks, Ted, for replying.
Those three lines can't release the cache of the graph parameter; they only release g
( graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache() ).
In ConnectedComponents.scala the graph parameter becomes ccGraph, which gets cached in
Pregel and won't be released:
  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Graph[VertexId, ED] = {
    val ccGraph = graph.mapVertices { case (vid, _) => vid }
    def sendMessage(edge: EdgeTriplet[VertexId, ED]): Iterator[(VertexId, VertexId)] = {
      if (edge.srcAttr < edge.dstAttr) {
        Iterator((edge.dstId, edge.srcAttr))
      } else if (edge.srcAttr > edge.dstAttr) {
        Iterator((edge.srcId, edge.dstAttr))
      } else {
        Iterator.empty
      }
    }
    val initialMessage = Long.MaxValue
    Pregel(ccGraph, initialMessage, activeDirection = EdgeDirection.Either)(
      vprog = (id, attr, msg) => math.min(attr, msg),
      sendMsg = sendMessage,
      mergeMsg = (a, b) => math.min(a, b))
  } // end of connectedComponents
}
thanks
juntao
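As a stop-gap on the caller side while this is open, here is a workaround sketch (not
the Pregel.scala change above; the helper name and the usage example are made up): in
a shared SparkContext such as the Zeppelin setup described below, snapshot the
persisted RDD ids before the job and unpersist whatever the job left cached once its
results have been extracted. SparkContext.getPersistentRDDs is the public API used.

import org.apache.spark.SparkContext

object GraphCleanup {
  // Runs `job` and afterwards unpersists every RDD that was newly cached while it ran,
  // leaving caches that existed beforehand (e.g. other notes' RDDs) untouched.
  def withGraphCleanup[T](sc: SparkContext)(job: => T): T = {
    val before = sc.getPersistentRDDs.keySet
    try {
      job
    } finally {
      sc.getPersistentRDDs
        .filterKeys(id => !before.contains(id))
        .values
        .foreach(_.unpersist(blocking = false))
    }
  }
}

// Usage sketch: collect plain results inside the block so nothing cached is needed later.
// val componentSizes = GraphCleanup.withGraphCleanup(sc) {
//   graph.connectedComponents().vertices.map(_._2).countByValue()
// }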


> Begin forwarded message:
> 
> From: Ted Yu 
> Subject: Re: spark graphx storage RDD memory leak
> Date: April 11, 2016 at 1:15:23 AM GMT+8
> To: zhang juntao 
> Cc: "dev@spark.apache.org" 
> 
> I see the following code toward the end of the method:
> 
>   // Unpersist the RDDs hidden by newly-materialized RDDs
>   oldMessages.unpersist(blocking = false)
>   prevG.unpersistVertices(blocking = false)
>   prevG.edges.unpersist(blocking = false)
> 
> Wouldn't the above achieve the same effect?
> 
> On Sun, Apr 10, 2016 at 9:08 AM, zhang juntao wrote:
> hi experts,
> 
> I'm reporting a problem with Spark GraphX. I use Zeppelin to submit Spark jobs;
> note that the Scala environment shares the same SparkContext and SQLContext instance.
> I call the connected components algorithm for some business logic, and I found that
> every time a job finished, some graph storage RDDs weren't being released;
> after several runs there would be a lot of storage RDDs still present even though
> all the jobs had finished.
> 
> So I checked the code of connectedComponents and found what may be a problem
> in Pregel.scala.
> Once the graph parameter has been cached, there isn't any way to unpersist it,
> so I added the two unpersist calls on the input graph (the graph.unpersistVertices
> and graph.edges.unpersist lines below, which were marked in red in the original
> message) to solve the problem:
> def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
>     (graph: Graph[VD, ED],
>      initialMsg: A,
>      maxIterations: Int = Int.MaxValue,
>      activeDirection: EdgeDirection = EdgeDirection.Either)
>     (vprog: (VertexId, VD, A) => VD,
>      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
>      mergeMsg: (A, A) => A)
>   : Graph[VD, ED] =
> {
>   ...
>   var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
>   graph.unpersistVertices(blocking = false)
>   graph.edges.unpersist(blocking = false)
>   ...
> } // end of apply
> 
> I'm not sure if this is a bug, 
> and thank you for your time,
> juntao
> 
> 
>