Re: Broadcast Variable Life Cycle

2016-08-30 Thread Jerry Lam
Hi Sean,

Thank you for explaining the difference between unpersist and destroy.
Does that mean unpersist keeps the broadcast variable on the driver, whereas
destroy deletes everything about the broadcast variable as if it had never
existed?

Best Regards,

Jerry


On Tue, Aug 30, 2016 at 11:58 AM, Sean Owen <so...@cloudera.com> wrote:

> Yes, although there's a difference between unpersist and destroy,
> you'll hit the same type of question either way. You do indeed have to
> reason about when you know the broadcast variable is no longer needed
> in the face of lazy evaluation, and that's hard.
>
> Sometimes it's obvious and you can take advantage of this to
> proactively free resources. You may have to consider restructuring the
> computation to allow for more resources to be freed, if this is
> important to scale.
>
> Keep in mind that things that are computed and cached may be lost and
> recomputed even after their parent RDDs were definitely already
> computed and don't seem to be needed. This is why unpersist is often
> the better thing to call because it allows for variables to be
> rebroadcast if needed in this case. Destroy permanently closes the
> broadcast.
>
> On Tue, Aug 30, 2016 at 4:43 PM, Jerry Lam <chiling...@gmail.com> wrote:
> > Hi Sean,
> >
> > Thank you for the response. The only problem is that actively managing
> > broadcast variables requires returning the broadcast variables to the
> > caller if the function that creates them does not contain any action.
> > That is, the scope that uses the broadcast variables cannot destroy them
> > in many cases. For example:
> >
> > ==
> > def performTransformation(rdd: RDD[Int]) = {
> >   val sharedMap = sc.broadcast(map)
> >   rdd.map { id =>
> >     val localMap = sharedMap.value
> >     (id, localMap(id))
> >   }
> > }
> >
> > def main = {
> >   performTransformation(rdd).toDF("id", "i").write.parquet("dummy_example")
> > }
> > ==
> >
> > In the example above, we cannot destroy the sharedMap before the
> > write.parquet is executed because the RDD is evaluated lazily. We will
> > get an exception if I put sharedMap.destroy like this:
> >
> > ==
> > def performTransformation(rdd: RDD[Int]) = {
> >   val sharedMap = sc.broadcast(map)
> >   val result = rdd.map { id =>
> >     val localMap = sharedMap.value
> >     (id, localMap(id))
> >   }
> >   sharedMap.destroy
> >   result
> > }
> > ==
> >
> > Am I missing something? Is there a better way to do this without returning
> > the broadcast variables to the main function?
> >
> > Best Regards,
> >
> > Jerry
> >
> >
> >
> > On Mon, Aug 29, 2016 at 12:11 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> Yes you want to actively unpersist() or destroy() broadcast variables
> >> when they're no longer needed. They can eventually be removed when the
> >> reference on the driver is garbage collected, but you usually would
> >> not want to rely on that.
> >>
> >> On Mon, Aug 29, 2016 at 4:30 PM, Jerry Lam <chiling...@gmail.com>
> wrote:
> >> > Hello spark developers,
> >> >
> >> > Anyone can shed some lights on the life cycle of the broadcast
> >> > variables?
> >> > Basically, if I have a broadcast variable defined in a loop and for
> each
> >> > iteration, I provide a different value.
> >> > // For example:
> >> > for (i <- 1 to 10) {
> >> >   val bc = sc.broadcast(i)
> >> >   sc.parallelize(Seq(1,2,3)).map{id => val i = bc.value; (id, i)}
> >> >     .toDF("id", "i").write.parquet("/dummy_output")
> >> > }
> >> >
> >> > Do I need to active manage the broadcast variable in this case? I know
> >> > this
> >> > example is not real but please imagine this broadcast variable can
> hold
> >> > an
> >> > array of 1M Long.
> >> >
> >> > Regards,
> >> >
> >> > Jerry
> >> >
> >> >
> >> >
> >> > On Sun, Aug 21, 2016 at 1:07 PM, Jerry Lam <chiling...@gmail.com>
> wrote:
> >> >>
> >> >> Hello spark developers,
> >> >>
> >> >> Can someone explain to me what is the lifecycle of a broadcast
> >> >> variable?
> >> >> When a broadcast variable will be garbage-collected at the
> driver-side
> >> >> and
> >> >> at the executor-side? Does a spark application need to actively
> manage
> >> >> the
> >> >> broadcast variables to ensure that it will not run in OOM?
> >> >>
> >> >> Best Regards,
> >> >>
> >> >> Jerry
> >> >
> >> >
> >
> >
>
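
To make the unpersist/destroy distinction above concrete, here is a minimal
sketch (not code from the thread; it assumes a live SparkContext named sc):

==
val lookup = Map(1 -> "a", 2 -> "b", 3 -> "c")
val bc = sc.broadcast(lookup)

val pairs = sc.parallelize(Seq(1, 2, 3)).map(id => (id, bc.value(id)))
pairs.count()   // action: the broadcast has been shipped to the executors

// unpersist() drops the cached copies on the executors but keeps the
// driver-side copy, so the value can be re-broadcast if the RDD is
// recomputed later.
bc.unpersist()
pairs.count()   // still works: the value is fetched again on demand

// destroy() removes everything, including the driver-side copy; any
// further use of bc (even through recomputation of pairs) will fail.
bc.destroy()
==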


Re: Broadcast Variable Life Cycle

2016-08-30 Thread Jerry Lam
Hi Sean,

Thank you for the response. The only problem is that actively managing
broadcast variables requires returning the broadcast variables to the caller
if the function that creates them does not contain any action. That is, the
scope that uses the broadcast variables cannot destroy them in many cases.
For example:

==
def performTransformation(rdd: RDD[Int]) = {
  val sharedMap = sc.broadcast(map)
  rdd.map { id =>
    val localMap = sharedMap.value
    (id, localMap(id))
  }
}

def main = {
  performTransformation(rdd).toDF("id", "i").write.parquet("dummy_example")
}
==

In the example above, we cannot destroy the sharedMap before the
write.parquet is executed because the RDD is evaluated lazily. We will get an
exception if I put sharedMap.destroy like this:

==
def performTransformation(rdd: RDD[Int]) = {
  val sharedMap = sc.broadcast(map)
  val result = rdd.map { id =>
    val localMap = sharedMap.value
    (id, localMap(id))
  }
  sharedMap.destroy
  result
}
==

Am I missing something? Is there a better way to do this without returning
the broadcast variables to the main function?
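
One possible pattern, shown below as a sketch only (it follows Sean's
suggestion of restructuring the computation; the names `map`, `rdd`, `sc`
and `sqlContext` are assumed to exist in scope), is to keep the broadcast,
the action, and the cleanup inside the same function so nothing has to be
handed back to main:

==
import org.apache.spark.rdd.RDD
import sqlContext.implicits._   // assumed; needed for toDF on an RDD of tuples

def transformAndSave(rdd: RDD[Int], path: String): Unit = {
  val sharedMap = sc.broadcast(map)
  val result = rdd.map { id => (id, sharedMap.value(id)) }
  result.toDF("id", "i").write.parquet(path)   // the action forces evaluation
  sharedMap.unpersist()   // safe here; destroy() only if nothing will recompute
}

transformAndSave(rdd, "dummy_example")
==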

Best Regards,

Jerry



On Mon, Aug 29, 2016 at 12:11 PM, Sean Owen <so...@cloudera.com> wrote:

> Yes you want to actively unpersist() or destroy() broadcast variables
> when they're no longer needed. They can eventually be removed when the
> reference on the driver is garbage collected, but you usually would
> not want to rely on that.
>
> On Mon, Aug 29, 2016 at 4:30 PM, Jerry Lam <chiling...@gmail.com> wrote:
> > Hello spark developers,
> >
> > Can anyone shed some light on the life cycle of broadcast variables?
> > Basically, I have a broadcast variable defined in a loop, and for each
> > iteration I provide a different value.
> > // For example:
> > for (i <- 1 to 10) {
> >   val bc = sc.broadcast(i)
> >   sc.parallelize(Seq(1,2,3)).map{id => val i = bc.value; (id, i)}
> >     .toDF("id", "i").write.parquet("/dummy_output")
> > }
> >
> > Do I need to actively manage the broadcast variable in this case? I know
> > this example is not realistic, but please imagine that the broadcast
> > variable can hold an array of 1M Longs.
> >
> > Regards,
> >
> > Jerry
> >
> >
> >
> > On Sun, Aug 21, 2016 at 1:07 PM, Jerry Lam <chiling...@gmail.com> wrote:
> >>
> >> Hello spark developers,
> >>
> >> Can someone explain to me what is the lifecycle of a broadcast variable?
> >> When a broadcast variable will be garbage-collected at the driver-side
> and
> >> at the executor-side? Does a spark application need to actively manage
> the
> >> broadcast variables to ensure that it will not run in OOM?
> >>
> >> Best Regards,
> >>
> >> Jerry
> >
> >
>


Re: Broadcast Variable Life Cycle

2016-08-29 Thread Jerry Lam
Hello spark developers,

Can anyone shed some light on the life cycle of broadcast variables?
Basically, I have a broadcast variable defined in a loop, and for each
iteration I provide a different value.
// For example:
for (i <- 1 to 10) {
  val bc = sc.broadcast(i)
  sc.parallelize(Seq(1,2,3)).map{id => val i = bc.value; (id, i)}
    .toDF("id", "i").write.parquet("/dummy_output")
}

Do I need to actively manage the broadcast variable in this case? I know this
example is not realistic, but please imagine that the broadcast variable can
hold an array of 1M Longs.
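
For a loop like this, one option is sketched below: release each
per-iteration broadcast once the write (an action) has completed, so executor
memory does not accumulate ten copies. This is only a sketch; the
per-iteration output path and the sqlContext import are assumptions.

import sqlContext.implicits._   // assumed, for toDF

for (i <- 1 to 10) {
  val bc = sc.broadcast(i)
  sc.parallelize(Seq(1, 2, 3))
    .map { id => (id, bc.value) }
    .toDF("id", "i")
    .write.parquet(s"/dummy_output_$i")   // distinct path per iteration
  // Safe here: the action above has already run. unpersist() still allows
  // a re-broadcast if Spark ever needs to recompute these partitions.
  bc.unpersist()
}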

Regards,

Jerry



On Sun, Aug 21, 2016 at 1:07 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hello spark developers,
>
> Can someone explain to me the lifecycle of a broadcast variable?
> When will a broadcast variable be garbage-collected on the driver side and
> on the executor side? Does a Spark application need to actively manage its
> broadcast variables to ensure that it will not run into OOM?
>
> Best Regards,
>
> Jerry
>


Broadcast Variable Life Cycle

2016-08-21 Thread Jerry Lam
Hello spark developers,

Can someone explain to me the lifecycle of a broadcast variable?
When will a broadcast variable be garbage-collected on the driver side and
on the executor side? Does a Spark application need to actively manage its
broadcast variables to ensure that it will not run into OOM?

Best Regards,

Jerry


Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-02 Thread Jerry Lam
Hi guys,

FYI... this wiki page (StreamSQL: https://en.wikipedia.org/wiki/StreamSQL)
has some history related to Event Stream Processing and SQL.

Hi Steve,

It is difficult to ask your customers to learn a new language when they are
not programmers :)
I don't know where/why they learned SQL-like languages. Do business schools
teach SQL??

Best Regards,

Jerry


On Wed, Mar 2, 2016 at 10:03 AM, Steve Loughran <ste...@hortonworks.com>
wrote:

>
> > On 1 Mar 2016, at 22:25, Jerry Lam <chiling...@gmail.com> wrote:
> >
> > Hi Reynold,
> >
> > You are right. It is about the audience. For instance, in many of my
> cases, the SQL style is very attractive if not mandatory for people with
> minimum programming knowledge.
>
> but they have SQL skills instead. Which is just relational set theory with
> a syntax: Structured English Query Language, from the IBM R project of the
> mid 1970s (\cite{Gray et al, An evaluation of System R}).
>
> If you look at why SQL snuck back in as a layer atop the "Post-SQL
> systems", it's
>
> (a) tooling
> (b) declarative queries can be optimised by query planners
> (c) a lot of people who do queries on existing systems can migrate to the
> new platforms. This is why FB wrote Hive; their PHP GUI teams didn't want
> to learn Java.
>
>
> > SQL has its place for communication. Last time I show someone spark
> dataframe-style, they immediately said it is too difficult to use. When I
> change it to SQL, they are suddenly happy and say how you do this. It
> sounds stupid but that's what it is for now.
> >
>
> try showing the python syntax. Python is an easier language to learn, and
> its list comprehensions are suspiciously close to applied set theory.
>
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-01 Thread Jerry Lam
Hi Reynold,

You are right. It is about the audience. For instance, in many of my cases,
the SQL style is very attractive, if not mandatory, for people with minimal
programming knowledge. SQL has its place for communication. The last time I
showed someone the Spark DataFrame style, they immediately said it was too
difficult to use. When I changed it to SQL, they were suddenly happy and
asked how to do this. It sounds stupid, but that's how it is for now.

The following example will make some banks happy (copy from the Oracle
solution):

SELECT *
FROM Ticker MATCH_RECOGNIZE (
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES  STRT.tstamp AS start_tstamp,
   LAST(DOWN.tstamp) AS bottom_tstamp,
   LAST(UP.tstamp) AS end_tstamp
 ONE ROW PER MATCH
 AFTER MATCH SKIP TO LAST UP
 PATTERN (STRT DOWN+ UP+)
 DEFINE
DOWN AS DOWN.price < PREV(DOWN.price),
UP AS UP.price > PREV(UP.price)
 ) MR
ORDER BY MR.symbol, MR.start_tstamp;

Basically, this query finds all cases where stock prices dipped to a bottom
price and then rose (the popular V-shape). It might be confusing at first,
but it is still readable for many users who know SQL. Note that the PATTERN
is interesting; it is a regular expression over the defined symbols (DOWN and
UP; STRT is not defined, so it matches any event).

Most CEP solutions have a SQL-like interface.
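
For comparison, here is a rough sketch of how just the bottom of each V could
be found with the DataFrame window functions that exist today. It is only an
approximation (a local price minimum per symbol), not the full
MATCH_RECOGNIZE semantics with arbitrary-length DOWN+/UP+ runs, and it
assumes a DataFrame named ticker with columns symbol, tstamp and price:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Order each symbol's ticks by time and compare each price with its
// neighbours; a row lower than both neighbours is the bottom of a V.
val w = Window.partitionBy("symbol").orderBy("tstamp")

val vBottoms = ticker
  .withColumn("prev_price", lag("price", 1).over(w))
  .withColumn("next_price", lead("price", 1).over(w))
  .where(col("price") < col("prev_price") && col("price") < col("next_price"))
  .orderBy("symbol", "tstamp")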

Best Regards,

Jerry



On Tue, Mar 1, 2016 at 4:44 PM, Reynold Xin <r...@databricks.com> wrote:

> There are definitely pros and cons for Scala vs SQL-style CEP. Scala might
> be more powerful, but the target audience is very different.
>
> How much usage is there for a CEP style SQL syntax in practice? I've never
> seen it coming up so far.
>
>
>
> On Tue, Mar 1, 2016 at 9:35 AM, Alex Kozlov <ale...@gmail.com> wrote:
>
>> Looked at the paper: while we can argue on the performance side, I think
>> semantically the Scala pattern matching is much more expressive.  The time
>> will decide.
>>
>> On Tue, Mar 1, 2016 at 9:07 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Alex,
>>>
>>> We went through this path already :) This is the reason we try other
>>> approaches. The recursion makes it very inefficient for some cases.
>>> For details, this paper describes it very well:
>>> https://people.cs.umass.edu/%7Eyanlei/publications/sase-sigmod08.pdf
>>> which is the same paper references in Flink ticket.
>>>
>>> Please let me know if I overlook something. Thank you for sharing this!
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>> On Tue, Mar 1, 2016 at 11:58 AM, Alex Kozlov <ale...@gmail.com> wrote:
>>>
>>>> For the purpose of full disclosure, I think Scala offers a much more
>>>> efficient pattern matching paradigm.  Using nPath is like using assembler
>>>> to program distributed systems.  Cannot tell much here today, but the
>>>> pattern would look like:
>>>>
> >>>> def matchSessions(h: Seq[Session[PageView]], id: String,
> >>>>                   p: Seq[PageView]): Seq[Session[PageView]] = {
> >>>>   p match {
> >>>>     case Nil => Nil
> >>>>     case PageView(ts1, "company.com>homepage") ::
> >>>>          PageView(ts2, "company.com>plus>products landing") :: tail
> >>>>          if ts2 > ts1 + 600 =>
> >>>>       matchSessions(h, id, tail).+:(new Session(id, p))
> >>>>     case _ => matchSessions(h, id, p.tail)
> >>>>   }
> >>>> }
>>>>
>>>> Look for Scala case statements with guards and upcoming book releases.
>>>>
>>>> http://docs.scala-lang.org/tutorials/tour/pattern-matching
>>>>
>>>> https://www.safaribooksonline.com/library/view/scala-cookbook/9781449340292/ch03s14.html
>>>>
>>>> On Tue, Mar 1, 2016 at 8:34 AM, Henri Dubois-Ferriere <
>>>> henr...@gmail.com> wrote:
>>>>
>>>>> fwiw Apache Flink just added CEP. Queries are constructed
>>>>> programmatically rather than in SQL, but the underlying functionality is
>>>>> similar.
>>>>>
>>>>> https://issues.apache.org/jira/browse/FLINK-3215
>>>>>
>>>>> On 1 March 2016 at 08:19, Jerry Lam <chiling...@gmail.com> wrote:
>>>>>
>>>>>> Hi Herman,
>>>>>>
>>>>>> Thank you for your reply!
>>>>>>

Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-01 Thread Jerry Lam
Hi Henri,

Finally, there is a good reason for me to use Flink! Thanks for sharing
this information. This is exactly the solution I'm looking for, especially
since the ticket references a paper I was reading a week ago. It would be
nice if Flink added SQL support, because that would give business analysts
(traders as well) a way to express it.

Best Regards,

Jerry

On Tue, Mar 1, 2016 at 11:34 AM, Henri Dubois-Ferriere <henr...@gmail.com>
wrote:

> fwiw Apache Flink just added CEP. Queries are constructed programmatically
> rather than in SQL, but the underlying functionality is similar.
>
> https://issues.apache.org/jira/browse/FLINK-3215
>
> On 1 March 2016 at 08:19, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Herman,
>>
>> Thank you for your reply!
>> This functionality usually finds its place in financial services which
>> use CEP (complex event processing) for correlation and pattern matching.
>> Many commercial products have this including Oracle and Teradata Aster Data
>> MR Analytics. I do agree the syntax a bit awkward but after you understand
>> it, it is actually very compact for expressing something that is very
>> complex. Esper has this feature partially implemented (
>> http://www.espertech.com/esper/release-5.1.0/esper-reference/html/match-recognize.html
>> ).
>>
>> I found the Teradata Analytics documentation best to describe the usage
>> of it. For example (note npath is similar to match_recognize):
>>
>> SELECT last_pageid, MAX( count_page80 )
>>  FROM nPath(
>>  ON ( SELECT * FROM clicks WHERE category >= 0 )
>>  PARTITION BY sessionid
>>  ORDER BY ts
>>  PATTERN ( 'A.(B|C)*' )
>>  MODE ( OVERLAPPING )
>>  SYMBOLS ( pageid = 50 AS A,
>>pageid = 80 AS B,
>>pageid <> 80 AND category IN (9,10) AS C )
>>  RESULT ( LAST ( pageid OF ANY ( A,B,C ) ) AS last_pageid,
>>   COUNT ( * OF B ) AS count_page80,
>>   COUNT ( * OF ANY ( A,B,C ) ) AS count_any )
>>  )
>>  WHERE count_any >= 5
>>  GROUP BY last_pageid
>>  ORDER BY MAX( count_page80 )
>>
>> The above means:
>> Find user click-paths starting at pageid 50 and passing exclusively
>> through either pageid 80 or pages in category 9 or category 10. Find the
>> pageid of the last page in the path and count the number of times page 80
>> was visited. Report the maximum count for each last page, and sort the
>> output by the latter. Restrict to paths containing at least 5 pages. Ignore
>> pages in the sequence with category < 0.
>>
>> If this query is written in pure SQL (if possible at all), it requires
>> several self-joins. The interesting thing about this feature is that it
>> integrates SQL+Streaming+ML in one (perhaps potentially graph too).
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Tue, Mar 1, 2016 at 9:39 AM, Herman van Hövell tot Westerflier <
>> hvanhov...@questtec.nl> wrote:
>>
>>> Hi Jerry,
>>>
>>> This is not on any roadmap. I (shortly) browsed through this; and this
>>> looks like some sort of a window function with very awkward syntax. I think
>>> spark provided better constructs for this using dataframes/datasets/nested
>>> data...
>>>
>>> Feel free to submit a PR.
>>>
>>> Kind regards,
>>>
>>> Herman van Hövell
>>>
>>> 2016-03-01 15:16 GMT+01:00 Jerry Lam <chiling...@gmail.com>:
>>>
>>>> Hi Spark developers,
>>>>
>>>> Will you consider to add support for implementing "Pattern matching in
>>>> sequences of rows"? More specifically, I'm referring to this:
>>>> http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf
>>>>
>>>> This is a very cool/useful feature to pattern matching over live
>>>> stream/archived data. It is sorted of related to machine learning because
>>>> this is usually used in clickstream analysis or path analysis. Also it is
>>>> related to streaming because of the nature of the processing (time series
>>>> data mostly). It is SQL because there is a good way to express and optimize
>>>> the query.
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry
>>>>
>>>
>>>
>>
>


Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-01 Thread Jerry Lam
Hi Herman,

Thank you for your reply!
This functionality usually finds its place in financial services, which use
CEP (complex event processing) for correlation and pattern matching. Many
commercial products have this, including Oracle and Teradata Aster Data MR
Analytics. I do agree the syntax is a bit awkward, but after you understand
it, it is actually very compact for expressing something that is very
complex. Esper has this feature partially implemented (
http://www.espertech.com/esper/release-5.1.0/esper-reference/html/match-recognize.html
).

I found the Teradata Analytics documentation best to describe the usage of
it. For example (note npath is similar to match_recognize):

SELECT last_pageid, MAX( count_page80 )
 FROM nPath(
 ON ( SELECT * FROM clicks WHERE category >= 0 )
 PARTITION BY sessionid
 ORDER BY ts
 PATTERN ( 'A.(B|C)*' )
 MODE ( OVERLAPPING )
 SYMBOLS ( pageid = 50 AS A,
   pageid = 80 AS B,
   pageid <> 80 AND category IN (9,10) AS C )
 RESULT ( LAST ( pageid OF ANY ( A,B,C ) ) AS last_pageid,
  COUNT ( * OF B ) AS count_page80,
  COUNT ( * OF ANY ( A,B,C ) ) AS count_any )
 )
 WHERE count_any >= 5
 GROUP BY last_pageid
 ORDER BY MAX( count_page80 )

The above means:
Find user click-paths starting at pageid 50 and passing exclusively through
either pageid 80 or pages in category 9 or category 10. Find the pageid of
the last page in the path and count the number of times page 80 was
visited. Report the maximum count for each last page, and sort the output
by the latter. Restrict to paths containing at least 5 pages. Ignore pages
in the sequence with category < 0.

If this query is written in pure SQL (if possible at all), it requires
several self-joins. The interesting thing about this feature is that it
integrates SQL+Streaming+ML in one (perhaps potentially graph too).
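
The core idea can also be sketched directly on an RDD: classify each event
into a one-character symbol, build the ordered per-session symbol string, and
run an ordinary regex over it. This is only a conceptual sketch of what
nPath/MATCH_RECOGNIZE do (it ignores OVERLAPPING mode and the RESULT
aggregates), and the Click case class and the clicks RDD are assumptions:

case class Click(sessionid: String, ts: Long, pageid: Int, category: Int)

// Map each click to the symbol used in the nPath example above; anything
// else becomes '-', which only the '.' in the pattern can match.
def symbol(c: Click): String =
  if (c.pageid == 50) "A"
  else if (c.pageid == 80) "B"
  else if (Set(9, 10).contains(c.category)) "C"   // pageid != 80 is implied
  else "-"

val pattern = "A.(B|C)*".r   // same symbols as the PATTERN clause

// clicks: RDD[Click] is assumed to exist; category < 0 rows are ignored.
val matchingSessions = clicks
  .filter(_.category >= 0)
  .groupBy(_.sessionid)
  .mapValues(evts => evts.toSeq.sortBy(_.ts).map(symbol).mkString)
  .filter { case (_, path) => pattern.findFirstIn(path).isDefined }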

Best Regards,

Jerry


On Tue, Mar 1, 2016 at 9:39 AM, Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> Hi Jerry,
>
> This is not on any roadmap. I (shortly) browsed through this; and this
> looks like some sort of a window function with very awkward syntax. I think
> spark provided better constructs for this using dataframes/datasets/nested
> data...
>
> Feel free to submit a PR.
>
> Kind regards,
>
> Herman van Hövell
>
> 2016-03-01 15:16 GMT+01:00 Jerry Lam <chiling...@gmail.com>:
>
>> Hi Spark developers,
>>
>> Will you consider to add support for implementing "Pattern matching in
>> sequences of rows"? More specifically, I'm referring to this:
>> http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf
>>
>> This is a very cool/useful feature to pattern matching over live
>> stream/archived data. It is sorted of related to machine learning because
>> this is usually used in clickstream analysis or path analysis. Also it is
>> related to streaming because of the nature of the processing (time series
>> data mostly). It is SQL because there is a good way to express and optimize
>> the query.
>>
>> Best Regards,
>>
>> Jerry
>>
>
>


SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-01 Thread Jerry Lam
Hi Spark developers,

Would you consider adding support for implementing "Pattern matching in
sequences of rows"? More specifically, I'm referring to this:
http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf

This is a very cool/useful feature for pattern matching over live-stream or
archived data. It is sort of related to machine learning because it is
usually used in clickstream analysis or path analysis. Also, it is related
to streaming because of the nature of the processing (time series data
mostly). It is SQL because there is a good way to express and optimize the
query.

Best Regards,

Jerry


Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

2015-12-23 Thread Jerry Lam
Hi Kostas,

Thank you for the references to the two tickets. They help me understand
why I have had some weird experiences lately.

Best Regards,

Jerry

On Wed, Dec 23, 2015 at 2:32 AM, kostas papageorgopoylos <p02...@gmail.com>
wrote:

> Hi
>
> Fyi
> The following 2 tickets are currently blocking (for releases up to 1.5.2)
> the pattern of starting and stopping a SparkContext inside the same driver
> program
>
> https://issues.apache.org/jira/browse/SPARK-11700 ->memory leak in
> SqlContext
> https://issues.apache.org/jira/browse/SPARK-11739
>
> In an application we have built we initially wanted to use the same
> pattern (start-stop-start.etc)
> in order to have a better usage of the spark cluster resources.
>
> I believe that the fixes in the above tickets will allow to safely stop and
> restart the sparkContext in the driver program in release 1.6.0
>
> Kind Regards
>
>
>
> 2015-12-22 21:00 GMT+02:00 Sean Owen <so...@cloudera.com>:
>
>> I think the original idea is that the life of the driver is the life
>> of the SparkContext: the context is stopped when the driver finishes.
>> Or: if for some reason the "context" dies or there's an unrecoverable
>> error, that's it for the driver.
>>
>> (There's nothing wrong with stop(), right? you have to call that when
>> the driver ends to shut down Spark cleanly. It's the re-starting
>> another context that's at issue.)
>>
>> This makes most sense in the context of a resource manager, which can
>> conceivably restart a driver if you like, but can't reach into your
>> program.
>>
>> That's probably still the best way to think of it. Still it would be
>> nice if SparkContext were friendlier to a restart just as a matter of
>> design. AFAIK it is; not sure about SQLContext though. If it's not a
>> priority it's just because this isn't a usual usage pattern, which
>> doesn't mean it's crazy, just not the primary pattern.
>>
>> On Tue, Dec 22, 2015 at 5:57 PM, Jerry Lam <chiling...@gmail.com> wrote:
>> > Hi Sean,
>> >
>> > What if the spark context stops for involuntary reasons (misbehavior of
>> some connections) then we need to programmatically handle the failures by
>> recreating spark context. Is there something I don't understand/know about
>> the assumptions on how to use spark context? I tend to think of it as a
>> resource manager/scheduler for spark jobs. Are you guys planning to
>> deprecate the stop method from spark?
>> >
>> > Best Regards,
>> >
>> > Jerry
>> >
>> > Sent from my iPhone
>> >
>> >> On 22 Dec, 2015, at 3:57 am, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> Although in many cases it does work to stop and then start a second
>> >> context, it wasn't how Spark was originally designed, and I still see
>> >> gotchas. I'd avoid it. I don't think you should have to release some
>> >> resources; just keep the same context alive.
>> >>
>> >>> On Tue, Dec 22, 2015 at 5:13 AM, Jerry Lam <chiling...@gmail.com>
>> wrote:
>> >>> Hi Zhan,
>> >>>
>> >>> I'm illustrating the issue via a simple example. However it is not
>> difficult
>> >>> to imagine use cases that need this behaviour. For example, you want
>> to
>> >>> release all resources of spark when it does not use for longer than
>> an hour
>> >>> in  a job server like web services. Unless you can prevent people from
>> >>> stopping spark context, then it is reasonable to assume that people
>> can stop
>> >>> it and start it again in  later time.
>> >>>
>> >>> Best Regards,
>> >>>
>> >>> Jerry
>> >>>
>> >>>
>> >>>> On Mon, Dec 21, 2015 at 7:20 PM, Zhan Zhang <zzh...@hortonworks.com>
>> wrote:
>> >>>>
>> >>>> This looks to me is a very unusual use case. You stop the
>> SparkContext,
>> >>>> and start another one. I don’t think it is well supported. As the
>> >>>> SparkContext is stopped, all the resources are supposed to be
>> released.
>> >>>>
>> >>>> Is there any mandatory reason you have stop and restart another
>> >>>> SparkContext.
>> >>>>
>> >>>> Thanks.
>> >>>>
>> >>>> Zhan Zhang
>> >>>>
>> >>>> Note that when sc is stopped, all resources are

Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

2015-12-22 Thread Jerry Lam
Hi Sean,

What if the Spark context stops for involuntary reasons (misbehavior of some
connections)? Then we need to programmatically handle the failures by
recreating the Spark context. Is there something I don't understand/know
about the assumptions on how to use a Spark context? I tend to think of it as
a resource manager/scheduler for Spark jobs. Are you planning to deprecate
the stop method from Spark?

Best Regards,

Jerry 

Sent from my iPhone

> On 22 Dec, 2015, at 3:57 am, Sean Owen <so...@cloudera.com> wrote:
> 
> Although in many cases it does work to stop and then start a second
> context, it wasn't how Spark was originally designed, and I still see
> gotchas. I'd avoid it. I don't think you should have to release some
> resources; just keep the same context alive.
> 
>> On Tue, Dec 22, 2015 at 5:13 AM, Jerry Lam <chiling...@gmail.com> wrote:
>> Hi Zhan,
>> 
>> I'm illustrating the issue via a simple example. However it is not difficult
>> to imagine use cases that need this behaviour. For example, you want to
>> release all resources of spark when it does not use for longer than an hour
>> in  a job server like web services. Unless you can prevent people from
>> stopping spark context, then it is reasonable to assume that people can stop
>> it and start it again in  later time.
>> 
>> Best Regards,
>> 
>> Jerry
>> 
>> 
>>> On Mon, Dec 21, 2015 at 7:20 PM, Zhan Zhang <zzh...@hortonworks.com> wrote:
>>> 
>>> This looks to me is a very unusual use case. You stop the SparkContext,
>>> and start another one. I don’t think it is well supported. As the
>>> SparkContext is stopped, all the resources are supposed to be released.
>>> 
>>> Is there any mandatory reason you have stop and restart another
>>> SparkContext.
>>> 
>>> Thanks.
>>> 
>>> Zhan Zhang
>>> 
>>> Note that when sc is stopped, all resources are released (for example in
>>> yarn
>>>> On Dec 20, 2015, at 2:59 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>>> 
>>>> Hi Spark developers,
>>>> 
>>>> I found that SQLContext.getOrCreate(sc: SparkContext) does not behave
>>>> correctly when a different spark context is provided.
>>>> 
>>>> ```
>>>> val sc = new SparkContext
>>>> val sqlContext =SQLContext.getOrCreate(sc)
>>>> sc.stop
>>>> ...
>>>> 
>>>> val sc2 = new SparkContext
>>>> val sqlContext2 = SQLContext.getOrCreate(sc2)
>>>> sc2.stop
>>>> ```
>>>> 
>>>> The sqlContext2 will reference sc instead of sc2 and therefore, the
>>>> program will not work because sc has been stopped.
>>>> 
>>>> Best Regards,
>>>> 
>>>> Jerry
>> 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

2015-12-21 Thread Jerry Lam
Hi Zhan,

I'm illustrating the issue via a simple example. However, it is not
difficult to imagine use cases that need this behaviour. For example, you
may want to release all of Spark's resources when it has not been used for
longer than an hour in a job server like a web service. Unless you can
prevent people from stopping the Spark context, it is reasonable to assume
that people can stop it and start it again at a later time.

Best Regards,

Jerry


On Mon, Dec 21, 2015 at 7:20 PM, Zhan Zhang <zzh...@hortonworks.com> wrote:

> This looks to me is a very unusual use case. You stop the SparkContext,
> and start another one. I don’t think it is well supported. As the
> SparkContext is stopped, all the resources are supposed to be released.
>
> Is there any mandatory reason you have stop and restart another
> SparkContext.
>
> Thanks.
>
> Zhan Zhang
>
> Note that when sc is stopped, all resources are released (for example in
> yarn
> On Dec 20, 2015, at 2:59 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
> > Hi Spark developers,
> >
> > I found that SQLContext.getOrCreate(sc: SparkContext) does not behave
> correctly when a different spark context is provided.
> >
> > ```
> > val sc = new SparkContext
> > val sqlContext =SQLContext.getOrCreate(sc)
> > sc.stop
> > ...
> >
> > val sc2 = new SparkContext
> > val sqlContext2 = SQLContext.getOrCreate(sc2)
> > sc2.stop
> > ```
> >
> > The sqlContext2 will reference sc instead of sc2 and therefore, the
> program will not work because sc has been stopped.
> >
> > Best Regards,
> >
> > Jerry
>
>


[Spark SQL] SQLContext getOrCreate incorrect behaviour

2015-12-20 Thread Jerry Lam
Hi Spark developers,

I found that SQLContext.getOrCreate(sc: SparkContext) does not behave
correctly when a different spark context is provided.

```
val sc = new SparkContext
val sqlContext =SQLContext.getOrCreate(sc)
sc.stop
...

val sc2 = new SparkContext
val sqlContext2 = SQLContext.getOrCreate(sc2)
sc2.stop
```

The sqlContext2 will reference sc instead of sc2 and therefore, the program
will not work because sc has been stopped.
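
A sketch of one workaround (Spark 1.x; the app names are placeholders):
construct the second SQLContext explicitly instead of going through
getOrCreate, so it is bound to the new SparkContext:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("first"))
val sqlContext = SQLContext.getOrCreate(sc)
sc.stop()

val sc2 = new SparkContext(new SparkConf().setAppName("second"))
val sqlContext2 = new SQLContext(sc2)   // not getOrCreate: avoids the stale sc
// ... use sqlContext2 ...
sc2.stop()
```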

Best Regards,

Jerry


Re: Removing the Mesos fine-grained mode

2015-11-23 Thread Jerry Lam
Hi Andrew,

Thank you for confirming this. I'm asking because I used fine-grained mode
before and it was a headache because of the memory issue. Therefore, I
switched to YARN with dynamic allocation. I was considering switching back to
Mesos with coarse-grained mode + dynamic allocation, but from what you
explained to me, I still cannot have more than 1 executor per slave. This
sounds like a deal breaker for me because if I have a slave with 100GB of RAM
and a slave with 30GB, I cannot fully utilize the 100GB instance if I specify
spark.executor.memory = 20GB. The two slaves will each consume 20GB in this
case, even though there is 80GB left on the bigger machine. If I specify 90GB
for spark.executor.memory, the only active slave is the one with 100GB, so
the slave with 30GB will be idle.

Do you know the link to the JIRA where I can receive updates on the feature
you mention? We intend to use Mesos, but it has proven difficult with our
tight budget constraints.

Best Regards,

Jerry


> On Nov 23, 2015, at 2:41 PM, Andrew Or <and...@databricks.com> wrote:
> 
> @Jerry Lam
> 
> Can someone confirm if it is true that dynamic allocation on mesos "is 
> designed to run one executor per slave with the configured amount of 
> resources." I copied this sentence from the documentation. Does this mean 
> there is at most 1 executor per node? Therefore,  if you have a big machine, 
> you need to allocate a fat executor on this machine in order to fully utilize 
> it?
> 
> Mesos inherently does not support multiple executors per slave currently. 
> This is actually not related to dynamic allocation. There is, however, an 
> outstanding patch to add support for multiple executors per slave. When that 
> feature is merged, it will work well with dynamic allocation.
>  
> 
> 2015-11-23 9:27 GMT-08:00 Adam McElwee <a...@mcelwee.me 
> <mailto:a...@mcelwee.me>>:
> 
> 
> On Mon, Nov 23, 2015 at 7:36 AM, Iulian Dragoș <iulian.dra...@typesafe.com 
> <mailto:iulian.dra...@typesafe.com>> wrote:
> 
> 
> On Sat, Nov 21, 2015 at 3:37 AM, Adam McElwee <a...@mcelwee.me 
> <mailto:a...@mcelwee.me>> wrote:
> I've used fine-grained mode on our mesos spark clusters until this week, 
> mostly because it was the default. I started trying coarse-grained because of 
> the recent chatter on the mailing list about wanting to move the mesos 
> execution path to coarse-grained only. The odd things is, coarse-grained vs 
> fine-grained seems to yield drastic cluster utilization metrics for any of 
> our jobs that I've tried out this week.
> 
> If this is best as a new thread, please let me know, and I'll try not to 
> derail this conversation. Otherwise, details below:
> 
> I think it's ok to discuss it here.
>  
> We monitor our spark clusters with ganglia, and historically, we maintain at 
> least 90% cpu utilization across the cluster. Making a single configuration 
> change to use coarse-grained execution instead of fine-grained consistently 
> yields a cpu utilization pattern that starts around 90% at the beginning of 
> the job, and then it slowly decreases over the next 1-1.5 hours to level out 
> around 65% cpu utilization on the cluster. Does anyone have a clue why I'd be 
> seeing such a negative effect of switching to coarse-grained mode? GC 
> activity is comparable in both cases. I've tried 1.5.2, as well as the 1.6.0 
> preview tag that's on github.
> 
> I'm not very familiar with Ganglia, and how it computes utilization. But one 
> thing comes to mind: did you enable dynamic allocation 
> <https://spark.apache.org/docs/latest/running-on-mesos.html#dynamic-resource-allocation-with-mesos>
>  on coarse-grained mode?
> 
> Dynamic allocation is definitely not enabled. The only delta between runs is 
> adding --conf "spark.mesos.coarse=true" the job submission. Ganglia is just 
> pulling stats from the procfs, and I've never seen it report bad results. If 
> I sample any of the 100-200 nodes in the cluster, dstat reflects the same 
> average cpu that I'm seeing reflected in ganglia.
> 
> iulian
> 
> 



Re: Removing the Mesos fine-grained mode

2015-11-23 Thread Jerry Lam
@Andrew Or

I assume you are referring to this ticket [SPARK-5095]: 
https://issues.apache.org/jira/browse/SPARK-5095 
<https://issues.apache.org/jira/browse/SPARK-5095> 
Thank you!

Best Regards,

Jerry

> On Nov 23, 2015, at 2:41 PM, Andrew Or <and...@databricks.com> wrote:
> 
> @Jerry Lam
> 
> Can someone confirm if it is true that dynamic allocation on mesos "is 
> designed to run one executor per slave with the configured amount of 
> resources." I copied this sentence from the documentation. Does this mean 
> there is at most 1 executor per node? Therefore,  if you have a big machine, 
> you need to allocate a fat executor on this machine in order to fully utilize 
> it?
> 
> Mesos inherently does not support multiple executors per slave currently. 
> This is actually not related to dynamic allocation. There is, however, an 
> outstanding patch to add support for multiple executors per slave. When that 
> feature is merged, it will work well with dynamic allocation.
>  
> 
> 2015-11-23 9:27 GMT-08:00 Adam McElwee <a...@mcelwee.me 
> <mailto:a...@mcelwee.me>>:
> 
> 
> On Mon, Nov 23, 2015 at 7:36 AM, Iulian Dragoș <iulian.dra...@typesafe.com 
> <mailto:iulian.dra...@typesafe.com>> wrote:
> 
> 
> On Sat, Nov 21, 2015 at 3:37 AM, Adam McElwee <a...@mcelwee.me 
> <mailto:a...@mcelwee.me>> wrote:
> I've used fine-grained mode on our mesos spark clusters until this week, 
> mostly because it was the default. I started trying coarse-grained because of 
> the recent chatter on the mailing list about wanting to move the mesos 
> execution path to coarse-grained only. The odd things is, coarse-grained vs 
> fine-grained seems to yield drastic cluster utilization metrics for any of 
> our jobs that I've tried out this week.
> 
> If this is best as a new thread, please let me know, and I'll try not to 
> derail this conversation. Otherwise, details below:
> 
> I think it's ok to discuss it here.
>  
> We monitor our spark clusters with ganglia, and historically, we maintain at 
> least 90% cpu utilization across the cluster. Making a single configuration 
> change to use coarse-grained execution instead of fine-grained consistently 
> yields a cpu utilization pattern that starts around 90% at the beginning of 
> the job, and then it slowly decreases over the next 1-1.5 hours to level out 
> around 65% cpu utilization on the cluster. Does anyone have a clue why I'd be 
> seeing such a negative effect of switching to coarse-grained mode? GC 
> activity is comparable in both cases. I've tried 1.5.2, as well as the 1.6.0 
> preview tag that's on github.
> 
> I'm not very familiar with Ganglia, and how it computes utilization. But one 
> thing comes to mind: did you enable dynamic allocation 
> <https://spark.apache.org/docs/latest/running-on-mesos.html#dynamic-resource-allocation-with-mesos>
>  on coarse-grained mode?
> 
> Dynamic allocation is definitely not enabled. The only delta between runs is 
> adding --conf "spark.mesos.coarse=true" the job submission. Ganglia is just 
> pulling stats from the procfs, and I've never seen it report bad results. If 
> I sample any of the 100-200 nodes in the cluster, dstat reflects the same 
> average cpu that I'm seeing reflected in ganglia.
> 
> iulian
> 
> 



Re: Removing the Mesos fine-grained mode

2015-11-23 Thread Jerry Lam
Hi guys,

Can someone confirm whether it is true that dynamic allocation on Mesos "is
designed to run one executor per slave with the configured amount of
resources"? I copied this sentence from the documentation. Does this mean
there is at most 1 executor per node? Therefore, if you have a big machine,
do you need to allocate a fat executor on this machine in order to fully
utilize it?

Best Regards,

Sent from my iPhone

> On 23 Nov, 2015, at 8:36 am, Iulian Dragoș  wrote:
> 
> 
> 
>> On Sat, Nov 21, 2015 at 3:37 AM, Adam McElwee  wrote:
>> I've used fine-grained mode on our mesos spark clusters until this week, 
>> mostly because it was the default. I started trying coarse-grained because 
>> of the recent chatter on the mailing list about wanting to move the mesos 
>> execution path to coarse-grained only. The odd things is, coarse-grained vs 
>> fine-grained seems to yield drastic cluster utilization metrics for any of 
>> our jobs that I've tried out this week.
>> 
>> If this is best as a new thread, please let me know, and I'll try not to 
>> derail this conversation. Otherwise, details below:
> 
> I think it's ok to discuss it here.
>  
>> We monitor our spark clusters with ganglia, and historically, we maintain at 
>> least 90% cpu utilization across the cluster. Making a single configuration 
>> change to use coarse-grained execution instead of fine-grained consistently 
>> yields a cpu utilization pattern that starts around 90% at the beginning of 
>> the job, and then it slowly decreases over the next 1-1.5 hours to level out 
>> around 65% cpu utilization on the cluster. Does anyone have a clue why I'd 
>> be seeing such a negative effect of switching to coarse-grained mode? GC 
>> activity is comparable in both cases. I've tried 1.5.2, as well as the 1.6.0 
>> preview tag that's on github.
> 
> I'm not very familiar with Ganglia, and how it computes utilization. But one 
> thing comes to mind: did you enable dynamic allocation on coarse-grained mode?
> 
> iulian


Re: Unchecked contribution (JIRA and PR)

2015-11-03 Thread Jerry Lam
Sergio, you are not alone for sure. Check the RowSimilarity implementation
[SPARK-4823]. It has been there for 6 months. It is very likely that
contributions which are not merged into the version of Spark they were
developed against will never be merged, because Spark changes quite
significantly from version to version if the algorithm depends heavily on
internal APIs.

On Tue, Nov 3, 2015 at 10:24 AM, Reynold Xin  wrote:

> Sergio,
>
> Usually it takes a lot of effort to get something merged into Spark
> itself, especially for relatively new algorithms that might not have
> established itself yet. I will leave it to mllib maintainers to comment on
> the specifics of the individual algorithms proposed here.
>
> Just another general comment: we have been working on making packages be
> as easy to use as possible for Spark users. Right now it only requires a
> simple flag to pass to the spark-submit script to include a package.
>
>
> On Tue, Nov 3, 2015 at 2:49 AM, Sergio Ramírez  wrote:
>
>> Hello all:
>>
>> I developed two packages for MLlib in March. These have been also upload
>> to the spark-packages repository. Associated to these packages, I created
>> two JIRA's threads and the correspondent pull requests, which are listed
>> below:
>>
>> https://github.com/apache/spark/pull/5184
>> https://github.com/apache/spark/pull/5170
>>
>> https://issues.apache.org/jira/browse/SPARK-6531
>> https://issues.apache.org/jira/browse/SPARK-6509
>>
>> These remain unassigned in JIRA and unverified in GitHub.
>>
>> Could anyone explain why are they in this state yet? Is it normal?
>>
>> Thanks!
>>
>> Sergio R.
>>
>> --
>>
>> Sergio Ramírez Gallego
>> Research group on Soft Computing and Intelligent Information Systems,
>> Dept. Computer Science and Artificial Intelligence,
>> University of Granada, Granada, Spain.
>> Email: srami...@decsai.ugr.es
>> Research Group URL: http://sci2s.ugr.es/
>>
>> -
>>
>> This email and any file attached to it (when applicable) contain(s)
>> confidential information that is exclusively addressed to its
>> recipient(s). If you are not the indicated recipient, you are informed
>> that reading, using, disseminating and/or copying it without
>> authorisation is forbidden in accordance with the legislation in effect.
>> If you have received this email by mistake, please immediately notify
>> the sender of the situation by resending it to their email address.
>> Avoid printing this message if it is not absolutely necessary.
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Jerry Lam
We "used" Spark on Mesos to build an interactive data analysis platform,
because an interactive session could be long and might not use Spark for
the entire session. It would be very wasteful of resources to use
coarse-grained mode, because it keeps resources for the entire session.
Therefore, fine-grained mode was used.

Knowing that Spark now supports dynamic resource allocation with
coarse-grained mode, we were thinking about using it. However, we decided to
switch to YARN because, in addition to dynamic allocation, it has better
support for security.

On Tue, Nov 3, 2015 at 7:22 PM, Soren Macbeth  wrote:

> we use fine-grained mode. coarse-grained mode keeps JVMs around which
> often leads to OOMs, which in turn kill the entire executor, causing entire
> stages to be retried. In fine-grained mode, only the task fails and
> subsequently gets retried without taking out an entire stage or worse.
>
> On Tue, Nov 3, 2015 at 3:54 PM, Reynold Xin  wrote:
>
>> If you are using Spark with Mesos fine grained mode, can you please
>> respond to this email explaining why you use it over the coarse grained
>> mode?
>>
>> Thanks.
>>
>>
>


Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-21 Thread Jerry Lam
Hi guys,

There is another memory issue. I'm not sure if this is related to Tungsten
this time because I have it disabled (spark.sql.tungsten.enabled=false). It
happens more when there are too many tasks running (300); I need to limit the
number of tasks to avoid this. The executor has 6G. Spark 1.5.1 is being used.
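
For reference, a sketch of the kind of configuration used to cap concurrency
here (the exact values are assumptions, not what was actually run):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.mesos.coarse", "true")           // coarse-grained: 1 core per task
  .set("spark.cores.max", "20")                // caps concurrently running tasks
  .set("spark.executor.memory", "6g")
  .set("spark.sql.tungsten.enabled", "false")  // Spark 1.5-only switch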

Best Regards,

Jerry

org.apache.spark.SparkException: Task failed while writing rows.
at 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:393)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Unable to acquire 67108864 bytes of memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:74)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:56)
at 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:339)


On Tue, Oct 20, 2015 at 9:10 PM, Reynold Xin <r...@databricks.com> wrote:

> With Jerry's permission, sending this back to the dev list to close the
> loop.
>
>
> -- Forwarded message ----------
> From: Jerry Lam <chiling...@gmail.com>
> Date: Tue, Oct 20, 2015 at 3:54 PM
> Subject: Re: If you use Spark 1.5 and disabled Tungsten mode ...
> To: Reynold Xin <r...@databricks.com>
>
>
> Yup, coarse-grained mode works just fine. :)
> The difference is that, by default, coarse-grained mode uses 1 core per
> task. If I constrain it to 20 cores in total, there can be only 20 tasks
> running at the same time. However, with fine-grained mode, I cannot set the
> total number of cores and therefore there could be 200+ tasks running at
> the same time (it is dynamic). So could it be that the calculation of how
> much memory to acquire fails when the number of cores cannot be known ahead
> of time, because you cannot assume that X tasks are running in an executor?
> Just my guess...
>
>
> On Tue, Oct 20, 2015 at 6:24 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Can you try coarse-grained mode and see if it is the same?
>>
>>
>> On Tue, Oct 20, 2015 at 3:20 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Reynold,
>>>
>>> Yes, I'm using 1.5.1. I see them quite often. Sometimes it recovers but
>>> sometimes it does not. For one particular job, it failed all the time with
>>> the acquire-memory issue. I'm using spark on mesos with fine grained mode.
>>> Does it make a difference?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>> On Tue, Oct 20, 2015 at 5:27 PM, Reynold Xin <r...@databricks.com>
>>> wrote:
>>>
>>>> Jerry - I think that's been fixed in 1.5.1. Do you still see it?
>>>>
>>>> On Tue, Oct 20, 2015 at 2:11 PM, Jerry Lam <chiling...@gmail.com>
>>>> wrote:
>>>>
>>>>> I disabled it because of the "Could not acquire 65536 bytes of
>>>>> memory". It happens to fail the job. So for now, I'm not touching it.
>>>>>
>>>>> On Tue, Oct 20, 2015 at 4:48 PM, charmee <charm...@gmail.com> wrote:
>>>>>
>>>>>> We had disabled tungsten after we found few performance issues, but
>>>>>> had to
>>>>>> enable it back because we found that when we had large number of
>>>>>> group by
>>>>>> fields, if tungsten is disabled the shuffle keeps failing.
>>>>>>
>>>>>> Here is an excerpt from one of our engineers with his analysis.
>>>>>>
>>>>>> With Tungsten

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-21 Thread Jerry Lam
Yes. The crazy thing about Mesos running in fine-grained mode is that there
is no way (correct me if I'm wrong) to set the number of cores per executor.
If one of my Mesos slaves has 32 cores, fine-grained mode can allocate 32
cores on that executor for the job, and if there are 32 tasks running on the
executor at the same time, that is when the acquire-memory issue appears. Of
course, the 32 cores are dynamically allocated, so Mesos can take them back
or hand them out again depending on cluster utilization.

On Wed, Oct 21, 2015 at 5:13 PM, Reynold Xin <r...@databricks.com> wrote:

> Is this still Mesos fine grained mode?
>
>
> On Wed, Oct 21, 2015 at 1:16 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi guys,
>>
>> There is another memory issue. I'm not sure if this is related to Tungsten
>> this time because I have it disabled (spark.sql.tungsten.enabled=false). It
>> happens more when there are too many tasks running (300); I need to limit
>> the number of tasks to avoid this. The executor has 6G. Spark 1.5.1 is
>> being used.
>>
>> Best Regards,
>>
>> Jerry
>>
>> org.apache.spark.SparkException: Task failed while writing rows.
>>  at 
>> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:393)
>>  at 
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>>  at 
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>  at org.apache.spark.scheduler.Task.run(Task.scala:88)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>  at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.io.IOException: Unable to acquire 67108864 bytes of memory
>>  at 
>> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
>>  at 
>> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
>>  at 
>> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
>>  at 
>> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:74)
>>  at 
>> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:56)
>>  at 
>> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:339)
>>
>>
>> On Tue, Oct 20, 2015 at 9:10 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> With Jerry's permission, sending this back to the dev list to close the
>>> loop.
>>>
>>>
>>> -- Forwarded message --
>>> From: Jerry Lam <chiling...@gmail.com>
>>> Date: Tue, Oct 20, 2015 at 3:54 PM
>>> Subject: Re: If you use Spark 1.5 and disabled Tungsten mode ...
>>> To: Reynold Xin <r...@databricks.com>
>>>
>>>
>>> Yup, coarse grained mode works just fine. :)
>>> The difference is that by default, coarse grained mode uses 1 core per
>>> task. If I constraint 20 cores in total, there can be only 20 tasks running
>>> at the same time. However, with fine grained, I cannot set the total number
>>> of cores and therefore, it could be +200 tasks running at the same time (It
>>> is dynamic). So it might be the calculation of how much memory to acquire
>>> fail when the number of cores cannot be known ahead of time because you
>>> cannot make the assumption that X tasks running in an executor? Just my
>>> guess...
>>>
>>>
>>> On Tue, Oct 20, 2015 at 6:24 PM, Reynold Xin <r...@databricks.com>
>>> wrote:
>>>
>>>> Can you try coarse-grained mode and see if it is the same?
>>>>
>>>>
>>>> On Tue, Oct 20, 2015 at 3:20 PM, Jerry Lam <chiling...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Reynold,
>>>>>
>>>>> Yes, I'm using 1.5.1. I see them quite often. Sometimes it recovers
>>>>> but sometimes it does not. For one particular job, it failed all the time
>>>>>

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-20 Thread Jerry Lam
I disabled it because of the "Could not acquire 65536 bytes of memory"
error, which causes the job to fail. So for now, I'm not touching it.

On Tue, Oct 20, 2015 at 4:48 PM, charmee  wrote:

> We had disabled tungsten after we found few performance issues, but had to
> enable it back because we found that when we had large number of group by
> fields, if tungsten is disabled the shuffle keeps failing.
>
> Here is an excerpt from one of our engineers with his analysis.
>
> With Tungsten Enabled (default in spark 1.5):
> ~90 files of 0.5G each:
>
> Ingest (after applying broadcast lookups) : 54 min
> Aggregation (~30 fields in group by and another 40 in aggregation) : 18 min
>
> With Tungsten Disabled:
>
> Ingest : 30 min
> Aggregation : Erroring out
>
> On smaller tests we found that joins are slow with tungsten enabled. With
> GROUP BY, disabling tungsten is not working in the first place.
>
> Hope this helps.
>
> -Charmee
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/If-you-use-Spark-1-5-and-disabled-Tungsten-mode-tp14604p14711.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-20 Thread Jerry Lam
Hi Reynold,

Yes, I'm using 1.5.1. I see them quite often. Sometimes it recovers but
sometimes it does not. For one particular job, it failed all the time with
the acquire-memory issue. I'm using spark on mesos with fine grained mode.
Does it make a difference?

Best Regards,

Jerry

On Tue, Oct 20, 2015 at 5:27 PM, Reynold Xin <r...@databricks.com> wrote:

> Jerry - I think that's been fixed in 1.5.1. Do you still see it?
>
> On Tue, Oct 20, 2015 at 2:11 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> I disabled it because of the "Could not acquire 65536 bytes of memory".
>> It happens to fail the job. So for now, I'm not touching it.
>>
>> On Tue, Oct 20, 2015 at 4:48 PM, charmee <charm...@gmail.com> wrote:
>>
>>> We had disabled tungsten after we found few performance issues, but had
>>> to
>>> enable it back because we found that when we had large number of group by
>>> fields, if tungsten is disabled the shuffle keeps failing.
>>>
>>> Here is an excerpt from one of our engineers with his analysis.
>>>
>>> With Tungsten Enabled (default in spark 1.5):
>>> ~90 files of 0.5G each:
>>>
>>> Ingest (after applying broadcast lookups) : 54 min
>>> Aggregation (~30 fields in group by and another 40 in aggregation) : 18
>>> min
>>>
>>> With Tungsten Disabled:
>>>
>>> Ingest : 30 min
>>> Aggregation : Erroring out
>>>
>>> In smaller tests we found that joins are slow with Tungsten enabled. With
>>> GROUP BY, disabling Tungsten does not work in the first place.
>>>
>>> Hope this helps.
>>>
>>> -Charmee
>>>
>>>
>>
>


Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-28 Thread Jerry Lam
Hi Spark Developers,

The Spark 1.5.1 documentation is already publicly accessible (
https://spark.apache.org/docs/latest/index.html) but the release is not. Is
it intentional?

Best Regards,

Jerry

On Mon, Sep 28, 2015 at 9:21 AM, james  wrote:

> +1
>
> 1) Built the binary with: ./make-distribution.sh --tgz --skip-java-test
> -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver
> -DskipTests
> 2) Run Spark SQL with YARN client mode
>
> This 1.5.1 RC1 package has better test results than the previous 1.5.0,
> except for the open issues SPARK-10484 and SPARK-4266.
>
>


Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
Hi Spark Developers,

I just ran some very simple operations on a dataset. I was surprised by the
execution plan of take(1), head() or first().

For your reference, this is what I did in pyspark 1.5:
df=sqlContext.read.parquet("someparquetfiles")
df.head()

The above lines take over 15 minutes. I was frustrated because I can do
better without using Spark :) Since I like Spark, I tried to figure out
why. It seems the DataFrame requires 3 stages to give me the first row. It
reads all the data (about 1 billion rows) and runs Limit twice.

Instead of head(), show(1) runs much faster. Not to mention that if I do:

df.rdd.take(1) //runs much faster.

Is this expected? Why are head/first/take so slow for DataFrames? Is it a bug
in the optimizer, or did I do something wrong?

Best Regards,

Jerry
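A minimal pyspark 1.5 sketch of the workarounds mentioned above, reusing the
same placeholder parquet path from the example:

==
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="head-workaround")
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet("someparquetfiles")  # placeholder path

df.show(1)                   # much faster than head() in this report
first_row = df.rdd.take(1)   # also fast: goes through the RDD take() path
# df.head()                  # the slow call described above
==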


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
I just noticed you found that 1.4 has the same issue. I added that to the
ticket as well.

On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Yin,
>
> You are right! I just tried the Scala version with the above lines, and it
> works as expected.
> I'm not sure if it also happens in 1.4 for pyspark, but I thought the
> pyspark code just calls the Scala code via py4j. I didn't expect this bug
> to be pyspark-specific; that actually surprises me a bit. I created a
> ticket for this (SPARK-10731
> <https://issues.apache.org/jira/browse/SPARK-10731>).
>
> Best Regards,
>
> Jerry
>
>
> On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai <yh...@databricks.com> wrote:
>
>> btw, does 1.4 have the same problem?
>>
>> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote:
>>
>>> Hi Jerry,
>>>
>>> Looks like it is a Python-specific issue. Can you create a JIRA?
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>>
>>>> Hi Spark Developers,
>>>>
>>>> I just ran some very simple operations on a dataset. I was surprised by
>>>> the execution plan of take(1), head() or first().
>>>>
>>>> For your reference, this is what I did in pyspark 1.5:
>>>> df=sqlContext.read.parquet("someparquetfiles")
>>>> df.head()
>>>>
>>>> The above lines take over 15 minutes. I was frustrated because I can do
>>>> better without using Spark :) Since I like Spark, I tried to figure out
>>>> why. It seems the DataFrame requires 3 stages to give me the first row. It
>>>> reads all the data (about 1 billion rows) and runs Limit twice.
>>>>
>>>> Instead of head(), show(1) runs much faster. Not to mention that if I
>>>> do:
>>>>
>>>> df.rdd.take(1) //runs much faster.
>>>>
>>>> Is this expected? Why are head/first/take so slow for DataFrames? Is it a
>>>> bug in the optimizer, or did I do something wrong?
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry
>>>>
>>>
>>>
>>
>
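One way to see where the extra stages come from is to compare the physical
plans. A minimal pyspark sketch, reusing the placeholder parquet path from the
earlier example; explain(True) prints both the logical and physical plans:

==
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="plan-comparison")
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet("someparquetfiles")  # placeholder path

# Compare the plan for an explicit limit against the plan for the full scan.
df.limit(1).explain(True)
df.explain(True)
==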


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
Hi Yin,

You are right! I just tried the Scala version with the above lines, and it
works as expected.
I'm not sure if it also happens in 1.4 for pyspark, but I thought the
pyspark code just calls the Scala code via py4j. I didn't expect this bug
to be pyspark-specific; that actually surprises me a bit. I created a
ticket for this (SPARK-10731
<https://issues.apache.org/jira/browse/SPARK-10731>).

Best Regards,

Jerry


On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai <yh...@databricks.com> wrote:

> btw, does 1.4 have the same problem?
>
> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote:
>
>> Hi Jerry,
>>
>> Looks like it is a Python-specific issue. Can you create a JIRA?
>>
>> Thanks,
>>
>> Yin
>>
>> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Spark Developers,
>>>
>>> I just ran some very simple operations on a dataset. I was surprised by
>>> the execution plan of take(1), head() or first().
>>>
>>> For your reference, this is what I did in pyspark 1.5:
>>> df=sqlContext.read.parquet("someparquetfiles")
>>> df.head()
>>>
>>> The above lines take over 15 minutes. I was frustrated because I can do
>>> better without using Spark :) Since I like Spark, I tried to figure out
>>> why. It seems the DataFrame requires 3 stages to give me the first row. It
>>> reads all the data (about 1 billion rows) and runs Limit twice.
>>>
>>> Instead of head(), show(1) runs much faster. Not to mention that if I do:
>>>
>>> df.rdd.take(1) //runs much faster.
>>>
>>> Is this expected? Why are head/first/take so slow for DataFrames? Is it a
>>> bug in the optimizer, or did I do something wrong?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>
>>
>


Re: [ANNOUNCE] Announcing Spark 1.5.0

2015-09-09 Thread Jerry Lam
Hi Spark Developers,

I'm eager to try it out! However, I'm running into problems resolving dependencies:
[warn] [NOT FOUND  ]
org.apache.spark#spark-core_2.10;1.5.0!spark-core_2.10.jar (0ms)
[warn]  jcenter: tried

When will the package be available?

Best Regards,

Jerry


On Wed, Sep 9, 2015 at 9:30 AM, Dimitris Kouzis - Loukas 
wrote:

> Yeii!
>
> On Wed, Sep 9, 2015 at 2:25 PM, Yu Ishikawa 
> wrote:
>
>> Great work, everyone!
>>
>>
>>
>> -
>> -- Yu Ishikawa
>>
>>
>


Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-17 Thread Jerry Lam
Hi Nick,

I forgot to mention in the survey that Ganglia is never installed properly,
for some reason.

I have this exception every time I launched the cluster:

Starting httpd: httpd: Syntax error on line 154 of
/etc/httpd/conf/httpd.conf: Cannot load
/etc/httpd/modules/mod_authz_core.so into server:
/etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No
such file or directory

[FAILED]

Best Regards,

Jerry

On Mon, Aug 17, 2015 at 11:09 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Howdy folks!

 I’m interested in hearing about what people think of spark-ec2
 http://spark.apache.org/docs/latest/ec2-scripts.html outside of the
 formal JIRA process. Your answers will all be anonymous and public.

 If the embedded form below doesn’t work for you, you can use this link to
 get the same survey:

 http://goo.gl/forms/erct2s6KRR

 Cheers!
 Nick
 ​



Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
Yes. 

Sent from my iPhone

 On 19 Jul, 2015, at 10:52 pm, Jahagirdar, Madhu 
 madhu.jahagir...@philips.com wrote:
 
 All,
 
 Can we run different versions of Spark using the same Mesos Dispatcher? For
 example, can we run drivers with Spark 1.3 and Spark 1.4 at the same time?
 
 Regards,
 Madhu Jahagirdar
 
 The information contained in this message may be confidential and legally 
 protected under applicable law. The message is intended solely for the 
 addressee(s). If you are not the intended recipient, you are hereby notified 
 that any use, forwarding, dissemination, or reproduction of this message is 
 strictly prohibited and may be unlawful. If you are not the intended 
 recipient, please contact the sender by return e-mail and destroy all copies 
 of the original message.


Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
I have only used client mode with both 1.3 and 1.4 on Mesos.
I skimmed through
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala.
I would actually backport the Cluster Mode feature. Sorry, I don't have an
answer for this.


On Sun, Jul 19, 2015 at 11:16 PM, Jahagirdar, Madhu 
madhu.jahagir...@philips.com wrote:

  1.3 does not have MesosDispatcher, nor does it have support for Mesos
 cluster mode. Is it still possible to create a dispatcher using 1.4 and
 run 1.3 using that dispatcher?
  --
 *From:* Jerry Lam [chiling...@gmail.com]
 *Sent:* Monday, July 20, 2015 8:27 AM
 *To:* Jahagirdar, Madhu
 *Cc:* user; dev@spark.apache.org
 *Subject:* Re: Spark Mesos Dispatcher

   Yes.

 Sent from my iPhone

 On 19 Jul, 2015, at 10:52 pm, Jahagirdar, Madhu 
 madhu.jahagir...@philips.com wrote:

   All,

  Can we run different versions of Spark using the same Mesos Dispatcher?
 For example, can we run drivers with Spark 1.3 and Spark 1.4 at the same
 time?

  Regards,
 Madhu Jahagirdar

 --
 The information contained in this message may be confidential and legally
 protected under applicable law. The message is intended solely for the
 addressee(s). If you are not the intended recipient, you are hereby
 notified that any use, forwarding, dissemination, or reproduction of this
 message is strictly prohibited and may be unlawful. If you are not the
 intended recipient, please contact the sender by return e-mail and destroy
 all copies of the original message.




Re: [PySpark DataFrame] When a Row is not a Row

2015-07-11 Thread Jerry Lam
Hi guys,

I just hit the same problem. It is very confusing when Row is not the same
Row type at runtime. The worst thing is that when I use Spark in local mode,
the Row is the same Row type! So it passes the test cases but fails when
I deploy the application.

Can someone suggest a workaround?

Best Regards,

Jerry
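A minimal workaround sketch: rather than relying on the concrete Row class
(whose identity can differ between local and cluster runs), normalize rows to
plain dicts before comparing them in tests. asDict() is a pyspark Row method;
the sample data below is made up for illustration:

==
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="row-identity-workaround")
sqlContext = SQLContext(sc)

def normalize(row):
    # Duck-typing keeps this independent of which Row class the runtime produced.
    return row.asDict() if hasattr(row, "asDict") else dict(row)

df = sqlContext.createDataFrame([Row(id=1, value="a")])
expected = Row(id=1, value="a")
assert normalize(df.first()) == normalize(expected)
==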


