ClosureCleaner slowing down Spark SQL queries

2015-05-27 Thread Nitin Goyal
Hi All,

I am running a SQL query (Spark version 1.2) on a table created from a
unionAll of 3 schema RDDs, which gets executed in roughly 400ms (roughly
200ms at the driver and 200ms at the executors).

If I run the same query on a table created from a unionAll of 27 schema
RDDs, I see that the executor time stays the same (because of the
concurrency and the nature of my query), but the driver time shoots up to
600ms, so the total query time becomes 600 + 200 = 800ms.
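
For concreteness, here is a minimal PySpark sketch of this kind of setup. It
is purely illustrative: it uses the 1.3+ DataFrame API (createDataFrame /
unionAll) rather than the 1.2 SchemaRDD API of my actual job, and the schema,
table name, and sizes are made up.

import time
from functools import reduce

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="union-timing-sketch")
sqlContext = SQLContext(sc)

def make_piece(i):
    # One small DataFrame per source; all pieces share the same schema.
    return sqlContext.createDataFrame([Row(part=i, value=j) for j in range(1000)])

for n in (3, 27):
    unioned = reduce(lambda a, b: a.unionAll(b), [make_piece(i) for i in range(n)])
    unioned.registerTempTable("t")

    start = time.time()
    sqlContext.sql("SELECT part, COUNT(*) AS c FROM t GROUP BY part").collect()
    # The driver-side cost (planning, closure cleaning) is what grows with n.
    print("%d pieces: %.0f ms" % (n, (time.time() - start) * 1000))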

I attached JProfiler and found that the ClosureCleaner clean method is
taking time at the driver (some issue related to URLClassLoader), and that
this time increases linearly with the number of RDDs being unioned into the
table the query runs against. This makes the query take far longer than I
expect; it should execute within roughly 400ms irrespective of the number of
RDDs, since I have enough executors to handle the load. Please find below
links to the JProfiler screenshots:

http://pasteboard.co/MnQtB4o.png

http://pasteboard.co/MnrzHwJ.png

Any help or suggestions to fix this would be highly appreciated, since this
needs to be fixed for production.

Thanks in Advance,
Nitin



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tp12466.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: ClosureCleaner slowing down Spark SQL queries

2015-05-27 Thread Ted Yu
Can you try your query using Spark 1.4.0 RC2?

There have been some fixes since 1.2.0
e.g.
SPARK-7233 ClosureCleaner#clean blocks concurrent job submitter threads
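
For context, SPARK-7233 is about jobs being submitted concurrently from
multiple threads, roughly the pattern in the purely illustrative PySpark
sketch below (the table, query, and sizes are made up):

import threading

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="concurrent-submit-sketch")
sqlContext = SQLContext(sc)
sqlContext.createDataFrame([Row(x=i) for i in range(1000)]).registerTempTable("some_table")

def run_query(i, results):
    # Each thread submits its own SQL job through the shared SQLContext.
    results[i] = sqlContext.sql("SELECT COUNT(*) FROM some_table").collect()

results = {}
threads = [threading.Thread(target=run_query, args=(i, results)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)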

Cheers

On Wed, May 27, 2015 at 10:38 AM, Nitin Goyal  wrote:

> Hi All,
>
> I am running a SQL query (spark version 1.2) on a table created from
> unionAll of 3 schema RDDs which gets executed in roughly 400ms (200ms at
> driver and roughly 200ms at executors).
>
> If I run same query on a table created from unionAll of 27 schema RDDS, I
> see that executors time is same(because of concurrency and nature of my
> query) but driver time shoots to 600ms (and total query time being = 600 +
> 200 = 800ms).
>
> I attached JProfiler and found that ClosureCleaner clean method is taking
> time at driver(some issue related to URLClassLoader) and it linearly
> increases with number of RDDs being union-ed on which query is getting
> fired. This is causing my query to take a huge amount of time where I
> expect
> the query to be executed within 400ms irrespective of number of RDDs (since
> I have executors available to cater my need). PFB the links of screenshots
> from Jprofiler :-
>
> http://pasteboard.co/MnQtB4o.png
>
> http://pasteboard.co/MnrzHwJ.png
>
> Any help/suggestion to fix this will be highly appreciated since this needs
> to be fixed for production
>
> Thanks in Advance,
> Nitin
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tp12466.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
>
>


Re: ClosureCleaner slowing down Spark SQL queries

2015-05-27 Thread Nitin Goyal
Hi Ted,

Thanks a lot for replying. First of all, moving to 1.4.0 RC2 is not easy for
us, as the migration cost is big since a lot has changed in Spark SQL since
1.2.

Regarding SPARK-7233, I had already looked at it a few hours back. It solves
the problem for concurrent queries, but my problem occurs for a single
query. I also looked at the fix's code diff, and it isn't related to the
problem that seems to exist in the ClosureCleaner code.

Thanks
-Nitin



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tp12466p12468.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: Spark 1.4.0 pyspark and pylint breaking

2015-05-27 Thread Michael Nazario
I've done some investigation into what work is needed to keep the _types
module named types. This isn't a relative/absolute path problem, but
actually a problem with the way the tests were run.

I've filed a JIRA ticket on it here:
https://issues.apache.org/jira/browse/SPARK-7899

I also have a pull request for fixing this here: 
https://github.com/apache/spark/pull/6439
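
For anyone who hasn't followed the ticket, the difference comes down to how
Python sets up sys.path for the two invocation styles. Here is a generic,
hypothetical probe (a mypkg/sub layout standing in for pyspark/sql; the
actual fix is in the PR above). This describes Python 3 semantics; Python 2's
implicit relative imports add another wrinkle.

# probe.py -- drop it into mypkg/sub/ next to a file named types.py and
# compare the two invocation styles:
#
#   python mypkg/sub/probe.py   -> sys.path[0] is mypkg/sub, so the sibling
#                                  types.py shadows the standard-library module
#   python -m mypkg.sub.probe   -> sys.path[0] is the current directory, so
#                                  'import types' resolves to the stdlib module
import sys
import types

print("sys.path[0]  =", repr(sys.path[0]))
print("types module =", types.__file__)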

Michael

From: Davies Liu [dav...@databricks.com]
Sent: Tuesday, May 26, 2015 4:18 PM
To: Punyashloka Biswal
Cc: Justin Uang; dev@spark.apache.org
Subject: Re: Spark 1.4.0 pyspark and pylint breaking

I don't think relative imports can help in this case.

When you run scripts inside pyspark/sql, Python doesn't know anything about
the pyspark.sql package; it just sees types.py as a separate, top-level
module.
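
To make the conflict concrete, here is a small self-contained sketch of that
shadowing (temporary files, not the real pyspark layout):

import os
import subprocess
import sys
import tempfile

workdir = tempfile.mkdtemp()

# A stand-in for pyspark/sql/types.py.
with open(os.path.join(workdir, "types.py"), "w") as f:
    f.write("SHADOWED = True\n")

# A stand-in for a test script living in the same directory.
with open(os.path.join(workdir, "run_tests.py"), "w") as f:
    f.write("import types\nprint(getattr(types, 'SHADOWED', 'stdlib types'))\n")

# Prints True: when a script is run directly, its own directory is first on
# sys.path, so the sibling types.py wins over the standard-library module.
subprocess.check_call([sys.executable, os.path.join(workdir, "run_tests.py")])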

On Tue, May 26, 2015 at 12:44 PM, Punyashloka Biswal
 wrote:
> Davies: Can we use relative imports (import .types) in the unit tests in
> order to disambiguate between the global and local module?
>
> Punya
>
> On Tue, May 26, 2015 at 3:09 PM Justin Uang  wrote:
>>
>> Thanks for clarifying! I don't understand python package and module names
>> that well, but I thought that the package namespacing would've helped, since
>> you are in pyspark.sql.types. I guess not?
>>
>> On Tue, May 26, 2015 at 3:03 PM Davies Liu  wrote:
>>>
>>> There is a module called 'types' in python 3:
>>>
>>> davies@localhost:~/work/spark$ python3
>>> Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
>>> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
>>> Type "help", "copyright", "credits" or "license" for more information.
>>> >>> import types
>>> >>> types
>>> <module 'types' from
>>> '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py'>
>>>
>>> Without renaming, our `types.py` will conflict with it when you run
>>> unittests in pyspark/sql/ .
>>>
>>> On Tue, May 26, 2015 at 11:57 AM, Justin Uang 
>>> wrote:
>>> > In commit 04e44b37, the migration to Python 3, pyspark/sql/types.py was
>>> > renamed to pyspark/sql/_types.py and then some magic in
>>> > pyspark/sql/__init__.py dynamically renamed the module back to types. I
>>> > imagine that this is some naming conflict with Python 3, but what was
>>> > the
>>> > error that showed up?
>>> >
>>> > The reason why I'm asking about this is because it's messing with
>>> > pylint,
>>> > since pylint cannot now statically find the module. I tried also
>>> > importing
>>> > the package so that __init__ would be run in an init-hook, but that
>>> > isn't
>>> > what the discovery mechanism is using. I imagine it's probably just
>>> > crawling
>>> > the directory structure.
>>> >
>>> > One way to work around this would be something akin to this
>>> >
>>> > (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports),
>>> > where I would have to create a fake module, but I would probably be
>>> > missing
>>> > a ton of pylint features on users of that module, and it's pretty
>>> > hacky.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org





[build system] jenkins downtime tomorrow morning ~730am PDT

2015-05-27 Thread shane knapp
i'm going to be performing system, jenkins, and plugin updates tomorrow
morning beginning at 730am PDT.

0700:  pause build queue
0800:  kill off any errant jobs (retrigger when everything comes back up)
0800-0900:  system and plugin updates
0900-1000:  final debugging, roll back plugin versions if things get borked

i'll post updates as things progress, and am hoping to have full service
restored by 10am

shane


Re: [VOTE] Release Apache Spark 1.4.0 (RC2)

2015-05-27 Thread jameszhouyi
-1, SPARK-7119 blocker issue



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-4-0-RC2-tp12420p12472.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SparkR and RDDs

2015-05-27 Thread Shivaram Venkataraman
Sorry for the delay in getting back on this. So the RDD interface is
private in the 1.4 release but as Alek mentioned you can still use it by
prefixing `SparkR:::`.

Regarding design direction -- there are two JIRAs which cover major
features we plan to work on for 1.5. SPARK-6805 tracks porting high-level
machine learning operations like `glm` and `kmeans` to SparkR using the ML
Pipeline implementation in Scala as the backend.

We are also planning to develop a parallel API where users can run native R
functions in a distributed setting and SPARK-7264 tracks this effort. If
you have specific use cases feel free to chime in on the JIRA or on the dev
mailing list.

Thanks
Shivaram

On Tue, May 26, 2015 at 11:40 AM, Reynold Xin  wrote:

> You definitely don't want to implement kmeans in R, since it would be very
> slow. Just providing R wrappers for the MLlib implementation is the way to
> go. I believe one of the major items in SparkR next is the MLlib wrappers.
>
>
>
> On Tue, May 26, 2015 at 7:46 AM, Andrew Psaltis 
> wrote:
>
>> Hi Alek,
>> Thanks for the info. You are correct that using the three colons does
>> work. Admittedly I am an R novice, but since the three colons are used to
>> access hidden methods, it seems pretty dirty.
>>
>> Can someone shed light on the design direction being taken with SparkR?
>> Should I really be accessing hidden methods, or will a better approach
>> prevail? For instance, it feels like the k-means sample should really use
>> MLlib and not just be a port of the k-means sample using hidden methods. Am I
>> looking at this incorrectly?
>>
>> Thanks,
>> Andrew
>>
>> On Tue, May 26, 2015 at 6:56 AM, Eskilson,Aleksander <
>> alek.eskil...@cerner.com> wrote:
>>
>>>  From the changes to the namespace file, that appears to be correct,
>>> all methods of the RDD API have been made private, which in R means that
>>> you may still access them by using the namespace prefix SparkR with three
>>> colons, e.g. SparkR:::func(foo, bar).
>>>
>>>  So a starting place for porting old SparkR scripts from before the
>>> merge could be to identify those methods in the script belonging to the RDD
>>> class and be sure they have the namespace identifier tacked on the front. I
>>> hope that helps.
>>>
>>>  Regards,
>>> Alek Eskilson
>>>
>>>   From: Andrew Psaltis 
>>> Date: Monday, May 25, 2015 at 6:25 PM
>>> To: "dev@spark.apache.org" 
>>> Subject: SparkR and RDDs
>>>
>>>   Hi,
>>> I understand from SPARK-6799[1] and the respective merge commit [2]
>>>  that the RDD class is private in Spark 1.4. If I wanted to modify the old
>>> Kmeans and/or LR examples so that the computation happened in Spark, what is
>>> the best direction to go? Sorry if I am missing something obvious, but
>>> based on the NAMESPACE file [3] in the SparkR codebase I am having trouble
>>> seeing the obvious direction to go.
>>>
>>>  Thanks in advance,
>>> Andrew
>>>
>>>  [1] https://issues.apache.org/jira/browse/SPARK-6799
>>> 
>>> [2]
>>> https://github.com/apache/spark/commit/4b91e18d9b7803dbfe1e1cf20b46163d8cb8716c
>>> 
>>> [3] https://github.com/apache/spark/blob/branch-1.4/R/pkg/NAMESPACE
>>> 
>>>
>>>
>>
>>
>


Re: Available Functions in SparkR

2015-05-27 Thread Shivaram Venkataraman
For the 1.4 release the DataFrame API will be publicly available and the
documentation at
http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-docs/sql-programming-guide.html
(Click on the R tab) provides a good summary of the available functions.

As I described in the other email to the dev list, we are still collecting
feedback on a parallel API for SparkR as we feel the RDD API is too
low-level. We would like to hear any use-cases you have as it will be
valuable in designing the API.

Thanks
Shivaram

On Fri, May 22, 2015 at 7:34 AM, Eskilson,Aleksander <
alek.eskil...@cerner.com> wrote:

>  I’ve built Spark 1.4.0 for Hadoop 2.6 on CDH 5.4 and am testing SparkR.
> I’ve loaded up SparkR using the executable in /bin. The library import
> library(SparkR) seems to no longer import some of the same functions as it
> did for SparkR before the merge, e.g. textFile, lapply, etc., but it does
> include sparkR.init, take, and other original functions. How is it planned
> to access the full set of functions in the repl with the coming version of
> SparkR?
>
>  Thanks,
> Alek Eskilson
>


Re: SparkR and RDDs

2015-05-27 Thread Andrew Psaltis
Hi Shivaram,
Thanks for the details, it is greatly appreciated.

Thanks

On Wed, May 27, 2015 at 7:25 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Sorry for the delay in getting back on this. So the RDD interface is
> private in the 1.4 release but as Alek mentioned you can still use it by
> prefixing `SparkR:::`.
>
> Regarding design direction -- there are two JIRAs which cover major
> features we plan to work on for 1.5. SPARK-6805 tracks porting high-level
> machine learning operations like `glm` and `kmeans` to SparkR using the ML
> Pipeline implementation in Scala as the backend.
>
> We are also planning to develop a parallel API where users can run native
> R functions in a distributed setting and SPARK-7264 tracks this effort. If
> you have specific use cases feel free to chime in on the JIRA or on the dev
> mailing list.
>
> Thanks
> Shivaram
>
> On Tue, May 26, 2015 at 11:40 AM, Reynold Xin  wrote:
>
>> You definitely don't want to implement kmeans in R, since it would be
>> very slow. Just providing R wrappers for the MLlib implementation is the
>> way to go. I believe one of the major items in SparkR next is the MLlib
>> wrappers.
>>
>>
>>
>> On Tue, May 26, 2015 at 7:46 AM, Andrew Psaltis > > wrote:
>>
>>> Hi Alek,
>>> Thanks for the info. You are correct that using the three colons does
>>> work. Admittedly I am an R novice, but since the three colons are used to
>>> access hidden methods, it seems pretty dirty.
>>>
>>> Can someone shed light on the design direction being taken with SparkR?
>>> Should I really be accessing hidden methods, or will a better approach
>>> prevail? For instance, it feels like the k-means sample should really use
>>> MLlib and not just be a port of the k-means sample using hidden methods. Am I
>>> looking at this incorrectly?
>>>
>>> Thanks,
>>> Andrew
>>>
>>> On Tue, May 26, 2015 at 6:56 AM, Eskilson,Aleksander <
>>> alek.eskil...@cerner.com> wrote:
>>>
  From the changes to the namespace file, that appears to be correct,
 all methods of the RDD API have been made private, which in R means that
 you may still access them by using the namespace prefix SparkR with three
 colons, e.g. SparkR:::func(foo, bar).

  So a starting place for porting old SparkR scripts from before the
 merge could be to identify those methods in the script belonging to the RDD
 class and be sure they have the namespace identifier tacked on the front. I
 hope that helps.

  Regards,
 Alek Eskilson

   From: Andrew Psaltis 
 Date: Monday, May 25, 2015 at 6:25 PM
 To: "dev@spark.apache.org" 
 Subject: SparkR and RDDs

   Hi,
 I understand from SPARK-6799[1] and the respective merge commit [2]
  that the RDD class is private in Spark 1.4. If I wanted to modify the old
 Kmeans and/or LR examples so that the computation happened in Spark, what is
 the best direction to go? Sorry if I am missing something obvious, but
 based on the NAMESPACE file [3] in the SparkR codebase I am having trouble
 seeing the obvious direction to go.

  Thanks in advance,
 Andrew

  [1] https://issues.apache.org/jira/browse/SPARK-6799
 
 [2]
 https://github.com/apache/spark/commit/4b91e18d9b7803dbfe1e1cf20b46163d8cb8716c
 
 [3] https://github.com/apache/spark/blob/branch-1.4/R/pkg/NAMESPACE
 


Re: [VOTE] Release Apache Spark 1.4.0 (RC2)

2015-05-27 Thread Patrick Wendell
Hi James,

As I said before, that is not a blocker issue for this release. Thanks.
Separately, there are some comments in this code review that indicate
you may be facing a bug in your own code rather than in Spark:

https://github.com/apache/spark/pull/5688#issuecomment-104491410

Please follow up on that issue outside of the vote thread.

Thanks!

On Wed, May 27, 2015 at 5:22 PM, jameszhouyi  wrote:
> -1 , SPARK-7119 blocker issue
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-4-0-RC2-tp12420p12472.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org