dockerhub does not contain apache/spark-py 3.4.1

2023-08-09 Thread Mark Elliot
Hello,

I noticed that the apache/spark-py image for Spark's 3.4.1 release is not
available (apache/spark@3.4.1 is available). Would it be possible to get
the 3.4.1 release build for the apache/spark-py image published?

Thanks,

Mark


Spark DataFrame Creation

2020-07-22 Thread Mark Bidewell
Sorry if this is the wrong place for this.  I am trying to debug an issue
with this library:
https://github.com/springml/spark-sftp

When I attempt to create a dataframe:

spark.read.
format("com.springml.spark.sftp").
option("host", "...").
option("username", "...").
option("password", "...").
option("fileType", "csv").
option("inferSchema", "true").
option("tempLocation", "/srv/spark/tmp").
option("hdfsTempLocation", "/srv/spark/tmp").
load("...")

What I am seeing is that the download is occurring on the Spark driver, not
on a Spark worker. This leads to a failure when Spark tries to create the
DataFrame on the worker.

I'm confused by this behavior. My understanding was that load() was lazily
executed on the Spark workers. Why would some steps be executing on the
driver?
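
The spark-sftp connector appears to fetch the remote file to the configured
temp location from the driver before handing it to Spark, which would explain
the driver-side activity; load() is only lazy for the Spark side of the read.
If it fits the pipeline, one way to sidestep this is to stage the file onto
HDFS first and read it with the built-in CSV source. A minimal sketch, with
placeholder paths:

// Minimal sketch: stage the file on HDFS outside Spark, then read it with the
// built-in CSV source so the scan runs on the executors. Paths are placeholders.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs:///staging/incoming/data.csv")

df.count()   // executed on the executors, not the driver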

Thanks for your help
-- 
Mark Bidewell
http://www.linkedin.com/in/markbidewell


Can I set the Alluxio WriteType in Spark applications?

2019-09-17 Thread Mark Zhao
Hi,

If Spark applications write data into alluxio, can WriteType be configured?
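
Alluxio's client exposes the write type through the property
alluxio.user.file.writetype.default. Below is a minimal sketch of setting it
from a Spark application via the Hadoop configuration; treat the exact wiring
as an assumption to verify against your Alluxio version (passing it as a JVM
system property on the driver and executors is the other commonly documented
route). The master address and output path are placeholders.

// Minimal sketch: ask the Alluxio client to write CACHE_THROUGH (cache in
// Alluxio and also persist to the under store).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
spark.sparkContext.hadoopConfiguration
  .set("alluxio.user.file.writetype.default", "CACHE_THROUGH")

spark.range(100).toDF("id")
  .write.parquet("alluxio://alluxio-master:19998/output/ids")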

Thanks,
Mark


What is directory "/path/_spark_metadata" for?

2019-08-28 Thread Mark Zhao
 Hey,

 When running Spark on Alluxio 1.8.2, I encounter the following exception in
the Alluxio master.log: "alluxio.exception.FileDoesNotExistException: Path
/test-data/_spark_metadata does not exist". What exactly is the
"_spark_metadata" directory used for? And how can I fix this problem?
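
For background: _spark_metadata is the log that Structured Streaming's file
sink keeps inside its output directory to record which files each batch
committed, so its presence normally means the path was written by a streaming
query. The FileDoesNotExistException in the Alluxio master log is typically
just Spark's existence check for that directory when the path is read as a
plain (non-streaming) source. A minimal sketch of the kind of query that
creates it, with placeholder paths:

// Minimal sketch: a Structured Streaming file sink creates _spark_metadata
// under its output path to track committed files.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val input = spark.readStream.text("hdfs:///incoming/logs")

val query = input.writeStream
  .format("parquet")
  .option("checkpointLocation", "alluxio://alluxio-master:19998/checkpoints/logs")
  .start("alluxio://alluxio-master:19998/test-data")   // writes /test-data/_spark_metadata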

Thanks.

Mark


Re: RE - Apache Spark compatibility with Hadoop 2.9.2

2019-06-23 Thread Mark Bidewell
Note that we selected Spark 2.2.2 because we were trying to align with DSE
Search 6.  A new version might have fewer issues.

On Sun, Jun 23, 2019 at 10:56 AM Bipul kumar 
wrote:

> Hi Mark,
>
> Thanks for your wonderful  suggestion.
> I look forward to try that version.
>
> Respectfully,
> Bipul
> PUBLIC KEY <http://ix.io/1nWf>
> 97F0 2E08 7DE7 D538 BDFA  B708 86D8 BE27 8196 D466
> ** Please excuse brevity and typos. **
>
>
> On Sun, Jun 23, 2019 at 8:06 PM Mark Bidewell  wrote:
>
>> I have done a setup with Hadoop 2.9.2 and Spark 2.2.2. Apache Zeppelin
>> is fine, but some of our internally developed apps need work on their dependencies.
>>
>> On Sun, Jun 23, 2019, 07:50 Bipul kumar 
>> wrote:
>>
>>> Hello People !
>>>
>>> I am new to Apache Spark and have just started learning it.
>>> A few questions I am hoping to get answered here:
>>> 1. Are there compatibility constraints between Apache Spark and Hadoop?
>>> Say I am running Hadoop 2.9.2; which Apache Spark version should I use?
>>>
>>> 2. As mentioned, I am using Hadoop 2.9.2 with a single-node cluster
>>> installed. What would be the preferred way (standalone | YARN | SIMR) to
>>> install Apache Spark on Hadoop (single-node cluster)?
>>>
>>> Thank you.
>>> Respectfully,
>>> Bipul
>>> PUBLIC KEY <http://ix.io/1nWf>
>>> 97F0 2E08 7DE7 D538 BDFA  B708 86D8 BE27 8196 D466
>>> ** Please excuse brevity and typos. **
>>>
>>

-- 
Mark Bidewell
http://www.linkedin.com/in/markbidewell


Re: RE - Apache Spark compatibility with Hadoop 2.9.2

2019-06-23 Thread Mark Bidewell
I have done a setup with Hadoop 2.9.2 and Spark 2.2.2. Apache Zeppelin is
fine, but some of our internally developed apps need work on their dependencies.

On Sun, Jun 23, 2019, 07:50 Bipul kumar  wrote:

> Hello People !
>
> I am new to Apache Spark and have just started learning it.
> A few questions I am hoping to get answered here:
> 1. Are there compatibility constraints between Apache Spark and Hadoop?
> Say I am running Hadoop 2.9.2; which Apache Spark version should I use?
>
> 2. As mentioned, I am using Hadoop 2.9.2 with a single-node cluster
> installed. What would be the preferred way (standalone | YARN | SIMR) to
> install Apache Spark on Hadoop (single-node cluster)?
>
> Thank you.
> Respectfully,
> Bipul
> PUBLIC KEY 
> 97F0 2E08 7DE7 D538 BDFA  B708 86D8 BE27 8196 D466
> ** Please excuse brevity and typos. **
>


Re: Multiple sessions in one application?

2018-12-21 Thread Mark Hamstra
On the contrary, it is a common occurrence in a Spark Jobserver style of
application with multiple users.
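
For reference, the API that makes this work is SparkSession.newSession(),
which gives each user an isolated session (its own SQL conf, temp views, and
registered UDFs) on top of the one shared SparkContext. A minimal sketch; the
per-user hand-out is an assumption about how a Jobserver-style app might use
it:

// Minimal sketch: one long-running application, one SparkContext, and a
// separate isolated SparkSession per user of the service.
import org.apache.spark.sql.SparkSession

val base = SparkSession.builder().appName("multi-user-service").getOrCreate()

val aliceSession = base.newSession()   // shares base.sparkContext
val bobSession   = base.newSession()

aliceSession.range(10).createOrReplaceTempView("t")   // visible only in Alice's session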


On Thu, Dec 20, 2018 at 6:09 PM Jiaan Geng  wrote:

> This scene is rare.
> When you provide a web server for spark. maybe you need it.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: How to track batch jobs in spark ?

2018-12-05 Thread Mark Hamstra
That will kill an entire Spark application, not a batch Job.
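
Inside a single application, the closest thing to a handle on a batch job is a
job group set on the SparkContext before the action runs. A minimal sketch;
the group id is a placeholder, and this assumes the cancel request arrives on
another thread of the same driver:

// Minimal sketch: tag the jobs spawned by an action with a group id, then
// cancel just that group rather than killing the whole application.
// setJobGroup must run on the same thread that triggers the action.
val sc = spark.sparkContext   // assumes an existing SparkSession named spark

sc.setJobGroup("report-42", "aggregation for report 42", interruptOnCancel = true)
spark.range(1000000000L).selectExpr("sum(id)").collect()   // these Jobs belong to "report-42"
sc.clearJobGroup()

// from another driver thread handling a "kill" request for that batch:
sc.cancelJobGroup("report-42")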

On Wed, Dec 5, 2018 at 3:07 PM Priya Matpadi  wrote:

> if you are deploying your spark application on YARN cluster,
> 1. ssh into master node
> 2. List the currently running application and retrieve the application_id
> yarn application --list
> 3. Kill the application using application_id of the form
> application_x_ from output of list command
> yarn application --kill 
>
> On Wed, Dec 5, 2018 at 1:42 PM kant kodali  wrote:
>
>> Hi All,
>>
>> How to track batch jobs in spark? For example, is there some id or token
>> i can get after I spawn a batch job and use it to track the progress or to
>> kill the batch job itself?
>>
>> For Streaming, we have StreamingQuery.id()
>>
>> Thanks!
>>
>


Spark and Zookeeper HA failures

2018-11-29 Thread Mark Bidewell
I am trying to set up a Spark cluster with multi-master HA.  I have 3 spark
nodes connecting to a single zookeeper node running on a separate server.
When running in this configuration, over the course of 1-2 hours each node
ends its ZooKeeper session because it is not receiving any messages from the
server. The standby nodes reconnect, but when a leader loses its session, it
immediately exits.

The net result is that the cluster slowly dies as each master ends its
session and terminates.

The spark cluster is not in use so I don't think this is a GC issue.
Pings, etc seem reliable.  I have tried adjusting timeouts but that doesn't
work either.

Any ideas how to resolve this?

Thanks!

-- 
Mark Bidewell
http://www.linkedin.com/in/markbidewell


Re: Scala: The Util is not accessible in def main

2018-11-11 Thread Mark Hamstra
It is intentionally not accessible in your code since Utils is internal
Spark code, not part of the public API. Changing Spark to make that private
code public would be inviting trouble, or at least future headaches. If you
don't already know how to build and maintain your own custom fork of Spark
with those private Utils made public, then you probably shouldn't be
thinking about doing so.
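
If only a single small helper is needed, reimplementing it locally is far less
trouble than forking Spark. A minimal sketch of an equivalent non-negative
modulo, written from its standard definition rather than copied from Spark
internals:

// Minimal sketch: a local stand-in for the one internal helper being used.
// Maps x into [0, mod) even when x is negative.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

// used exactly like the original snippet (tokens, x, y as defined there):
val temp = tokens.map(word => nonNegativeMod(x, y))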

On Sun, Nov 11, 2018 at 2:13 AM Soheil Pourbafrani 
wrote:

> Hi,
> I want to use org.apache.spark.util.Utils library in def main but I got
> the error:
>
> Symbol Utils is not accessible from this place. Here is the code:
>
> val temp = tokens.map(word => Utils.nonNegativeMod(x, y))
>
> How can I make it accessible?
>


Re: Custom SparkListener

2018-09-20 Thread Mark Hamstra
What do you mean? Spark Jobs don't have names.
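
What a listener does get is the job id, plus any job group id or job
description that the submitting code set via setJobGroup/setJobDescription. A
minimal sketch of correlating start and end events that way; the bookkeeping
and log output are assumptions, while the event fields and property key are
Spark's:

// Minimal sketch: remember the job description seen at start, then report
// success or failure at end. Jobs only carry a description/group if the
// submitting code set one.
import scala.collection.concurrent.TrieMap
import org.apache.spark.scheduler._

class JobOutcomeListener extends SparkListener {
  private val descriptions = TrieMap.empty[Int, String]

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val desc = Option(jobStart.properties)
      .flatMap(p => Option(p.getProperty("spark.job.description")))
      .getOrElse(s"job-${jobStart.jobId}")
    descriptions.put(jobStart.jobId, desc)
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    val desc = descriptions.remove(jobEnd.jobId).getOrElse(s"job-${jobEnd.jobId}")
    val status = if (jobEnd.jobResult == JobSucceeded) "SUCCEEDED" else "FAILED"
    println(s"$desc: $status")   // replace with your log-file writer
  }
}

The listener can be registered with spark.sparkContext.addSparkListener(...)
or through the spark.extraListeners configuration.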

On Thu, Sep 20, 2018 at 9:40 PM Priya Ch 
wrote:

> Hello All,
>
> I am trying to extend SparkListener and post job ends trying to retrieve
> job name to check the status of either success/failure and write to log
> file.
>
> I couldn't find a way where I could fetch job name in the onJobEnd method.
>
> Thanks,
> Padma CH
>


Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Mark Hamstra
What is the disadvantage to deprecating now in 2.4.0? I mean, it doesn't
change the code at all; it's just a notification that we will eventually
cease supporting Py2. Wouldn't users prefer to get that notification sooner
rather than later?

On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia 
wrote:

> I’d like to understand the maintenance burden of Python 2 before
> deprecating it. Since it is not EOL yet, it might make sense to only
> deprecate it once it’s EOL (which is still over a year from now).
> Supporting Python 2+3 seems less burdensome than supporting, say, multiple
> Scala versions in the same codebase, so what are we losing out?
>
> The other thing is that even though Python core devs might not support 2.x
> later, it’s quite possible that various Linux distros will if moving from 2
> to 3 remains painful. In that case, we may want Apache Spark to continue
> releasing for it despite the Python core devs not supporting it.
>
> Basically, I’d suggest to deprecate this in Spark 3.0 and then remove it
> later in 3.x instead of deprecating it in 2.4. I’d also consider looking at
> what other data science tools are doing before fully removing it: for
> example, if Pandas and TensorFlow no longer support Python 2 past some
> point, that might be a good point to remove it.
>
> Matei
>
> > On Sep 17, 2018, at 11:01 AM, Mark Hamstra 
> wrote:
> >
> > If we're going to do that, then we need to do it right now, since 2.4.0
> is already in release candidates.
> >
> > On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson 
> wrote:
> > I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem
> like a ways off but even now there may be some spark versions supporting
> Py2 past the point where Py2 is no longer receiving security patches
> >
> >
> > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra 
> wrote:
> > We could also deprecate Py2 already in the 2.4.0 release.
> >
> > On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
> wrote:
> > In case this didn't make it onto this thread:
> >
> > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
> remove it entirely on a later 3.x release.
> >
> > On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
> wrote:
> > On a separate dev@spark thread, I raised a question of whether or not
> to support python 2 in Apache Spark, going forward into Spark 3.0.
> >
> > Python-2 is going EOL at the end of 2019. The upcoming release of Spark
> 3.0 is an opportunity to make breaking changes to Spark's APIs, and so it
> is a good time to consider support for Python-2 on PySpark.
> >
> > Key advantages to dropping Python 2 are:
> >   • Support for PySpark becomes significantly easier.
> >   • Avoid having to support Python 2 until Spark 4.0, which is
> likely to imply supporting Python 2 for some time after it goes EOL.
> > (Note that supporting python 2 after EOL means, among other things, that
> PySpark would be supporting a version of python that was no longer
> receiving security patches)
> >
> > The main disadvantage is that PySpark users who have legacy python-2
> code would have to migrate their code to python 3 to take advantage of
> Spark 3.0
> >
> > This decision obviously has large implications for the Apache Spark
> community and we want to solicit community feedback.
> >
> >
>
>


Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Mark Hamstra
If we're going to do that, then we need to do it right now, since 2.4.0 is
already in release candidates.

On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson  wrote:

> I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem
> like a ways off but even now there may be some spark versions supporting
> Py2 past the point where Py2 is no longer receiving security patches
>
>
> On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra 
> wrote:
>
>> We could also deprecate Py2 already in the 2.4.0 release.
>>
>> On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
>> wrote:
>>
>>> In case this didn't make it onto this thread:
>>>
>>> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
>>> remove it entirely on a later 3.x release.
>>>
>>> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
>>> wrote:
>>>
>>>> On a separate dev@spark thread, I raised a question of whether or not
>>>> to support python 2 in Apache Spark, going forward into Spark 3.0.
>>>>
>>>> Python-2 is going EOL <https://github.com/python/devguide/pull/344> at
>>>> the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
>>>> make breaking changes to Spark's APIs, and so it is a good time to consider
>>>> support for Python-2 on PySpark.
>>>>
>>>> Key advantages to dropping Python 2 are:
>>>>
>>>>- Support for PySpark becomes significantly easier.
>>>>- Avoid having to support Python 2 until Spark 4.0, which is likely
>>>>to imply supporting Python 2 for some time after it goes EOL.
>>>>
>>>> (Note that supporting python 2 after EOL means, among other things,
>>>> that PySpark would be supporting a version of python that was no longer
>>>> receiving security patches)
>>>>
>>>> The main disadvantage is that PySpark users who have legacy python-2
>>>> code would have to migrate their code to python 3 to take advantage of
>>>> Spark 3.0
>>>>
>>>> This decision obviously has large implications for the Apache Spark
>>>> community and we want to solicit community feedback.
>>>>
>>>>
>>>


Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Mark Hamstra
We could also deprecate Py2 already in the 2.4.0 release.

On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson  wrote:

> In case this didn't make it onto this thread:
>
> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove
> it entirely on a later 3.x release.
>
> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
> wrote:
>
>> On a separate dev@spark thread, I raised a question of whether or not to
>> support python 2 in Apache Spark, going forward into Spark 3.0.
>>
>> Python-2 is going EOL  at
>> the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
>> make breaking changes to Spark's APIs, and so it is a good time to consider
>> support for Python-2 on PySpark.
>>
>> Key advantages to dropping Python 2 are:
>>
>>- Support for PySpark becomes significantly easier.
>>- Avoid having to support Python 2 until Spark 4.0, which is likely
>>to imply supporting Python 2 for some time after it goes EOL.
>>
>> (Note that supporting python 2 after EOL means, among other things, that
>> PySpark would be supporting a version of python that was no longer
>> receiving security patches)
>>
>> The main disadvantage is that PySpark users who have legacy python-2 code
>> would have to migrate their code to python 3 to take advantage of Spark 3.0
>>
>> This decision obviously has large implications for the Apache Spark
>> community and we want to solicit community feedback.
>>
>>
>


Re: [SPARK on MESOS] Avoid re-fetching Spark binary

2018-07-10 Thread Mark Hamstra
It's been done many times before by many organizations. Use Spark Job
Server or Livy or create your own implementation of a similar long-running
Spark Application. Creating a new Application for every Job is not the way
to achieve low-latency performance.
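
A minimal sketch of the "create your own" shape: one application that stays up
and turns each incoming request into another Job on the warm SparkContext. The
queue hand-off below stands in for whatever REST layer feeds it; Livy and
Spark Job Server provide the production-grade version of this.

// Minimal sketch: one application, one SparkContext that stays up, and
// incoming requests become Jobs inside it instead of new applications.
import java.util.concurrent.LinkedBlockingQueue
import org.apache.spark.sql.SparkSession

object LongRunningServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("job-server-like").getOrCreate()
    val requests = new LinkedBlockingQueue[String]()   // fed by your REST endpoint

    while (true) {
      val inputPath = requests.take()                  // blocks until a request arrives
      // Each request is just another Job on the already-warm context:
      val count = spark.read.textFile(inputPath).count()
      println(s"$inputPath -> $count")
    }
  }
}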

On Tue, Jul 10, 2018 at 4:18 AM  wrote:

> Dear,
>
> Our jobs are triggered by users on demand.
> And new job will be submitted to Spark server via REST API. The 2-4
> seconds of latency is mainly because of the initialization of SparkContext
> every time new job is submitted, as you have mentioned.
>
> If you are aware of a way to avoid this initialization, could you please
> share it. That would be perfect for our case.
>
> Best
> Tien Dat
>
> 
> Essentially correct. The latency to start a Spark Job is nowhere close to
> 2-4 seconds under typical conditions. Creating a new Spark Application
> every time instead of running multiple Jobs in one Application is not going
> to lead to acceptable interactive or real-time performance, nor is that an
> execution model that Spark is ever likely to support in trying to meet
> low-latency requirements. As such, reducing Application startup time (not
> Job startup time) is not a priority.
>
> On Fri, Jul 6, 2018 at 4:06 PM Timothy Chen  wrote:
>
> > I know there are some community efforts shown in Spark summits before,
> > mostly around reusing the same Spark context with multiple “jobs”.
> >
> > I don’t think reducing Spark job startup time is a community priority
> > afaik.
> >
> > Tim
> > On Fri, Jul 6, 2018 at 7:12 PM Tien Dat  wrote:
> >
> >> Dear Timothy,
> >>
> >> It works like a charm now.
> >>
> >> BTW (don't judge me if I am to greedy :-)), the latency to start a Spark
> >> job
> >> is around 2-4 seconds, unless I am not aware of some awesome
> optimization
> >> on
> >> Spark. Do you know if Spark community is working on reducing this
> >> latency?
> >>
> >> Best
> >>
> >>
> >>
> >> --
> >> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> >>
> >> -
> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >>
> >>
>
> 
> Quoted from:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-on-MESOS-Avoid-re-fetching-Spark-binary-tp32849p32865.html
>
>
> _
> Sent from http://apache-spark-user-list.1001560.n3.nabble.com
>
>


Re: [SPARK on MESOS] Avoid re-fetching Spark binary

2018-07-07 Thread Mark Hamstra
Essentially correct. The latency to start a Spark Job is nowhere close to
2-4 seconds under typical conditions. Creating a new Spark Application
every time instead of running multiple Jobs in one Application is not going
to lead to acceptable interactive or real-time performance, nor is that an
execution model that Spark is ever likely to support in trying to meet
low-latency requirements. As such, reducing Application startup time (not
Job startup time) is not a priority.

On Fri, Jul 6, 2018 at 4:06 PM Timothy Chen  wrote:

> I know there are some community efforts shown in Spark summits before,
> mostly around reusing the same Spark context with multiple “jobs”.
>
> I don’t think reducing Spark job startup time is a community priority
> afaik.
>
> Tim
> On Fri, Jul 6, 2018 at 7:12 PM Tien Dat  wrote:
>
>> Dear Timothy,
>>
>> It works like a charm now.
>>
>> BTW (don't judge me if I am to greedy :-)), the latency to start a Spark
>> job
>> is around 2-4 seconds, unless I am not aware of some awesome optimization
>> on
>> Spark. Do you know if Spark community is working on reducing this latency?
>>
>> Best
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: [SPARK on MESOS] Avoid re-fetching Spark binary

2018-07-06 Thread Mark Hamstra
The latency to start a Spark Job is nowhere close to 2-4 seconds under
typical conditions. You appear to be creating a new Spark Application
everytime instead of running multiple Jobs in one Application.

On Fri, Jul 6, 2018 at 3:12 AM Tien Dat  wrote:

> Dear Timothy,
>
> It works like a charm now.
>
> BTW (don't judge me if I am to greedy :-)), the latency to start a Spark
> job
> is around 2-4 seconds, unless I am not aware of some awesome optimization
> on
> Spark. Do you know if Spark community is working on reducing this latency?
>
> Best
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark horizontal scaling is not supported in which cluster mode? Ask

2018-05-21 Thread Mark Hamstra
Horizontal scaling is scaling across multiple, distributed computers (or at
least OS instances). Local mode is, therefore, by definition not
horizontally scalable since it just uses a configurable number of local
threads. If the question actually asked "which cluster manager...?", then I
have a small issue with it. Local mode isn't really a cluster manager,
since there is no cluster to manage. What it is is one of Spark's
scheduling modes.
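
The distinction is visible directly in the master URL: local mode only scales
by adding threads in one JVM, whereas a cluster master URL is what allows
scaling out across machines. A minimal sketch (the standalone host is a
placeholder):

// Minimal sketch: local[N] means N scheduler threads in a single JVM (no
// horizontal scaling), while a cluster master URL distributes work across machines.
import org.apache.spark.sql.SparkSession

val localSpark = SparkSession.builder()
  .master("local[8]")            // 8 threads, one machine
  .appName("local-mode")
  .getOrCreate()

// versus, e.g., a standalone cluster (placeholder host):
// SparkSession.builder().master("spark://master-host:7077").appName("clustered").getOrCreate()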

On Mon, May 21, 2018 at 9:29 AM unk1102  wrote:

> Hi, I came across a Spark question about which Spark cluster manager does
> not support horizontal scalability. The answer options were Mesos, YARN,
> Standalone, and local mode. I believe all cluster managers are horizontally
> scalable; please correct me if I am wrong. I think the answer is local mode.
> Is that true? Please guide. Thanks in advance.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: [Spark 2.x Core] .collect() size limit

2018-04-28 Thread Mark Hamstra
spark.driver.maxResultSize

http://spark.apache.org/docs/latest/configuration.html
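
In practice the ceiling is set by the driver heap together with
spark.driver.maxResultSize (1g by default; 0 disables the check). A minimal
sketch of raising it, plus an alternative that avoids holding the whole result
at once; the values and path are illustrative:

// Minimal sketch: raise the serialized-result cap (it must still fit in the
// driver heap), or iterate partition-by-partition instead of collect().
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.driver.maxResultSize", "4g")
  .getOrCreate()

val rdd = spark.sparkContext.textFile("hdfs:///some/input")   // placeholder path

// Only one partition's worth of data is on the driver at a time:
rdd.toLocalIterator.foreach(line => println(line.length))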

On Sat, Apr 28, 2018 at 8:41 AM, klrmowse  wrote:

> i am currently trying to find a workaround for the Spark application i am
> working on so that it does not have to use .collect()
>
> but, for now, it is going to have to use .collect()
>
> what is the size limit (memory for the driver) of RDD file that .collect()
> can work with?
>
> i've been scouring google-search - S.O., blogs, etc, and everyone is
> cautioning about .collect(), but does not specify how huge is huge... are
> we
> talking about a few gigabytes? terabytes?? petabytes???
>
>
>
> thank you
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark with Scala 2.12

2018-04-21 Thread Mark Hamstra
Even more to the point:
http://apache-spark-developers-list.1001551.n3.nabble.com/Scala-2-12-support-td23833.html

tldr; It's an item of discussion, but there is no imminent release of Spark
that will use Scala 2.12.

On Sat, Apr 21, 2018 at 2:44 AM, purijatin  wrote:

> I see a discussion post on the dev mailing list:
> http://apache-spark-developers-list.1001551.n3.nabble.com/time-for-Apache-
> Spark-3-0-td23755.html#a23830
>
> Thanks
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Performance of Spark when the compute and storage are separated

2018-04-15 Thread Mark Hamstra
Keep forgetting to reply to user list...

On Sun, Apr 15, 2018 at 1:58 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> Sure, data locality all the way at the basic storage layer is the easy way
> to avoid paying the costs of remote I/O. My point, though, is that that
> kind of storage locality isn't necessarily the only way to get acceptable
> performance -- it really does depend heavily on your use case and on your
> performance expectations/requirements. In some cases, it can even be
> acceptable to do query federation between data centers, where some of the
> storage is really remote and the costs to access it are quite high; but if
> you're not doing something like trying to bring over all of the remote
> data, and if you are reusing many times the bit of data that you did bring
> in with the very expensive I/O and then cached, overall performance can be
> quite acceptable.
>
> On Sun, Apr 15, 2018 at 1:46 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Thanks Mark,
>>
>> I guess this may be broadened to the concept of separate compute from
>> storage. Your point on " ... can kind of disappear after the data is
>> first read from the storage layer." reminds of performing Logical IOs as
>> opposed to Physical IOs. But again as you correctly pointed out on the
>> amount of available cache and concurrency that can saturate the hits on the
>> storage. I personally believe that Data locality helps by avoiding these
>> remote IO calls
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 15 April 2018 at 21:22, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>>> This is a sort of your mileage varies type question.
>>>>
>>>
>>> Yes, it really does. Not only does it depend heavily on the
>>> configuration of your compute and storage, but it also depends a lot on any
>>> caching that you are doing between compute and storage and on the nature of
>>> your Spark queries/Jobs. If you are mostly doing cold full scans, then
>>> you're going to see a big performance hit. If you are reusing a lot of
>>> prior or intermediate results, then you are frequently not going all the
>>> way back to a slow storage layer, but rather to a Spark CachedTable, some
>>> other cache, or even the OS buffer cache for shuffle files -- or to local
>>> disk spillage. All of that is typically going to be local to your compute
>>> nodes, so the data locality issue can kind of disappear after the data is
>>> first read from the storage layer.
>>>
>>>
>>> On Sat, Apr 14, 2018 at 12:17 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> This is a sort of your mileage varies type question.
>>>>
>>>> In a classic Hadoop cluster, one has data locality when each node
>>>> includes the Spark libraries and HDFS data. this helps certain queries like
>>>> interactive BI.
>>>>
>>>> However running Spark over remote storage say Isilon scaled out NAS
>>>> instead of LOCAL HDFS becomes problematic. The full-scan Spark needs
>>>> to do will take much longer when it is done over the network (access the
>>>> remote Isilon storage) instead of local I/O request to HDFS.
>>>>
>>>> Has anyone done some comparative studies on this?
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>
>>>
>>
>


Which metrics would be best to alert on?

2018-04-05 Thread Mark Bonetti
Hi,
I'm building a monitoring system for Apache Spark and want to set up
default alerts (threshold or anomaly) on 2-3 key metrics everyone who uses
Spark typically wants to alert on, but I don't yet have production-grade
experience with Spark.

Importantly, alert rules have to be generally useful, so can't be on
metrics whose values vary wildly based on the size of deployment.

In other words, which metrics would be most significant indicators that
something went wrong with your Spark:
 - master
 - worker
 - driver
 - executor
 - streaming


I thought the best place to find experienced Spark users, who would find
answering this question trivial, would be here.

Thanks very much,
Mark Scott


Re: Databricks Serverless

2017-11-13 Thread Mark Hamstra
This is not a Databricks forum.

On Mon, Nov 13, 2017 at 3:18 PM, Benjamin Kim  wrote:

> I have a question about this. The documentation compares the concept
> similar to BigQuery. Does this mean that we will no longer need to deal
> with instances and just pay for execution duration and amount of data
> processed? I’m just curious about how this will be priced.
>
> Also, when will it be ready for production?
>
> Cheers.
>
>


Re: Dependency error due to scala version mismatch in SBT and Spark 2.1

2017-10-16 Thread Mark Hamstra
The canonical build of Spark is done using maven, not sbt. Maven and sbt do
things a bit differently. In order to get maven and sbt to each build Spark
quite similar to the way the other does, the builds are each driven through
a customization script -- build/mvn and build/sbt respectively. A lot of
work has gone into those scripts, configurations, etc. They work. Using an
arbitrary sbt version outside of the defined Spark build process is going
to lead to a lot of issues that are completely avoided by using the
documented Spark build procedure. So, back to my original question: Why?
Why do something that doesn't work when there is a well-defined,
well-maintained, documented way to build Spark with either maven or sbt?
https://spark.apache.org/docs/latest/building-spark.html

On Mon, Oct 16, 2017 at 12:26 AM, patel kumar <patel.kumar...@gmail.com>
wrote:

>  This is not the correct way to build Spark with sbt. Why ?
>
>
> On Sun, Oct 15, 2017 at 11:54 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> I am building Spark using build.sbt.
>>
>>
>> Which just gets me back to my original question: Why? This is not the
>> correct way to build Spark with sbt.
>>
>> On Sun, Oct 15, 2017 at 11:40 PM, patel kumar <patel.kumar...@gmail.com>
>> wrote:
>>
>>> I am building Spark using build.sbt. Details are mentioned in the
>>> original mail chain.
>>>
>>> When I was using Spark1.6 using scala 2.10, everything was working fine.
>>>
>>> Issue arises when I am updating my code to make it compatible with spark
>>> 2.1.
>>>
>>> It is failing while doing sbt assembly.
>>>
>>>
>>>
>>> On Sun, Oct 15, 2017 at 11:29 PM, Mark Hamstra <m...@clearstorydata.com>
>>> wrote:
>>>
>>>> I don't understand. Are you building Spark or something else?
>>>>
>>>> If you are trying to build Spark, it doesn't look like you are doing it
>>>> the right way. As I mentioned before, and is explained in detail in the
>>>> link I provided, building Spark with sbt is done via build/sbt, not by
>>>> directly invoking your choice of an sbt version.
>>>>
>>>> On Sun, Oct 15, 2017 at 11:26 PM, patel kumar <patel.kumar...@gmail.com
>>>> > wrote:
>>>>
>>>>> Because earlier, I was using sbt 0.13.13.1, and I was getting another
>>>>> version conflict i.e.
>>>>>
>>>>> *[error] Modules were resolved with conflicting cross-version suffixes
>>>>> in {file:/D:/Tools/scala_ide/test_workspace/test/NewSp*
>>>>> *arkTest/}newsparktest:*
>>>>> *[error]org.json4s:json4s-ast _2.10, _2.11*
>>>>> *[error]org.json4s:json4s-core _2.10, _2.11*
>>>>> *[trace] Stack trace suppressed: run last *:update for the full
>>>>> output.*
>>>>> *[error] (*:update) Conflicting cross-version suffixes in:
>>>>> org.json4s:json4s-ast, org.json4s:json4s-core*
>>>>>
>>>>>
>>>>> Actually, the issue comes from how sbt itself is compiled: sbt 0.13.x is
>>>>> compiled with Scala 2.10 and provides support for 2.11.x.
>>>>>
>>>>> When I run sbt, all the default jars are automatically downloaded with
>>>>> the suffix _2.10 (for Scala 2.10). When I run assembly, json4s jars for
>>>>> 2.11 are also downloaded.
>>>>>
>>>>> I upgraded sbt in the hope that this issue might be fixed in a newer
>>>>> version.
>>>>>
>>>>> sbt 1.0.2 introduced the same problem for the XML-related jars, since sbt
>>>>> 1.0.x is compiled with Scala 2.12.
>>>>>
>>>>> I looked for an sbt version that would download 2.11.x jars by default,
>>>>> but failed to find one.
>>>>>
>>>>> So, now I am looking for a solution by which I could override the jars
>>>>> that are downloaded by default during sbt.
>>>>>
>>>>>
>>>>> On Sun, Oct 15, 2017 at 11:03 PM, Mark Hamstra <
>>>>> m...@clearstorydata.com> wrote:
>>>>>
>>>>>> sbt version is 1.0.2.
>>>>>>
>>>>>>
>>>>>> Why?
>>>>>>
>>>>>> Building Spark with sbt is done via build/sbt, which will give you
>>>>>> sbt 0.13.11 when building Spark 2.1.0.
>>>>>>
>>>>>> https://spark.apache.org/docs/2

Re: SVD computation limit

2017-09-19 Thread Mark Bittmann
I've run into this before. The EigenValueDecomposition creates a Java Array
with 2*k*n elements. The Java Array is indexed with a native integer type,
so 2*k*n cannot exceed Integer.MAX_VALUE values.

The array is created here:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/EigenValueDecomposition.scala#L84

If you remove the requirement that 2*k*n < Integer.MAX_VALUE, you will run
into the issue described here: https://issues.apache.org/jira/browse/SPARK-5656
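
To make the bound concrete for this matrix: with n = 191077 columns, the
2*k*n < Integer.MAX_VALUE requirement caps k at a few thousand, which is why
k = 49865 is rejected while k = 3000 runs. A quick sketch of the arithmetic:

// Minimal sketch: the largest k that the 2*k*n-element work array allows
// for n = 191077 columns, given Java's Int-indexed arrays.
val n = 191077L
val maxK = Int.MaxValue / (2 * n)   // = 5619, so k = 49865 is far over the limit
println(maxK)                       // k = 3000 (the second attempt) stays under it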

On Tue, Sep 19, 2017 at 9:49 AM, Alexander Ovcharenko 
wrote:

> Hello guys,
>
> While trying to compute SVD using computeSVD() function, i am getting the
> following warning with the follow up exception:
> 17/09/14 12:29:02 WARN RowMatrix: computing svd with k=49865 and n=191077,
> please check necessity
> IllegalArgumentException: u'requirement failed: k = 49865 and/or n =
> 191077 are too large to compute an eigendecomposition'
>
> When I try to compute first 3000 singular values, I'm getting several
> following warnings every second:
> 17/09/14 13:43:38 WARN TaskSetManager: Stage 4802 contains a task of very
> large size (135 KB). The maximum recommended task size is 100 KB.
>
> The matrix size is 49865 x 191077 and all the singular values are needed.
>
> Is there a way to lift that limit and be able to compute whatever number
> of singular values?
>
> Thank you.
>
>
>


Re: How can i remove the need for calling cache

2017-08-01 Thread Mark Hamstra
Very likely, much of the potential duplication is already being avoided
even without calling cache/persist. When running the above code without
`myrdd.cache`, have you looked at the Spark web UI for the Jobs? For at
least one of them you will likely see that many Stages are marked as
"skipped", which means that prior shuffle files that cover the results of
those Stages were still available, so Spark did not recompute those
results. Spark will eventually clean up those shuffle files (unless you
hold onto a reference to them), but if your Jobs using myrdd run fairly
close together in time, then duplication is already minimized even without
an explicit cache call.
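
To answer the concrete workaround question: yes, once an RDD has been written
to HDFS and read back, the reloaded RDD can feed any number of actions without
redoing the upstream transformations. A minimal sketch reusing the names from
the example below; the path and element type are placeholders:

// Minimal sketch: persist the expensive intermediate result to HDFS once,
// then reuse the reloaded RDD for several independent actions.
myrdd.saveAsObjectFile("hdfs:///tmp/myrdd-snapshot")          // placeholder path

val reloaded = sc.objectFile[MyRecord]("hdfs:///tmp/myrdd-snapshot")  // MyRecord: your element type
val result1 = reloaded.map(op1(_))
val result2 = reloaded.map(op2(_))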

On Tue, Aug 1, 2017 at 11:05 AM, jeff saremi  wrote:

> Calling cache/persist fails all our jobs (i have  posted 2 threads on
> this).
>
> And we're giving up hope in finding a solution.
> So I'd like to find a workaround for that:
>
> If I save an RDD to hdfs and read it back, can I use it in more than one
> operation?
>
> Example: (using cache)
> // do a whole bunch of transformations on an RDD
>
> myrdd.cache()
>
> val result1 = myrdd.map(op1(_))
>
> val result2 = myrdd.map(op2(_))
>
> // in the above I am assuming that a call to cache will prevent all
> previous transformation from being calculated twice
>
> I'd like to somehow get result1 and result2 without duplicating work. How
> can I do that?
>
> thanks
>
> Jeff
>


Re: Spark on Cloudera Configuration (Scheduler Mode = FAIR)

2017-07-20 Thread Mark Hamstra
The fair scheduler doesn't have anything to do with reallocating resource
across Applications.

https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-across-applications
https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application

On Thu, Jul 20, 2017 at 2:02 PM, Gokula Krishnan D <email2...@gmail.com>
wrote:

> Mark, Thanks for the response.
>
> Let me rephrase my statements.
>
> "I am submitting a Spark application(*Application*#A) with scheduler.mode
> as FAIR and dynamicallocation=true and it got all the available executors.
>
> In the meantime, submitting another Spark Application (*Application* # B)
> with the scheduler.mode as FAIR and dynamicallocation=true but it got only
> one executor. "
>
> Thanks & Regards,
> Gokula Krishnan* (Gokul)*
>
> On Thu, Jul 20, 2017 at 4:56 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> First, Executors are not allocated to Jobs, but rather to Applications.
>> If you run multiple Jobs within a single Application, then each of the
>> Tasks associated with Stages of those Jobs has the potential to run on any
>> of the Application's Executors. Second, once a Task starts running on an
>> Executor, it has to complete before another Task can be scheduled using the
>> prior Task's resources -- the fair scheduler is not preemptive of running
>> Tasks.
>>
>> On Thu, Jul 20, 2017 at 1:45 PM, Gokula Krishnan D <email2...@gmail.com>
>> wrote:
>>
>>> Hello All,
>>>
>>> We are having cluster with 50 Executors each with 4 Cores so can avail
>>> max. 200 Executors.
>>>
>>> I am submitting a Spark application(JOB A) with scheduler.mode as FAIR
>>> and dynamicallocation=true and it got all the available executors.
>>>
>>> In the meantime, submitting another Spark Application (JOB B) with the
>>> scheduler.mode as FAIR and dynamicallocation=true but it got only one
>>> executor.
>>>
>>> Normally this situation occurs when any of the JOB runs with the
>>> Scheduler.mode= FIFO.
>>>
>>> 1) Have you ever faced this issue, and if so, how did you overcome it?
>>>
>>> I was under the impression that as soon as I submit JOB B, the Spark
>>> scheduler would release a few resources from JOB A and share them with
>>> JOB B in round-robin fashion.
>>>
>>> Appreciate your response !!!.
>>>
>>>
>>> Thanks & Regards,
>>> Gokula Krishnan* (Gokul)*
>>>
>>
>>
>


Re: Spark on Cloudera Configuration (Scheduler Mode = FAIR)

2017-07-20 Thread Mark Hamstra
First, Executors are not allocated to Jobs, but rather to Applications. If
you run multiple Jobs within a single Application, then each of the Tasks
associated with Stages of those Jobs has the potential to run on any of the
Application's Executors. Second, once a Task starts running on an Executor,
it has to complete before another Task can be scheduled using the prior
Task's resources -- the fair scheduler is not preemptive of running Tasks.

On Thu, Jul 20, 2017 at 1:45 PM, Gokula Krishnan D 
wrote:

> Hello All,
>
> We are having cluster with 50 Executors each with 4 Cores so can avail
> max. 200 Executors.
>
> I am submitting a Spark application(JOB A) with scheduler.mode as FAIR and
> dynamicallocation=true and it got all the available executors.
>
> In the meantime, submitting another Spark Application (JOB B) with the
> scheduler.mode as FAIR and dynamicallocation=true but it got only one
> executor.
>
> Normally this situation occurs when any of the JOB runs with the
> Scheduler.mode= FIFO.
>
> 1) Have you ever faced this issue, and if so, how did you overcome it?
>
> I was under the impression that as soon as I submit JOB B, the Spark
> scheduler would release a few resources from JOB A and share them with
> JOB B in round-robin fashion.
>
> Appreciate your response !!!.
>
>
> Thanks & Regards,
> Gokula Krishnan* (Gokul)*
>


Re: a stage can belong to more than one job please?

2017-06-06 Thread Mark Hamstra
Yes, a Stage can be part of more than one Job. The jobIds field of Stage is
used repeatedly in the DAGScheduler.

On Tue, Jun 6, 2017 at 5:04 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> Hi all,
>
> I read same code of spark about stage.
>
> The constructor of stage keep the first job  ID the stage was part of.
> does that means a stage can belong to more than one job
> please? And I find the member jobIds is never used. It looks strange.
>
>
> thanks adv
>


Re: Using SparkContext in Executors

2017-05-28 Thread Mark Hamstra
You can't do that. SparkContext and SparkSession can exist only on the
Driver.
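
One way to restructure the dump-parsing example is to keep every
SparkSession/SparkContext call on the driver and have the executors return
plain records. A minimal sketch, where the schema, paths and parsing body are
placeholders standing in for the original parse():

// Minimal sketch: executors run only plain Scala (parseRecords); the driver
// assembles the distributed result into a DataFrame and writes it with SparkSQL.
import org.apache.spark.sql.SparkSession

case class Record(id: String, payload: String)            // placeholder schema

// Replace the body with the real parsing logic; it must not touch Spark APIs.
def parseRecords(path: String): Seq[Record] =
  Seq(Record(path, "parsed-content-placeholder"))

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val dumpFiles = new java.io.File("/dump/dir").listFiles().map(_.getPath).toSeq
val parsed = spark.sparkContext
  .parallelize(dumpFiles, numSlices = 200)
  .flatMap(parseRecords)                                   // runs on the executors

parsed.toDF().write.mode("append").parquet("hdfs:///parsed/dumps")   // driver-side write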

On Sun, May 28, 2017 at 6:56 AM, Abdulfattah Safa 
wrote:

> How can I use SparkContext (to create Spark Session or Cassandra Sessions)
> in executors?
> If I pass it as parameter to the foreach or foreachpartition, then it will
> have a null value.
> Shall I create a new SparkContext in each executor?
>
> Here is what I'm trying to do:
> Read a dump directory with millions of dump files as follows:
>
> dumpFiles = Directory.listFiles(dumpDirectory)
> dumpFilesRDD = sparkContext.parallelize(dumpFiles, numOfSlices)
> dumpFilesRDD.foreachPartition(dumpFilePath->parse(dumpFilePath))
> .
> .
> .
>
> In parse(), each dump file is parsed and inserted into a database using
> SparkSQL. In order to do that, SparkContext is needed in the function parse
> to use the sql() method.
>


Re: 2.2. release date ?

2017-05-23 Thread Mark Hamstra
I heard that once we reach release candidates it's not a question of time
or a target date, but only whether blockers are resolved and the code is
ready to release.

On Tue, May 23, 2017 at 11:07 AM, kant kodali  wrote:

> Heard its end of this month (May)
>
> On Tue, May 23, 2017 at 9:41 AM, mojhaha kiklasds  > wrote:
>
>> Hello,
>>
>> I could see a RC2 candidate for Spark 2.2, but not sure about the
>> expected release timeline on that.
>> Would be great if somebody can confirm it.
>>
>> Thanks,
>> Mhojaha
>>
>
>


Re: scalastyle violation on mvn install but not on mvn package

2017-05-23 Thread Mark Hamstra
On Tue, May 23, 2017 at 7:48 AM, Xiangyu Li <yisky...@gmail.com> wrote:

> Thank you for the answer.
>
> So basically it is not recommended to install Spark to your local maven
> repository? I thought if they wanted to enforce scalastyle for better open
> source contributions, they would have fixed all the scalastyle warnings.
>

That isn't a valid conclusion. There is nothing wrong with using maven's
"install" with Spark. There shouldn't be any scalastyle violations.


> On a side note, my posts on Nabble never got accepted by the mailing list
> for some reason (I am subscribed to the mail list), and your reply does not
> show as a reply to my question on Nabble probably for the same reason.
> Sorry for the late reply but is using email the only way to communicate on
> the mail list? I got another reply to this question through email but the
> two replies are not even in the same "email conversation".
>

I don't know the mechanics of why posts do or don't show up via Nabble, but
Nabble is neither the canonical archive nor the system of record for Apache
mailing lists.


> On Thu, May 4, 2017 at 8:11 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> The check goal of the scalastyle plugin runs during the "verify" phase,
>> which is between "package" and "install"; so running just "package" will
>> not run scalastyle:check.
>>
>> On Thu, May 4, 2017 at 7:45 AM, yiskylee <yisky...@gmail.com> wrote:
>>
>>> ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
>>> package
>>> works, but
>>> ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
>>> install
>>> triggers scalastyle violation error.
>>>
>>> Is the scalastyle check not used on package but only on install? To
>>> install,
>>> should I turn off "failOnViolation" in the pom?
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/scalastyle-violation-on-mvn-install-bu
>>> t-not-on-mvn-package-tp28653.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>
>
> --
> Sincerely
> Xiangyu Li
>
> <yisky...@gmail.com>
>


Re: spark ML Recommender program

2017-05-17 Thread Mark Vervuurt
If you are running locally, try increasing driver memory to, for example, 4 GB
and executor memory to 3 GB.
Regards, Mark
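
Another mitigation often suggested for StackOverflowErrors as the ALS
iteration count grows (an addition here, not something from this thread) is to
set a checkpoint directory so the lineage ALS builds up is periodically
truncated. A minimal sketch with placeholder path and column names:

// Minimal sketch: with a checkpoint directory set, ALS checkpoints its factor
// RDDs every few iterations, so lineage depth stops growing with maxIter.
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/als-checkpoints")   // placeholder path

val als = new ALS()
  .setMaxIter(20)
  .setRank(10)
  .setCheckpointInterval(5)   // checkpoint every 5 iterations
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
// val model = als.fit(ratingsDF)   // ratingsDF: your ratings DataFrame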

> On 18 May 2017, at 05:15, Arun <arunbm...@gmail.com 
> <mailto:arunbm...@gmail.com>> wrote:
> 
> hi
> 
> I am writing a Spark ML movie recommender program in IntelliJ on Windows 10.
> The dataset is 2 MB with 10 datapoints, and my laptop has 8 GB of memory.
> 
> When I set the number of iterations to 10 it works fine.
> When I set the number of iterations to 20 I get a StackOverflow error.
> What's the solution?
> 
> thanks
> 
> 
> 
> 
> 
> Sent from Samsung tablet







Re: Jupyter spark Scala notebooks

2017-05-17 Thread Mark Vervuurt
Hi Upendra, 

I got toree to work and I described it in the following JIRA issue. 
See the last comment of the issue.
https://issues.apache.org/jira/browse/TOREE-336 
<https://issues.apache.org/jira/browse/TOREE-336> 

Mark

> On 18 May 2017, at 04:22, upendra 1991 <upendra1...@yahoo.com.INVALID> wrote:
> 
> What's the best way to use jupyter with Scala spark. I tried Apache toree and 
> created a kernel but did not get it working. I believe there is a better way.
> 
> Please suggest any best practices.
> 
> Sent from Yahoo Mail on Android 
> <https://overview.mail.yahoo.com/mobile/?.src=Android>



Re: Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread Mark Hamstra
Looks to me like it is a conflict between a Databricks library and Spark
2.1. That's an issue for Databricks to resolve or provide guidance.

On Tue, May 9, 2017 at 2:36 PM, lucas.g...@gmail.com <lucas.g...@gmail.com>
wrote:

> I'm a bit confused by that answer, I'm assuming it's spark deciding which
> lib to use.
>
> On 9 May 2017 at 14:30, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> This looks more like a matter for Databricks support than spark-user.
>>
>> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com <
>> lucas.g...@gmail.com> wrote:
>>
>>> df = spark.sqlContext.read.csv('out/df_in.csv')
>>>>
>>>
>>>
>>>> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
>>>> metastore. hive.metastore.schema.verification is not enabled so
>>>> recording the schema version 1.2.0
>>>> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default,
>>>> returning NoSuchObjectException
>>>> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database global_temp,
>>>> returning NoSuchObjectException
>>>>
>>>
>>>
>>>> Py4JJavaError: An error occurred while calling o72.csv.
>>>> : java.lang.RuntimeException: Multiple sources found for csv 
>>>> (*com.databricks.spark.csv.DefaultSource15,
>>>> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat*), please
>>>> specify the fully qualified class name.
>>>> at scala.sys.package$.error(package.scala:27)
>>>> at org.apache.spark.sql.execution.datasources.DataSource$.looku
>>>> pDataSource(DataSource.scala:591)
>>>> at org.apache.spark.sql.execution.datasources.DataSource.provid
>>>> ingClass$lzycompute(DataSource.scala:86)
>>>> at org.apache.spark.sql.execution.datasources.DataSource.provid
>>>> ingClass(DataSource.scala:86)
>>>> at org.apache.spark.sql.execution.datasources.DataSource.resolv
>>>> eRelation(DataSource.scala:325)
>>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>>>> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>>>> ssorImpl.java:57)
>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>>>> thodAccessorImpl.java:43)
>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>>>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>>>> at py4j.Gateway.invoke(Gateway.java:280)
>>>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>>>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>>> at py4j.GatewayConnection.run(GatewayConnection.java:214) at
>>>> java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> When I change our call to:
>>>
>>> df = spark.hiveContext.read \
>>> .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')
>>> \
>>> .load('df_in.csv)
>>>
>>> No such issue, I was under the impression (obviously wrongly) that spark
>>> would automatically pick the local lib.  We have the databricks library
>>> because other jobs still explicitly call it.
>>>
>>> Is the 'correct answer' to go through and modify so as to remove the
>>> databricks lib / remove it from our deploy?  Or should this just work?
>>>
>>> One of the things I find less helpful in the spark docs are when there's
>>> multiple ways to do it but no clear guidance on what those methods are
>>> intended to accomplish.
>>>
>>> Thanks!
>>>
>>
>>
>


Re: Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread Mark Hamstra
This looks more like a matter for Databricks support than spark-user.

On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com 
wrote:

> df = spark.sqlContext.read.csv('out/df_in.csv')
>>
>
>
>> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
>> metastore. hive.metastore.schema.verification is not enabled so
>> recording the schema version 1.2.0
>> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default,
>> returning NoSuchObjectException
>> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database global_temp,
>> returning NoSuchObjectException
>>
>
>
>> Py4JJavaError: An error occurred while calling o72.csv.
>> : java.lang.RuntimeException: Multiple sources found for csv 
>> (*com.databricks.spark.csv.DefaultSource15,
>> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat*), please
>> specify the fully qualified class name.
>> at scala.sys.package$.error(package.scala:27)
>> at org.apache.spark.sql.execution.datasources.
>> DataSource$.lookupDataSource(DataSource.scala:591)
>> at org.apache.spark.sql.execution.datasources.DataSource.providingClass$
>> lzycompute(DataSource.scala:86)
>> at org.apache.spark.sql.execution.datasources.DataSource.providingClass(
>> DataSource.scala:86)
>> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(
>> DataSource.scala:325)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(
>> NativeMethodAccessorImpl.java:57)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
>> DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>> at py4j.Gateway.invoke(Gateway.java:280)
>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>> at py4j.GatewayConnection.run(GatewayConnection.java:214) at
>> java.lang.Thread.run(Thread.java:745)
>
>
> When I change our call to:
>
> df = spark.hiveContext.read \
> .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')
> \
> .load('df_in.csv)
>
> No such issue, I was under the impression (obviously wrongly) that spark
> would automatically pick the local lib.  We have the databricks library
> because other jobs still explicitly call it.
>
> Is the 'correct answer' to go through and modify so as to remove the
> databricks lib / remove it from our deploy?  Or should this just work?
>
> One of the things I find less helpful in the spark docs are when there's
> multiple ways to do it but no clear guidance on what those methods are
> intended to accomplish.
>
> Thanks!
>


Re: scalastyle violation on mvn install but not on mvn package

2017-05-04 Thread Mark Hamstra
The check goal of the scalastyle plugin runs during the "verify" phase,
which is between "package" and "install"; so running just "package" will
not run scalastyle:check.

On Thu, May 4, 2017 at 7:45 AM, yiskylee  wrote:

> ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
> package
> works, but
> ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
> install
> triggers scalastyle violation error.
>
> Is the scalastyle check not used on package but only on install? To
> install,
> should I turn off "failOnViolation" in the pom?
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/scalastyle-violation-on-mvn-install-but-not-on-mvn-
> package-tp28653.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Securing Spark Job on Cluster

2017-04-28 Thread Mark Hamstra
spark.local.dir

http://spark.apache.org/docs/latest/configuration.html
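
Two settings are relevant here: spark.local.dir controls where shuffle and
spill files land (so it can point at an encrypted volume), and
spark.io.encryption.enabled (available since Spark 2.1) encrypts spilled and
shuffled data itself. A minimal sketch of setting both; the mount point is a
placeholder, and whether this meets a given compliance bar is something to
verify:

// Minimal sketch: keep Spark's scratch files on a chosen (e.g. encrypted)
// volume and enable I/O encryption for shuffle and spill data.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.local.dir", "/encrypted-scratch/spark")   // placeholder mount point
  .set("spark.io.encryption.enabled", "true")           // encrypts spill/shuffle files
// Enabling RPC protection (spark.authenticate, spark.network.crypto.enabled) is
// commonly recommended alongside this so keys and data also move over protected channels.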

On Fri, Apr 28, 2017 at 8:51 AM, Shashi Vishwakarma <
shashi.vish...@gmail.com> wrote:

> Yes, I am using HDFS. Just trying to understand a couple of points.
>
> There would be two kind of encryption which would be required.
>
> 1. Data in Motion - This could be achieved by enabling SSL -
> https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.
> 0/bk_spark-component-guide/content/spark-encryption.html
>
> 2. Data at Rest - HDFS Encryption can be applied.
>
> Apart from this, when Spark executes a job, each disk available on every
> node needs to be encrypted.
>
> I can have multiple disks on each node, and encrypting all of them could be
> a costly operation. Therefore I was trying to identify, during job execution,
> the possible folders where Spark can spill data.
>
> Once these locations are identified, those specific disks can be encrypted.
>
> Thanks
> Shashi
>
>
>
>
> On Fri, Apr 28, 2017 at 4:34 PM, Jörn Franke  wrote:
>
>> Why don't you use whole disk encryption?
>> Are you using HDFS?
>>
>> On 28. Apr 2017, at 16:57, Shashi Vishwakarma 
>> wrote:
>>
>> Agreed Jorn. Disk encryption is one option that will help to secure data
>> but how do I know at which location Spark is spilling temp file, shuffle
>> data and application data ?
>>
>> Thanks
>> Shashi
>>
>> On Fri, Apr 28, 2017 at 3:54 PM, Jörn Franke 
>> wrote:
>>
>>> You can use disk encryption as provided by the operating system.
>>> Additionally, you may think about shredding disks after they are not used
>>> anymore.
>>>
>>> > On 28. Apr 2017, at 14:45, Shashi Vishwakarma <
>>> shashi.vish...@gmail.com> wrote:
>>> >
>>> > Hi All
>>> >
>>> > I was dealing with one the spark requirement here where Client (like
>>> Banking Client where security is major concern) needs all spark processing
>>> should happen securely.
>>> >
>>> > For example all communication happening between spark client and
>>> server ( driver & executor communication) should be on secure channel. Even
>>> when spark spills on disk based on storage level (Mem+Disk), it should not
>>> be written in un-encrypted format on local disk or there should be some
>>> workaround to prevent spill.
>>> >
>>> > I did some research  but could not get any concrete solution.Let me
>>> know if someone has done this.
>>> >
>>> > Any guidance would be a great help.
>>> >
>>> > Thanks
>>> > Shashi
>>>
>>
>>
>


Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Mark Hamstra
`spark-submit` creates a new Application that will need to get resources
from YARN. Spark's scheduler pools will determine how those resources are
allocated among whatever Jobs run within the new Application.

Spark's scheduler pools are only relevant when you are submitting multiple
Jobs within a single Application (i.e., you are using the same SparkContext
to launch multiple Jobs) and you have used SparkContext#setLocalProperty to
set "spark.scheduler.pool" to something other than the default pool before
a particular Job intended to use that pool is started via that SparkContext.
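
A minimal sketch of that within-application usage; the pool name is a
placeholder and must match a pool defined in the fair scheduler allocation
file (spark.scheduler.allocation.file):

// Minimal sketch: two jobs in one application, steered into different
// fair-scheduler pools via a thread-local property on the SparkContext.
val sc = spark.sparkContext   // assumes an existing SparkSession named spark

sc.setLocalProperty("spark.scheduler.pool", "interactive")   // placeholder pool name
spark.range(1000000L).count()                                // this Job runs in "interactive"

sc.setLocalProperty("spark.scheduler.pool", null)            // back to the default pool
spark.range(1000000L).count()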

On Wed, Apr 5, 2017 at 1:11 PM, Nicholas Chammas <nicholas.cham...@gmail.com
> wrote:

> Hmm, so when I submit an application with `spark-submit`, I need to
> guarantee it resources using YARN queues and not Spark's scheduler pools.
> Is that correct?
>
> When are Spark's scheduler pools relevant/useful in this context?
>
> On Wed, Apr 5, 2017 at 3:54 PM Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> grrr... s/your/you're/
>>
>> On Wed, Apr 5, 2017 at 12:54 PM, Mark Hamstra <m...@clearstorydata.com>
>> wrote:
>>
>> Your mixing up different levels of scheduling. Spark's fair scheduler
>> pools are about scheduling Jobs, not Applications; whereas YARN queues with
>> Spark are about scheduling Applications, not Jobs.
>>
>> On Wed, Apr 5, 2017 at 12:27 PM, Nick Chammas <nicholas.cham...@gmail.com
>> > wrote:
>>
>> I'm having trouble understanding the difference between Spark fair
>> scheduler pools
>> <https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools>
>> and YARN queues
>> <https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html>.
>> Do they conflict? Does one override the other?
>>
>> I posted a more detailed question about an issue I'm having with this on
>> Stack Overflow: http://stackoverflow.com/q/43239921/877069
>>
>> Nick
>>
>>
>> --
>> View this message in context: Spark fair scheduler pools vs. YARN queues
>> <http://apache-spark-user-list.1001560.n3.nabble.com/Spark-fair-scheduler-pools-vs-YARN-queues-tp28572.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>>
>>
>>


Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Mark Hamstra
grrr... s/your/you're/

On Wed, Apr 5, 2017 at 12:54 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> Your mixing up different levels of scheduling. Spark's fair scheduler
> pools are about scheduling Jobs, not Applications; whereas YARN queues with
> Spark are about scheduling Applications, not Jobs.
>
> On Wed, Apr 5, 2017 at 12:27 PM, Nick Chammas <nicholas.cham...@gmail.com>
> wrote:
>
>> I'm having trouble understanding the difference between Spark fair
>> scheduler pools
>> <https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools>
>> and YARN queues
>> <https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html>.
>> Do they conflict? Does one override the other?
>>
>> I posted a more detailed question about an issue I'm having with this on
>> Stack Overflow: http://stackoverflow.com/q/43239921/877069
>>
>> Nick
>>
>>
>> --
>> View this message in context: Spark fair scheduler pools vs. YARN queues
>> <http://apache-spark-user-list.1001560.n3.nabble.com/Spark-fair-scheduler-pools-vs-YARN-queues-tp28572.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>
>


Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Mark Hamstra
Your mixing up different levels of scheduling. Spark's fair scheduler pools
are about scheduling Jobs, not Applications; whereas YARN queues with Spark
are about scheduling Applications, not Jobs.

On Wed, Apr 5, 2017 at 12:27 PM, Nick Chammas 
wrote:

> I'm having trouble understanding the difference between Spark fair
> scheduler pools
> 
> and YARN queues
> .
> Do they conflict? Does one override the other?
>
> I posted a more detailed question about an issue I'm having with this on
> Stack Overflow: http://stackoverflow.com/q/43239921/877069
>
> Nick
>
>
> --
> View this message in context: Spark fair scheduler pools vs. YARN queues
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: Spark shuffle files

2017-03-27 Thread Mark Hamstra
When the RDD using them goes out of scope.
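
A loose illustration of "out of scope", assuming an existing SparkContext `sc` and a
placeholder HDFS path: while `grouped` is still reachable its shuffle files are kept so
later jobs can reuse them; once the reference is dropped and garbage collected, the
ContextCleaner removes the files asynchronously.

def runOnce(sc: org.apache.spark.SparkContext): Long = {
  val grouped = sc.textFile("hdfs:///logs")
    .map(line => (line.take(4), 1))
    .reduceByKey(_ + _)   // this stage writes shuffle files on the executors' local disks
  grouped.count()
} // after returning, `grouped` is unreferenced; its shuffle files become eligible for cleanup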

On Mon, Mar 27, 2017 at 3:13 PM, Ashwin Sai Shankar <ashan...@netflix.com>
wrote:

> Thanks Mark! follow up question, do you know when shuffle files are
> usually un-referenced?
>
> On Mon, Mar 27, 2017 at 2:35 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> Shuffle files are cleaned when they are no longer referenced. See
>> https://github.com/apache/spark/blob/master/core/src/mai
>> n/scala/org/apache/spark/ContextCleaner.scala
>>
>> On Mon, Mar 27, 2017 at 12:38 PM, Ashwin Sai Shankar <
>> ashan...@netflix.com.invalid> wrote:
>>
>>> Hi!
>>>
>>> In spark on yarn, when are shuffle files on local disk removed? (Is it
>>> when the app completes or
>>> once all the shuffle files are fetched or end of the stage?)
>>>
>>> Thanks,
>>> Ashwin
>>>
>>
>>
>


Re: Spark shuffle files

2017-03-27 Thread Mark Hamstra
Shuffle files are cleaned when they are no longer referenced. See
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala

On Mon, Mar 27, 2017 at 12:38 PM, Ashwin Sai Shankar <
ashan...@netflix.com.invalid> wrote:

> Hi!
>
> In spark on yarn, when are shuffle files on local disk removed? (Is it
> when the app completes or
> once all the shuffle files are fetched or end of the stage?)
>
> Thanks,
> Ashwin
>


Re: Can't transform RDD for the second time

2017-02-28 Thread Mark Hamstra
foreachPartition is not a transformation; it is an action. If you want to
transform an RDD using an iterator in each partition, then use
mapPartitions.
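
A short sketch of the difference, using placeholder helpers (`computeProfile` and
`saveToCassandra` are hypothetical) and the `credentialsRDD` from the code below; note
that the iterator handed to either call can only be consumed once:

// mapPartitions is a transformation: it returns a new RDD built from each partition's iterator.
val profiles = credentialsRDD.mapPartitions { credentials =>
  credentials.map(credentialId => (credentialId, computeProfile(credentialId)))
}
// foreachPartition is an action: it only runs side effects and returns Unit.
profiles.foreachPartition { rows =>
  rows.foreach(row => saveToCassandra(row))
}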

On Tue, Feb 28, 2017 at 8:17 PM, jeremycod  wrote:

> Hi,
>
> I'm trying to transform one RDD two times. I'm using foreachPartition and
> inside it I have two map transformations. The first time, it works fine and
> I get results, but the second time I call map on it, it behaves as if the RDD
> has no elements.
> This is my code:
>
> val credentialsIdsScala: Seq[java.lang.Long] =
> credentialsIds.asScala.toSeq
> println("ALL CREDENTIALS:" + credentialsIdsScala.mkString(","))
>
>
> val credentialsRDD: RDD[Long] = sc.parallelize(credentialsIdsScala.map
> {
> Long2long })
> val connector = CassandraConnector(sc.getConf)
> credentialsRDD.foreachPartition {
>   credentials => {
> val userCourseKMeansProfiles: Iterator[Iterable[Tuple5[Long,
> String,
> Long, Long, String]]] = credentials.map { credentialid =>
>   println("RUNNING USER PROFILE CLUSTERING FOR CREDENTIAL:" +
> credentialid)
>   val userCourseProfile: Iterable[Tuple5[Long, String, Long, Long,
> String]] = runPeriodicalKMeansClustering(dbName, days, numClusters,
> numFeatures, credentialid)
>   userCourseProfile
> }
> userCourseKMeansProfiles.foreach(userProfile => {
>   val query = "INSERT INTO " + dbName + "." +
> TablesNames.PROFILE_USERQUARTILE_FEATURES_BYPROFILE + "(course,
> profile,date, userid, sequence) VALUES (?, ?, ?,?,?) ";
>   connector.withSessionDo {
> session => {
>   userProfile.foreach(record => {
> println("USER PROFILE RECORD:" + record._1 + " " +
> record._2
> + " " + record._3 + " " + record._4 + " " + record._5)
> session.execute(query,
> record._1.asInstanceOf[java.lang.Long], record._2.asInstanceOf[String],
> record._3.asInstanceOf[java.lang.Long],
> record._4.asInstanceOf[java.lang.Long], record._5.asInstanceOf[String])
>   })
> }
>   }
> })
> val secondMapping = credentials.map {
>   credentialid =>
> println("credential id:" + credentialid)
> credentialid
> }
> secondMapping.foreach(cid=>println("credentialid:"+cid))
> println("Second mapping:" + secondMapping.length)
>   }
>
> Could someone explain me what is wrong with my code and how to fix it?
>
> Thanks,
> Zoran
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Can-t-transform-RDD-for-the-second-time-tp28441.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Fwd: Will Spark ever run the same task at the same time

2017-02-20 Thread Mark Hamstra
First, the word you are looking for is "straggler", not "strangler" -- very
different words. Second, "idempotent" doesn't mean "only happens once", but
rather "if it does happen more than once, the effect is no different than
if it only happened once".

It is possible to insert a nearly limitless variety of side-effecting code
into Spark Tasks, and there is no guarantee from Spark that such code will
execute idempotently. Speculation is one way that a Task can run more than
once, but it is not the only way. A simple FetchFailure (from a lost
Executor or another reason) will mean that a Task has to be re-run in order
to re-compute the missing outputs from a prior execution. In general, Spark
will run a Task as many times as needed to satisfy the requirements of the
Jobs it is requested to fulfill, and you can assume neither that a Task
will run only once nor that it will execute idempotently (unless, of
course, it is side-effect free). Guaranteeing idempotency requires a higher
level coordinator with access to information on all Task executions. The
OutputCommitCoordinator handles that guarantee for HDFS writes, and the
JIRA discussion associated with the introduction of
the OutputCommitCoordinator covers most of the design issues:
https://issues.apache.org/jira/browse/SPARK-4879
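
For illustration only (not a recommendation from this thread), one common way to make a
side-effecting task tolerate re-execution is to write to an attempt-specific temporary
file and rename it on success, so a re-run overwrites rather than appends. `records` is
a hypothetical RDD[String] and the output directory is a placeholder:

import org.apache.spark.TaskContext

records.foreachPartition { rows =>
  val ctx    = TaskContext.get
  val tmp    = new java.io.File(s"/data/out/part-${ctx.partitionId}.attempt-${ctx.attemptNumber}")
  val target = new java.io.File(s"/data/out/part-${ctx.partitionId}")
  val writer = new java.io.PrintWriter(tmp)
  try rows.foreach(r => writer.println(r)) finally writer.close()
  tmp.renameTo(target)   // the last successful attempt wins; duplicates are not appended
}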

On Thu, Feb 16, 2017 at 10:34 AM, Ji Yan  wrote:

> Dear spark users,
>
> Is there any mechanism in Spark that does not guarantee the idempotent
> nature? For example, for stranglers, the framework might start another task
> assuming the strangler is slow while the strangler is still running. This
> would be annoying sometime when say the task is writing to a file, but have
> the same tasks running at the same time may corrupt the file. From the
> documentation page, I know that Spark's speculative execution mode is
> turned off by default. Does anyone know any other mechanism in Spark that
> may cause problem in scenario like this?
>
> Thanks
> Ji
>
> The information in this email is confidential and may be legally
> privileged. It is intended solely for the addressee. Access to this email
> by anyone else is unauthorized. If you are not the intended recipient, any
> disclosure, copying, distribution or any action taken or omitted to be
> taken in reliance on it, is prohibited and may be unlawful.
>


Re: is dataframe thread safe?

2017-02-13 Thread Mark Hamstra
If you update the data, then you don't have the same DataFrame anymore. If
you don't do like Assaf did, caching and forcing evaluation of the
DataFrame before using that DataFrame concurrently, then you'll still get
consistent and correct results, but not necessarily efficient results. If
the fully materialized, cached results are not yet available when multiple
concurrent Jobs try to use the DataFrame, then you can end up with more
than one Job doing the same work to generate what needs to go in the cache.
To avoid that kind of work duplication you need some mechanism to ensure
that only one action/Job is run to populate the cache before multiple
actions/Jobs can then use the cached results efficiently.
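
A minimal sketch of the cache-then-share approach, assuming a SparkSession named `spark`;
the path and column names are placeholders:

import java.util.concurrent.Executors

val df = spark.read.parquet("/data/events").cache()
df.count()   // force full materialization of the cache once, before any concurrent use

val pool = Executors.newFixedThreadPool(2)
pool.submit(new Runnable { def run(): Unit = df.filter("kind = 'click'").count() })
pool.submit(new Runnable { def run(): Unit = df.groupBy("kind").count().collect() })
pool.shutdown()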

On Mon, Feb 13, 2017 at 9:15 AM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> How about having a thread that update and cache a dataframe in-memory next
> to other threads requesting this dataframe, is it thread safe ?
>
> 2017-02-13 9:02 GMT+01:00 Reynold Xin :
>
>> Yes your use case should be fine. Multiple threads can transform the same
>> data frame in parallel since they create different data frames.
>>
>>
>> On Sun, Feb 12, 2017 at 9:07 AM Mendelson, Assaf 
>> wrote:
>>
>>> Hi,
>>>
>>> I was wondering if dataframe is considered thread safe. I know the spark
>>> session and spark context are thread safe (and actually have tools to
>>> manage jobs from different threads) but the question is, can I use the same
>>> dataframe in both threads.
>>>
>>> The idea would be to create a dataframe in the main thread and then in
>>> two sub threads do different transformations and actions on it.
>>>
>>> I understand that some things might not be thread safe (e.g. if I
>>> unpersist in one thread it would affect the other. Checkpointing would
>>> cause similar issues), however, I can’t find any documentation as to what
>>> operations (if any) are thread safe.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Assaf.
>>>
>>
>


Re: can I use Spark Standalone with HDFS but no YARN

2017-02-03 Thread Mark Hamstra
yes
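
A bare-bones sketch of that combination (standalone cluster manager, HDFS storage, no
YARN anywhere); hostnames and ports are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://standalone-master:7077")   // Spark standalone master instead of "yarn"
  .setAppName("hdfs-without-yarn")
val sc = new SparkContext(conf)
sc.textFile("hdfs://namenode:8020/data/input.txt").count()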

On Fri, Feb 3, 2017 at 10:08 PM, kant kodali  wrote:

> can I use Spark Standalone with HDFS but no YARN?
>
> Thanks!
>


Re: Having multiple spark context

2017-01-29 Thread Mark Hamstra
More than one Spark Context in a single Application is not supported.

On Sun, Jan 29, 2017 at 9:08 PM,  wrote:

> Hi,
>
>
>
> I have a requirement in which, my application creates one Spark context in
> Distributed mode whereas another Spark context in local mode.
>
> When I am creating this, my complete application is working on only one
> SparkContext (created in Distributed mode). Second spark context is not
> getting created.
>
>
>
> Can you please help me out in how to create two spark contexts.
>
>
>
> Regards,
>
> Jasbir singh
>
> --
>
> This message is for the designated recipient only and may contain
> privileged, proprietary, or otherwise confidential information. If you have
> received it in error, please notify the sender immediately and delete the
> original. Any other use of the e-mail by you is prohibited. Where allowed
> by local law, electronic communications with Accenture and its affiliates,
> including e-mail and instant messaging (including content), may be scanned
> by our systems for the purposes of information security and assessment of
> internal compliance with Accenture policy.
> 
> __
>
> www.accenture.com
>


Re: DAG Visualization option is missing on Spark Web UI

2017-01-28 Thread Mark Hamstra
Try selecting a particular Job instead of looking at the summary page for
all Jobs.

On Sat, Jan 28, 2017 at 4:25 PM, Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:

> Hi Jacek,
>
> I tried accessing Spark web UI on both Firefox and Google Chrome browsers
> with ad blocker enabled. I do see other options like* User, Total Uptime,
> Scheduling Mode, **Active Jobs, Completed Jobs and* Event Timeline.
> However, I don't see an option for DAG visualization.
>
> Please note that I am experiencing the same issue with Spark 2.x (i.e.
> 2.0.0, 2.0.1, 2.0.2 and 2.1.0). Refer the attached screenshot of the UI
> that I am seeing on my machine:
>
> [image: Inline images 1]
>
>
> Please suggest.
>
>
>
>
> Regards,
> _
> *Md. Rezaul Karim*, BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> 
>
> On 28 January 2017 at 18:51, Jacek Laskowski  wrote:
>
>> Hi,
>>
>> Wonder if you have any adblocker enabled in your browser? Is this the
>> only version giving you this behavior? All Spark jobs have no
>> visualization?
>>
>> Jacek
>>
>> On 28 Jan 2017 7:03 p.m., "Md. Rezaul Karim" <
>> rezaul.ka...@insight-centre.org> wrote:
>>
>> Hi All,
>>
>> I am running a Spark job on my local machine written in Scala with Spark
>> 2.1.0. However, I am not seeing any option of "*DAG Visualization*" at 
>> http://localhost:4040/jobs/
>>
>>
>> Suggestion, please.
>>
>>
>>
>>
>> Regards,
>> _
>> *Md. Rezaul Karim*, BSc, MSc
>> PhD Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>> 
>>
>>
>>
>


Re:

2017-01-21 Thread Mark Hamstra
I wouldn't say that Executors are dumb, but there are some pretty clear
divisions of concepts and responsibilities across the different pieces of
the Spark architecture. A Job is a concept that is completely unknown to an
Executor, which deals instead with just the Tasks that it is given.  So you
are correct, Jacek, that any notification of a Job end has to come from the
Driver.
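
A rough sketch of that driver-side approach, assuming an existing SparkContext `sc`:
jobs tagged with a "work" group trigger a small follow-up job whose closures run the
cleanup on the executors. `cleanupLocalTempFiles` is a hypothetical helper, and in a real
application you would likely hand the cleanup off to another thread rather than run it
inside the listener callback.

import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

val workJobIds = mutable.Set[Int]()
sc.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val group = Option(jobStart.properties).map(_.getProperty("spark.jobGroup.id")).orNull
    if (group == "work") workJobIds.synchronized { workJobIds += jobStart.jobId }
  }
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    val wasWork = workJobIds.synchronized { workJobIds.remove(jobEnd.jobId) }
    if (wasWork) {
      // The cleanup job carries no "work" group, so its own completion does not retrigger this.
      sc.parallelize(0 until sc.defaultParallelism, sc.defaultParallelism)
        .foreachPartition(_ => cleanupLocalTempFiles())
    }
  }
})

// Application code tags its real jobs:
sc.setJobGroup("work", "main processing")
sc.textFile("hdfs:///data/in").count()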

On Sat, Jan 21, 2017 at 2:10 AM, Jacek Laskowski  wrote:

> Executors are "dumb", i.e. they execute TaskRunners for tasks and...that's
> it.
>
> Your logic should be on the driver that can intercept events
> and...trigger cleanup.
>
> I don't think there's another way to do it.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Jan 20, 2017 at 10:47 PM, Keith Chapman 
> wrote:
> > Hi Jacek,
> >
> > I've looked at SparkListener and tried it, I see it getting fired on the
> > master but I don't see it getting fired on the workers in a cluster.
> >
> > Regards,
> > Keith.
> >
> > http://keith-chapman.com
> >
> > On Fri, Jan 20, 2017 at 11:09 AM, Jacek Laskowski 
> wrote:
> >>
> >> Hi,
> >>
> >> (redirecting to users as it has nothing to do with Spark project
> >> development)
> >>
> >> Monitor jobs and stages using SparkListener and submit cleanup jobs
> where
> >> a condition holds.
> >>
> >> Jacek
> >>
> >> On 20 Jan 2017 3:57 a.m., "Keith Chapman" 
> wrote:
> >>>
> >>> Hi ,
> >>>
> >>> Is it possible for an executor (or slave) to know when an actual job
> >>> ends? I'm running spark on a cluster (with yarn) and my workers create
> some
> >>> temporary files that I would like to clean up once the job ends. Is
> there a
> >>> way for the worker to detect that a job has finished? I tried doing it
> in
> >>> the JobProgressListener but it does not seem to work in a cluster. The
> event
> >>> is not triggered in the worker.
> >>>
> >>> Regards,
> >>> Keith.
> >>>
> >>> http://keith-chapman.com
> >
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark and Kafka integration

2017-01-12 Thread Mark Hamstra
See "API compatibility" in http://spark.apache.org/versioning-policy.html

While code that is annotated as Experimental is still a good faith effort
to provide a stable and useful API, the fact is that we're not yet
confident enough that we've got the public API in exactly the form that we
want to commit to maintaining until at least the next major release.  That
means that the API may change in the next minor/feature-level release (but
it shouldn't in a patch/bugfix-level release), which would require that
your source code be rewritten to use the new API.  In the most extreme
case, we may decide that the experimental code didn't work out the way we
wanted, so it could be withdrawn entirely.  Complete withdrawal of the
Kafka code is unlikely, but it may well change in incompatible way with
future releases even before Spark 3.0.0.

On Thu, Jan 12, 2017 at 5:57 AM, Phadnis, Varun 
wrote:

> Hello,
>
>
>
> We are using  Spark 2.0 with Kafka 0.10.
>
>
>
> As I understand, much of the API packaged in the following dependency we
> are targeting is marked as “@Experimental”
>
>
>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
>
>
>
> What are implications of this being marked as experimental? Are they
> stable enough for production?
>
>
>
> Thanks,
>
> Varun
>
>
>


Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-28 Thread Mark Hamstra
A SNAPSHOT build is not a stable artifact, but rather floats to the top of
commits that are intended for the next release.  So, 2.1.1-SNAPSHOT comes
after the 2.1.0 release and contains any code at the time that the artifact
was built that was committed to the branch-2.1 maintenance branch and is,
therefore, intended for the eventual 2.1.1 maintenance release.  Once a
release is tagged and stable artifacts for it can be built, there is no
purpose for a SNAPSHOT of that release -- e.g. there is no longer any
purpose for a 2.1.0-SNAPSHOT release; if you want 2.1.0, then you should be
using stable artifacts now, not SNAPSHOTs.

The existence of a SNAPSHOT doesn't imply anything about the release date
of the associated finished version.  Rather, it only indicates a name that
is attached to all of the code that is currently intended for the
associated release number.
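
In build.sbt terms, a sketch of the difference: the released artifact comes from Maven
Central, while a SNAPSHOT coordinate would need the Apache snapshots resolver and changes
underneath you as new commits land on the maintenance branch.

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided"
)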

On Wed, Dec 28, 2016 at 3:09 PM, Justin Miller <
justin.mil...@protectwise.com> wrote:

> It looks like the jars for 2.1.0-SNAPSHOT are gone?
>
> https://repository.apache.org/content/groups/snapshots/org/
> apache/spark/spark-sql_2.11/2.1.0-SNAPSHOT/
>
> Also:
>
> 2.1.0-SNAPSHOT/
> <https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.1.0-SNAPSHOT/>
>  Fri
> Dec 23 16:31:42 UTC 2016
> 2.1.1-SNAPSHOT/
> <https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.1.1-SNAPSHOT/>
>  Wed
> Dec 28 20:01:10 UTC 2016
> 2.2.0-SNAPSHOT/
> <https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.2.0-SNAPSHOT/>
>  Wed
> Dec 28 19:12:38 UTC 2016
>
> What's with 2.1.1-SNAPSHOT? Is that version about to be released as well?
>
> Thanks!
> Justin
>
> On Dec 28, 2016, at 12:53 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
> The v2.1.0 tag is there: https://github.com/apache/spark/tree/v2.1.0
>
> On Wed, Dec 28, 2016 at 2:04 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> seems like the artifacts are on maven central but the website is not yet
>> updated.
>>
>> strangely the tag v2.1.0 is not yet available on github. i assume its
>> equal to v2.1.0-rc5
>>
>> On Fri, Dec 23, 2016 at 10:52 AM, Justin Miller <
>> justin.mil...@protectwise.com> wrote:
>>
>>> I'm curious about this as well. Seems like the vote passed.
>>>
>>> > On Dec 23, 2016, at 2:00 AM, Aseem Bansal <asmbans...@gmail.com>
>>> wrote:
>>> >
>>> >
>>>
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>
>


Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-28 Thread Mark Hamstra
The v2.1.0 tag is there: https://github.com/apache/spark/tree/v2.1.0

On Wed, Dec 28, 2016 at 2:04 PM, Koert Kuipers  wrote:

> seems like the artifacts are on maven central but the website is not yet
> updated.
>
> strangely the tag v2.1.0 is not yet available on github. i assume its
> equal to v2.1.0-rc5
>
> On Fri, Dec 23, 2016 at 10:52 AM, Justin Miller <
> justin.mil...@protectwise.com> wrote:
>
>> I'm curious about this as well. Seems like the vote passed.
>>
>> > On Dec 23, 2016, at 2:00 AM, Aseem Bansal  wrote:
>> >
>> >
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Infinite Loop in Spark

2016-10-27 Thread Mark Hamstra
Using a single SparkContext for an extended period of time is how
long-running Spark Applications such as the Spark Job Server work (
https://github.com/spark-jobserver/spark-jobserver).  It's an established
pattern.
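
A minimal sketch of that long-lived-context pattern, with `doStuff` standing in for the
application's periodic Spark actions:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("periodic-app"))
while (true) {
  doStuff(sc)                    // run this round's jobs on the shared context
  Thread.sleep(10 * 60 * 1000L)  // wait before the next round
}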

On Thu, Oct 27, 2016 at 11:46 AM, Gervásio Santos  wrote:

> Hi guys!
>
> I'm developing an application in Spark that I'd like to run continuously.
> It would execute some actions, sleep for a while and go again. I was
> thinking of doing it in a standard infinite loop way.
>
> val sc = 
> while (true) {
>   doStuff(...)
>   sleep(...)
> }
>
> I would be running this (fairly light weight) application on a cluster,
> that would also run other (significantly heavier) jobs. However, I fear
> that this kind of code might lead to unexpected beahavior; I don't know if
> keeping the same SparkContext active continuously for a very long time
> might lead to some weird stuff happening.
>
> Can anyone tell me if there is some problem with not "renewing" the Spark
> context or is aware of any problmes with this approach that I might be
> missing?
>
> Thanks!
>


Re: previous stage results are not saved?

2016-10-17 Thread Mark Hamstra
There is no need to do that if 1) the stage that you are concerned with
either made use of or produced MapOutputs/shuffle files; 2) reuse of those
shuffle files (which may very well be in the OS buffer cache of the worker
nodes) is sufficient for your needs; 3) the relevant Stage objects haven't
gone out of scope, which would allow the shuffle files to be removed; 4)
you reuse the exact same Stage objects that were used previously.  If all
of that is true, then Spark will re-use the prior stage with performance
very similar to if you had explicitly cached an equivalent RDD.
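
An illustrative sketch of condition (4) in the shell, with `left` and `right` as
placeholder pair RDDs: keeping and reusing the same RDD value lets later actions skip the
shuffle-producing stages (they show up as "skipped stages" in the UI), whereas rebuilding
the lineage from scratch would recompute them.

val joined = left.join(right)   // this stage writes shuffle files
joined.count()                  // first action: runs all upstream stages
joined.values.count()           // second action: reuses the existing shuffle output
// For reuse that does not depend on shuffle files sticking around, cache explicitly:
// joined.cache()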

On Mon, Oct 17, 2016 at 4:53 PM, ayan guha  wrote:

> You can use cache or persist.
>
> On Tue, Oct 18, 2016 at 10:11 AM, Yang  wrote:
>
>> I'm trying out 2.0, and ran a long job with 10 stages, in spark-shell
>>
>> it seems that after all 10 finished successfully, if I run the last, or
>> the 9th again,
>> spark reruns all the previous stages from scratch, instead of utilizing
>> the partial results.
>>
>> this is quite serious since I can't experiment while making small changes
>> to the code.
>>
>> any idea what part of the spark framework might have caused this ?
>>
>> thanks
>> Yang
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Is executor computing time affected by network latency?

2016-09-23 Thread Mark Hamstra
>
> The best network results are achieved when Spark nodes share the same
> hosts as Hadoop or they happen to be on the same subnet.
>

That's only true for those portions of a Spark execution pipeline that are
actually reading from HDFS.  If you're re-using an RDD for which the needed
shuffle files are already available on Executor nodes or are looking at
stages of a Spark SQL query execution later than those reading from HDFS,
then data locality and network utilization concerns don't really have
anything to do with co-location of Executors and HDFS data nodes.

On Fri, Sep 23, 2016 at 1:31 PM, Mich Talebzadeh 
wrote:

> Does this assume that Spark is running on the same hosts as HDFS? Hence
> does increasing the latency affects the network latency on Hadoop nodes as
> well in your tests?
>
> The best network results are achieved when Spark nodes share the same
> hosts as Hadoop or they happen to be on the same subnet.
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 22 September 2016 at 14:54, gusiri  wrote:
>
>> Hi,
>>
>> When I increase the network latency among spark nodes,
>>
>> I see compute time (=executor computing time in Spark Web UI) also
>> increases.
>>
>> In the graph attached, left = latency 1ms vs right = latency 500ms.
>>
>> Is there any communication between worker and driver/master even 'during'
>> executor computing? or any idea on this result?
>>
>>
>> > n27779/Screen_Shot_2016-09-21_at_5.png>
>>
>>
>>
>>
>>
>> Thank you very much in advance.
>>
>> //gusiri
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Is-executor-computing-time-affected-
>> by-network-latency-tp27779.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Character encoding corruption in Spark JDBC connector

2016-09-13 Thread Mark Bittmann
Hello Spark community,

I'm reading from a MySQL database into a Spark dataframe using the JDBC
connector functionality, and I'm experiencing some character encoding
issues. The default encoding for MySQL strings is latin1, but the mysql
JDBC connector implementation of "ResultSet.getString()" will return a
mangled unicode encoding of the data for certain characters such as the
"all rights reserved" char. Instead, you can use "new
String(ResultSet.getBytes())" which will return the correctly encoded
string. I've confirmed this behavior with the mysql connector classes
(i.e., without using the Spark wrapper).

I can see here that the Spark JDBC connector uses getString(), though there
is a note to move to getBytes() for performance reasons:

https://github.com/apache/spark/blob/master/sql/core/
src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.
scala#L389

For some special chars, I can reverse the behavior with a UDF that applies
new String(badString.getBytes("Cp1252") , "UTF-8"), however for some
foreign characters the underlying byte array is irreversibly changed and
the data is corrupted.
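
For reference, a sketch of the UDF workaround described above (Spark 1.6-era DataFrame
API); `df` and the column name are placeholders, and as noted it only recovers some
characters:

import org.apache.spark.sql.functions.udf

val fixEncoding = udf { (s: String) =>
  if (s == null) null else new String(s.getBytes("Cp1252"), "UTF-8")
}
val fixed = df.withColumn("title", fixEncoding(df("title")))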

I can submit an issue/PR to fix it going forward if "new
String(ResultSet.getBytes())" is the correct approach.

Meanwhile, can anyone offer any recommendations on how to correct this
behavior prior to it getting to a dataframe? I've tried every permutation
of the settings in the JDBC connection url (characterSetResults,
characterEncoding).

I'm on Spark 1.6.

Thanks!


Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Mark Hamstra
It sounds like you should be writing an application and not trying to force
the spark-shell to do more than what it was intended for.
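
For those who do stay in the shell, the workaround discussed below amounts to stopping
the shell's session and rebuilding it with the extra configuration, rather than
constructing a bare SparkContext. A sketch, with illustrative elasticsearch-hadoop
settings (host and field names are placeholders):

spark.stop()   // stops the shell's SparkSession and its underlying SparkContext

val spark2 = org.apache.spark.sql.SparkSession.builder()
  .appName("es-read")
  .config("es.nodes", "es-host:9200")
  .config("es.read.field.as.array.include", "tags,labels")  // declare array fields explicitly
  .getOrCreate()
val sc2 = spark2.sparkContext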

On Tue, Sep 13, 2016 at 11:53 AM, Kevin Burton  wrote:

> I sort of agree but the problem is that some of this should be code.
>
> Some of our ES indexes have 100-200 columns.
>
> Defining which ones are arrays on the command line is going to get ugly
> fast.
>
>
>
> On Tue, Sep 13, 2016 at 11:50 AM, Sean Owen  wrote:
>
>> You would generally use --conf to set this on the command line if using
>> the shell.
>>
>>
>> On Tue, Sep 13, 2016, 19:22 Kevin Burton  wrote:
>>
>>> The problem is that without a new spark context, with a custom conf,
>>> elasticsearch-hadoop is refusing to read in settings about the ES setup...
>>>
>>> if I do a sc.stop() , then create a new one, it seems to work fine.
>>>
>>> But it isn't really documented anywhere and all the existing
>>> documentation is now invalid because you get an exception when you try to
>>> create a new spark context.
>>>
>>> On Tue, Sep 13, 2016 at 11:13 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 I think this works in a shell but you need to allow multiple spark
 contexts

 Spark context Web UI available at http://50.140.197.217:5
 Spark context available as 'sc' (master = local, app id =
 local-1473789661846).
 Spark session available as 'spark'.
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 2.0.0
   /_/
 Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
 1.8.0_77)
 Type in expressions to have them evaluated.
 Type :help for more information.

 scala> import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext
 scala> val conf = new SparkConf().setMaster("local[2]").setAppName("CountingSheep").set("spark.driver.allowMultipleContexts", "true")
 conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@bb5f9d
 scala> val sc = new SparkContext(conf)
 sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@4888425d


 HTH


 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 13 September 2016 at 18:57, Sean Owen  wrote:

> But you're in the shell there, which already has a SparkContext for
> you as sc.
>
> On Tue, Sep 13, 2016 at 6:49 PM, Kevin Burton 
> wrote:
>
>> I'm rather confused here as to what to do about creating a new
>> SparkContext.
>>
>> Spark 2.0 prevents it... (exception included below)
>>
>> yet a TON of examples I've seen basically tell you to create a new
>> SparkContext as standard practice:
>>
>> http://spark.apache.org/docs/latest/configuration.html#dynam
>> ically-loading-spark-properties
>>
>> val conf = new SparkConf()
>>  .setMaster("local[2]")
>>  .setAppName("CountingSheep")val sc = new SparkContext(conf)
>>
>>
>> I'm specifically running into a problem in that ES hadoop won't work
>> with its settings and I think its related to this problme.
>>
>> Do we have to call sc.stop() first and THEN create a new spark
>> context?
>>
>> That works,, but I can't find any documentation anywhere telling us
>> the right course of action.
>>
>>
>>
>> scala> val sc = new SparkContext();
>> org.apache.spark.SparkException: Only one SparkContext may be
>> running in this JVM (see SPARK-2243). To ignore this error, set
>> spark.driver.allowMultipleContexts = true. The currently running
>> SparkContext was created at:
>> org.apache.spark.sql.SparkSession$Builder.getOrCreate(
>> SparkSession.scala:823)
>> org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>> (:15)
>> (:31)
>> (:33)
>> .(:37)
>> .()
>> .$print$lzycompute(:7)
>> .$print(:6)
>> $print()
>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>> ssorImpl.java:62)
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>> 

Character encoding corruption in Spark JDBC connector

2016-09-13 Thread Mark Bittmann
Hello Spark community,

I'm reading from a MySQL database into a Spark dataframe using the JDBC
connector functionality, and I'm experiencing some character encoding
issues. The default encoding for MySQL strings is latin1, but the mysql JDBC
connector implementation of "ResultSet.getString()" will return an
incorrect encoding of the data for certain characters such as the "all
rights reserved" char. Instead, you can use "new
String(ResultSet.getBytes())" which will return the correctly encoded
string. I've confirmed this behavior with the mysql connector classes
(i.e., without using the Spark wrapper).

I can see here that the Spark JDBC connector uses getString(), though there
is a note to move to getBytes() for performance reasons:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L389

For some special chars, I can reverse the behavior with a UDF that applies
new String(badString.getBytes("Cp1252") , "UTF-8"), however for some
languages the underlying byte array is irreversibly changed and the data is
corrupted.

I can submit an issue/PR to fix it going forward if "new
String(ResultSet.getBytes())" is the correct approach.

Meanwhile, can anyone offer any recommendations on how to correct this
behavior prior to it getting to a dataframe? I've tried every permutation
of the settings in the JDBC connection url (characterSetResults,
characterEncoding).

I'm on Spark 1.6.

Thanks!


Re: Spark scheduling mode

2016-09-02 Thread Mark Hamstra
And, no, Spark's scheduler will not preempt already running Tasks.  In
fact, just killing running Tasks for any reason is trickier than we'd like
it to be, so it isn't done by default:
https://issues.apache.org/jira/browse/SPARK-17064

On Fri, Sep 2, 2016 at 11:34 AM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> The comparator is used in `Pool#getSortedTaskSetQueue`.  The
> `TaskSchedulerImpl` calls that on the rootPool when the TaskScheduler needs
> to handle `resourceOffers` for available Executor cores.  Creation of the
> `sortedTaskSets` is a recursive, nested sorting of the `Schedulable`
> entities -- you can have pools within pools within pools within... if you
> really want to, but they eventually bottom out in TaskSetManagers.  The
> `sortedTaskSets` is a flattened queue of the TaskSets, and the available
> cores are offered to those TaskSets in that queued order until the next
> time the scheduler backend handles the available resource offers and a new
> `sortedTaskSets` is generated.
>
> On Fri, Sep 2, 2016 at 2:37 AM, enrico d'urso <e.du...@live.com> wrote:
>
>> Thank you.
>>
>> May I know when that comparator is called?
>> It looks like spark scheduler has not any form of preemption, am I right?
>>
>> Thank you
>> --
>> *From:* Mark Hamstra <m...@clearstorydata.com>
>> *Sent:* Thursday, September 1, 2016 8:44:10 PM
>>
>> *To:* enrico d'urso
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Spark scheduling mode
>>
>> Spark's FairSchedulingAlgorithm is not round robin:
>> https://github.com/apache/spark/blob/master/core/src/
>> main/scala/org/apache/spark/scheduler/SchedulingAlgorithm.scala#L43
>>
>> When at the scope of fair scheduling Jobs within a single Pool, the
>> Schedulable entities being handled (s1 and s2) are TaskSetManagers, which
>> are at the granularity of Stages, not Jobs.  Since weight is 1 and minShare
>> is 0 for TaskSetManagers, the FairSchedulingAlgorithm for TaskSetManagers
>> just boils down to prioritizing TaskSets (i.e. Stages) with the fewest
>> number of runningTasks.
>>
>> On Thu, Sep 1, 2016 at 11:23 AM, enrico d'urso <e.du...@live.com> wrote:
>>
>>> I tried it before, but still I am not able to see a proper round robin
>>> across the jobs I submit.
>>> Given this:
>>>
>>> <pool name="production">
>>>   <schedulingMode>FAIR</schedulingMode>
>>>   <weight>1</weight>
>>>   <minShare>2</minShare>
>>> </pool>
>>>
>>> Each jobs inside production pool should be scheduled in round robin way,
>>> am I right?
>>>
>>> --
>>> *From:* Mark Hamstra <m...@clearstorydata.com>
>>> *Sent:* Thursday, September 1, 2016 8:19:44 PM
>>> *To:* enrico d'urso
>>> *Cc:* user@spark.apache.org
>>> *Subject:* Re: Spark scheduling mode
>>>
>>> The default pool (``) can be configured like any
>>> other pool: https://spark.apache.org/docs/latest/job-scheduling.ht
>>> ml#configuring-pool-properties
>>>
>>> On Thu, Sep 1, 2016 at 11:11 AM, enrico d'urso <e.du...@live.com> wrote:
>>>
>>>> Is there a way to force scheduling to be fair *inside* the default
>>>> pool?
>>>> I mean, round robin for the jobs that belong to the default pool.
>>>>
>>>> Cheers,
>>>> --
>>>> *From:* Mark Hamstra <m...@clearstorydata.com>
>>>> *Sent:* Thursday, September 1, 2016 7:24:54 PM
>>>> *To:* enrico d'urso
>>>> *Cc:* user@spark.apache.org
>>>> *Subject:* Re: Spark scheduling mode
>>>>
>>>> Just because you've flipped spark.scheduler.mode to FAIR, that doesn't
>>>> mean that Spark can magically configure and start multiple scheduling pools
>>>> for you, nor can it know to which pools you want jobs assigned.  Without
>>>> doing any setup of additional scheduling pools or assigning of jobs to
>>>> pools, you're just dumping all of your jobs into the one available default
>>>> pool (which is now being fair scheduled with an empty set of other pools)
>>>> and the scheduling of jobs within that pool is still the default intra-pool
>>>> scheduling, FIFO -- i.e., you've effectively accomplished nothing by only
>>>> flipping spark.scheduler.mode to FAIR.
>>>>
>>>> On Thu, Sep 1, 2016 at 7:10 AM, enrico d'urso <e.du...@live.com> wrote:
>>>>
>>>>> I am building a Spark App, in which I submit several jobs (pyspark). I
>>>>> am using threads to run them in parallel, and also I am setting:
>>>>> conf.set("spark.scheduler.mode", "FAIR") Still, I see the jobs run
>>>>> serially in FIFO way. Am I missing something?
>>>>>
>>>>> Cheers,
>>>>>
>>>>>
>>>>> Enrico
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Spark scheduling mode

2016-09-02 Thread Mark Hamstra
The comparator is used in `Pool#getSortedTaskSetQueue`.  The
`TaskSchedulerImpl` calls that on the rootPool when the TaskScheduler needs
to handle `resourceOffers` for available Executor cores.  Creation of the
`sortedTaskSets` is a recursive, nested sorting of the `Schedulable`
entities -- you can have pools within pools within pools within... if you
really want to, but they eventually bottom out in TaskSetManagers.  The
`sortedTaskSets` is a flattened queue of the TaskSets, and the available
cores are offered to those TaskSets in that queued order until the next
time the scheduler backend handles the available resource offers and a new
`sortedTaskSets` is generated.

On Fri, Sep 2, 2016 at 2:37 AM, enrico d'urso <e.du...@live.com> wrote:

> Thank you.
>
> May I know when that comparator is called?
> It looks like spark scheduler has not any form of preemption, am I right?
>
> Thank you
> --
> *From:* Mark Hamstra <m...@clearstorydata.com>
> *Sent:* Thursday, September 1, 2016 8:44:10 PM
>
> *To:* enrico d'urso
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark scheduling mode
>
> Spark's FairSchedulingAlgorithm is not round robin: https://github.com/
> apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/
> SchedulingAlgorithm.scala#L43
>
> When at the scope of fair scheduling Jobs within a single Pool, the
> Schedulable entities being handled (s1 and s2) are TaskSetManagers, which
> are at the granularity of Stages, not Jobs.  Since weight is 1 and minShare
> is 0 for TaskSetManagers, the FairSchedulingAlgorithm for TaskSetManagers
> just boils down to prioritizing TaskSets (i.e. Stages) with the fewest
> number of runningTasks.
>
> On Thu, Sep 1, 2016 at 11:23 AM, enrico d'urso <e.du...@live.com> wrote:
>
>> I tried it before, but still I am not able to see a proper round robin
>> across the jobs I submit.
>> Given this:
>>
>> <pool name="production">
>>   <schedulingMode>FAIR</schedulingMode>
>>   <weight>1</weight>
>>   <minShare>2</minShare>
>> </pool>
>>
>> Each jobs inside production pool should be scheduled in round robin way,
>> am I right?
>>
>> --
>> *From:* Mark Hamstra <m...@clearstorydata.com>
>> *Sent:* Thursday, September 1, 2016 8:19:44 PM
>> *To:* enrico d'urso
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Spark scheduling mode
>>
>> The default pool (``) can be configured like any
>> other pool: https://spark.apache.org/docs/latest/job-scheduling.
>> html#configuring-pool-properties
>>
>> On Thu, Sep 1, 2016 at 11:11 AM, enrico d'urso <e.du...@live.com> wrote:
>>
>>> Is there a way to force scheduling to be fair *inside* the default pool?
>>> I mean, round robin for the jobs that belong to the default pool.
>>>
>>> Cheers,
>>> --
>>> *From:* Mark Hamstra <m...@clearstorydata.com>
>>> *Sent:* Thursday, September 1, 2016 7:24:54 PM
>>> *To:* enrico d'urso
>>> *Cc:* user@spark.apache.org
>>> *Subject:* Re: Spark scheduling mode
>>>
>>> Just because you've flipped spark.scheduler.mode to FAIR, that doesn't
>>> mean that Spark can magically configure and start multiple scheduling pools
>>> for you, nor can it know to which pools you want jobs assigned.  Without
>>> doing any setup of additional scheduling pools or assigning of jobs to
>>> pools, you're just dumping all of your jobs into the one available default
>>> pool (which is now being fair scheduled with an empty set of other pools)
>>> and the scheduling of jobs within that pool is still the default intra-pool
>>> scheduling, FIFO -- i.e., you've effectively accomplished nothing by only
>>> flipping spark.scheduler.mode to FAIR.
>>>
>>> On Thu, Sep 1, 2016 at 7:10 AM, enrico d'urso <e.du...@live.com> wrote:
>>>
>>>> I am building a Spark App, in which I submit several jobs (pyspark). I
>>>> am using threads to run them in parallel, and also I am setting:
>>>> conf.set("spark.scheduler.mode", "FAIR") Still, I see the jobs run
>>>> serially in FIFO way. Am I missing something?
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> Enrico
>>>>
>>>
>>>
>>
>


Re: Spark scheduling mode

2016-09-01 Thread Mark Hamstra
Spark's FairSchedulingAlgorithm is not round robin:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulingAlgorithm.scala#L43

When at the scope of fair scheduling Jobs within a single Pool, the
Schedulable entities being handled (s1 and s2) are TaskSetManagers, which
are at the granularity of Stages, not Jobs.  Since weight is 1 and minShare
is 0 for TaskSetManagers, the FairSchedulingAlgorithm for TaskSetManagers
just boils down to prioritizing TaskSets (i.e. Stages) with the fewest
number of runningTasks.

On Thu, Sep 1, 2016 at 11:23 AM, enrico d'urso <e.du...@live.com> wrote:

> I tried it before, but still I am not able to see a proper round robin
> across the jobs I submit.
> Given this:
>
> <pool name="production">
>   <schedulingMode>FAIR</schedulingMode>
>   <weight>1</weight>
>   <minShare>2</minShare>
> </pool>
>
> Each jobs inside production pool should be scheduled in round robin way,
> am I right?
>
> --
> *From:* Mark Hamstra <m...@clearstorydata.com>
> *Sent:* Thursday, September 1, 2016 8:19:44 PM
> *To:* enrico d'urso
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark scheduling mode
>
> The default pool (``) can be configured like any
> other pool: https://spark.apache.org/docs/latest/job-
> scheduling.html#configuring-pool-properties
>
> On Thu, Sep 1, 2016 at 11:11 AM, enrico d'urso <e.du...@live.com> wrote:
>
>> Is there a way to force scheduling to be fair *inside* the default pool?
>> I mean, round robin for the jobs that belong to the default pool.
>>
>> Cheers,
>> --
>> *From:* Mark Hamstra <m...@clearstorydata.com>
>> *Sent:* Thursday, September 1, 2016 7:24:54 PM
>> *To:* enrico d'urso
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Spark scheduling mode
>>
>> Just because you've flipped spark.scheduler.mode to FAIR, that doesn't
>> mean that Spark can magically configure and start multiple scheduling pools
>> for you, nor can it know to which pools you want jobs assigned.  Without
>> doing any setup of additional scheduling pools or assigning of jobs to
>> pools, you're just dumping all of your jobs into the one available default
>> pool (which is now being fair scheduled with an empty set of other pools)
>> and the scheduling of jobs within that pool is still the default intra-pool
>> scheduling, FIFO -- i.e., you've effectively accomplished nothing by only
>> flipping spark.scheduler.mode to FAIR.
>>
>> On Thu, Sep 1, 2016 at 7:10 AM, enrico d'urso <e.du...@live.com> wrote:
>>
>>> I am building a Spark App, in which I submit several jobs (pyspark). I
>>> am using threads to run them in parallel, and also I am setting:
>>> conf.set("spark.scheduler.mode", "FAIR") Still, I see the jobs run
>>> serially in FIFO way. Am I missing something?
>>>
>>> Cheers,
>>>
>>>
>>> Enrico
>>>
>>
>>
>


Re: Spark scheduling mode

2016-09-01 Thread Mark Hamstra
The default pool (``) can be configured like any
other pool:
https://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties

On Thu, Sep 1, 2016 at 11:11 AM, enrico d'urso <e.du...@live.com> wrote:

> Is there a way to force scheduling to be fair *inside* the default pool?
> I mean, round robin for the jobs that belong to the default pool.
>
> Cheers,
> --
> *From:* Mark Hamstra <m...@clearstorydata.com>
> *Sent:* Thursday, September 1, 2016 7:24:54 PM
> *To:* enrico d'urso
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark scheduling mode
>
> Just because you've flipped spark.scheduler.mode to FAIR, that doesn't
> mean that Spark can magically configure and start multiple scheduling pools
> for you, nor can it know to which pools you want jobs assigned.  Without
> doing any setup of additional scheduling pools or assigning of jobs to
> pools, you're just dumping all of your jobs into the one available default
> pool (which is now being fair scheduled with an empty set of other pools)
> and the scheduling of jobs within that pool is still the default intra-pool
> scheduling, FIFO -- i.e., you've effectively accomplished nothing by only
> flipping spark.scheduler.mode to FAIR.
>
> On Thu, Sep 1, 2016 at 7:10 AM, enrico d'urso <e.du...@live.com> wrote:
>
>> I am building a Spark App, in which I submit several jobs (pyspark). I am
>> using threads to run them in parallel, and also I am setting:
>> conf.set("spark.scheduler.mode", "FAIR") Still, I see the jobs run
>> serially in FIFO way. Am I missing something?
>>
>> Cheers,
>>
>>
>> Enrico
>>
>
>


Re: Spark scheduling mode

2016-09-01 Thread Mark Hamstra
Just because you've flipped spark.scheduler.mode to FAIR, that doesn't mean
that Spark can magically configure and start multiple scheduling pools for
you, nor can it know to which pools you want jobs assigned.  Without doing
any setup of additional scheduling pools or assigning of jobs to pools,
you're just dumping all of your jobs into the one available default pool
(which is now being fair scheduled with an empty set of other pools) and
the scheduling of jobs within that pool is still the default intra-pool
scheduling, FIFO -- i.e., you've effectively accomplished nothing by only
flipping spark.scheduler.mode to FAIR.
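
A sketch of the missing setup (Scala shown for brevity; the same properties apply from
PySpark). The allocation file path is a placeholder, and the pool name should match one
defined in your fairscheduler.xml, e.g. the "production" pool quoted earlier in this
thread:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

// In each submitting thread, before starting that thread's jobs:
sc.setLocalProperty("spark.scheduler.pool", "production")
sc.parallelize(1 to 1000000).count()               // scheduled in the "production" pool
sc.setLocalProperty("spark.scheduler.pool", null)  // revert to the default pool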

On Thu, Sep 1, 2016 at 7:10 AM, enrico d'urso  wrote:

> I am building a Spark App, in which I submit several jobs (pyspark). I am
> using threads to run them in parallel, and also I am setting:
> conf.set("spark.scheduler.mode", "FAIR") Still, I see the jobs run
> serially in FIFO way. Am I missing something?
>
> Cheers,
>
>
> Enrico
>


Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
That's often not as important as you might think.  It really only affects
the loading of data by the first Stage.  Subsequent Stages (in the same Job
or even in other Jobs if you do it right) will use the map outputs, and
will do so with good data locality.

On Thu, Aug 25, 2016 at 3:36 PM, ayan guha <guha.a...@gmail.com> wrote:

> At the core of it map reduce relies heavily on data locality. You would
> lose the ability to process data closest to where it resides if you do not
> use hdfs.
> S3 or NFS will not able to provide that.
> On 26 Aug 2016 07:49, "kant kodali" <kanth...@gmail.com> wrote:
>
>> yeah so its seems like its work in progress. At very least Mesos took the
>> initiative to provide alternatives to ZK. I am just really looking forward
>> for this.
>>
>> https://issues.apache.org/jira/browse/MESOS-3797
>>
>>
>>
>> On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io
>> wrote:
>>
>>> Mesos also uses ZK for leader election.  There seems to be some effort
>>> in supporting etcd, but it's in progress: https://issues.apache.org/jira
>>> /browse/MESOS-1806
>>>
>>> On Thu, Aug 25, 2016 at 1:55 PM, kant kodali <kanth...@gmail.com> wrote:
>>>
>>> @Ofir @Sean very good points.
>>>
>>> @Mike We dont use Kafka or Hive and I understand that Zookeeper can do
>>> many things but for our use case all we need is for high availability and
>>> given the devops people frustrations here in our company who had extensive
>>> experience managing large clusters in the past we would be very happy to
>>> avoid Zookeeper. I also heard that Mesos can provide High Availability
>>> through etcd and consul and if that is true I will be left with the
>>> following stack
>>>
>>> Spark + Mesos scheduler + Distributed File System or to be precise I
>>> should say Distributed Storage since S3 is an object store so I guess this
>>> will be HDFS for us + etcd & consul. Now the big question for me is how do
>>> I set all this up
>>>
>>>
>>>
>>> On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
>>>
>>> Just to add one concrete example regarding HDFS dependency.
>>> Have a look at checkpointing https://spark.ap
>>> ache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
>>> For example, for Spark Streaming, you can not do any window operation in
>>> a cluster without checkpointing to HDFS (or S3).
>>>
>>> Ofir Manor
>>>
>>> Co-Founder & CTO | Equalum
>>>
>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>
>>> On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>> Hi Kant,
>>>
>>> I trust the following would be of use.
>>>
>>> Big Data depends on Hadoop Ecosystem from whichever angle one looks at
>>> it.
>>>
>>> In the heart of it and with reference to points you raised about HDFS,
>>> one needs to have a working knowledge of Hadoop Core System including HDFS,
>>> Map-reduce algorithm and Yarn whether one uses them or not. After all Big
>>> Data is all about horizontal scaling with master and nodes (as opposed to
>>> vertical scaling like SQL Server running on a Host). and distributed data
>>> (by default data is replicated three times on different nodes for
>>> scalability and availability).
>>>
>>> Other members including Sean provided the limits on how far one operate
>>> Spark in its own space. If you are going to deal with data (data in motion
>>> and data at rest), then you will need to interact with some form of storage
>>> and HDFS and compatible file systems like S3 are the natural choices.
>>>
>>> Zookeeper is not just about high availability. It is used in Spark
>>> Streaming with Kafka, it is also used with Hive for concurrency. It is also
>>> a distributed locking system.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage o

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
s/paying a role/playing a role/

On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> One way you can start to make this make more sense, Sean, is if you
> exploit the code/data duality so that the non-distributed data that you are
> sending out from the driver is actually paying a role more like code (or at
> least parameters.)  What is sent from the driver to an Executer is then
> used (typically as seeds or parameters) to execute some procedure on the
> Worker node that generates the actual data on the Workers.  After that, you
> proceed to execute in a more typical fashion with Spark using the
> now-instantiated distributed data.
>
> But I don't get the sense that this meta-programming-ish style is really
> what the OP was aiming at.
>
> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Without a distributed storage system, your application can only create
>> data on the driver and send it out to the workers, and collect data back
>> from the workers. You can't read or write data in a distributed way. There
>> are use cases for this, but pretty limited (unless you're running on 1
>> machine).
>>
>> I can't really imagine a serious use of (distributed) Spark without
>> (distribute) storage, in a way I don't think many apps exist that don't
>> read/write data.
>>
>> The premise here is not just replication, but partitioning data across
>> compute resources. With a distributed file system, your big input exists
>> across a bunch of machines and you can send the work to the pieces of data.
>>
>> On Thu, Aug 25, 2016 at 7:57 PM, kant kodali <kanth...@gmail.com> wrote:
>>
>>> @Mich I understand why I would need Zookeeper. It is there for fault
>>> tolerance given that spark is a master-slave architecture and when a master
>>> goes down zookeeper will run a leader election algorithm to elect a new
>>> leader; however, DevOps hate Zookeeper and would be much happier to go with
>>> etcd & consul, and it looks like if we use the mesos scheduler we should be able to
>>> drop Zookeeper.
>>>
>>> HDFS I am still trying to understand why I would need for spark. I
>>> understand the purpose of distributed file systems in general but I don't
>>> understand in the context of spark since many people say you can run a
>>> spark distributed cluster in a stand alone mode but I am not sure what are
>>> its pros/cons if we do it that way. In a hadoop world I understand that one
>>> of the reasons HDFS is there is for replication other words if we write
>>> some data to a HDFS it will store that block across different nodes such
>>> that if one of nodes goes down it can still retrieve that block from other
>>> nodes. In the context of spark I am not really sure because 1) I am new 2)
>>> Spark paper says it doesn't replicate data instead it stores the
>>> lineage(all the transformations) such that it can reconstruct it.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com
>>> wrote:
>>>
>>>> You can use Spark on Oracle as a query tool.
>>>>
>>>> It all depends on the mode of the operation.
>>>>
>>>> If you running Spark with yarn-client/cluster then you will need yarn.
>>>> It comes as part of Hadoop core (HDFS, Map-reduce and Yarn).
>>>>
>>>> I have not gone and installed Yarn without installing Hadoop.
>>>>
>>>> What is the overriding reason to have the Spark on its own?
>>>>
>>>>  You can use Spark in Local or Standalone mode if you do not want
>>>> Hadoop core.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 24 August 2016 at 21:54, kant kodali <kanth...@gmail.com> wrote:
>>>>
>>>> What do I loose if I run spark without using HDFS or Zookeper ? which
>>>> of them is almost a must in practice?
>>>>
>>>>
>>>>
>>
>


Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
One way you can start to make this make more sense, Sean, is if you exploit
the code/data duality so that the non-distributed data that you are sending
out from the driver is actually playing a role more like code (or at least
parameters).  What is sent from the driver to an Executor is then used
(typically as seeds or parameters) to execute some procedure on the Worker
node that generates the actual data on the Workers.  After that, you
proceed to execute in a more typical fashion with Spark using the
now-instantiated distributed data.

But I don't get the sense that this meta-programming-ish style is really
what the OP was aiming at.
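
A minimal sketch of that pattern, spark-shell style (so sc already exists); the
seed count and the random-number generation step are just made-up illustrations
of "ship parameters out, build the data on the Workers":

import scala.util.Random

val seeds = sc.parallelize(1 to 100, 100)      // only these small seeds leave the driver

val bigData = seeds.flatMap { seed =>          // each task builds its partition's data
  val rng = new Random(seed)                   // locally on the Worker from its seed
  Seq.fill(100000)(rng.nextDouble())
}

bigData.count()                                // now work with the distributed data as usual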

On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen  wrote:

> Without a distributed storage system, your application can only create
> data on the driver and send it out to the workers, and collect data back
> from the workers. You can't read or write data in a distributed way. There
> are use cases for this, but pretty limited (unless you're running on 1
> machine).
>
> I can't really imagine a serious use of (distributed) Spark without
> (distributed) storage, in a way I don't think many apps exist that don't
> read/write data.
>
> The premise here is not just replication, but partitioning data across
> compute resources. With a distributed file system, your big input exists
> across a bunch of machines and you can send the work to the pieces of data.
>
> On Thu, Aug 25, 2016 at 7:57 PM, kant kodali  wrote:
>
>> @Mich I understand why I would need Zookeeper. It is there for fault
>> tolerance: given that Spark is a master-slave architecture, when a master
>> goes down Zookeeper will run a leader election algorithm to elect a new
>> leader. However, DevOps hate Zookeeper; they would be much happier to go
>> with etcd & consul, and it looks like if we use the Mesos scheduler we
>> should be able to drop Zookeeper.
>>
>> I am still trying to understand why I would need HDFS for Spark. I
>> understand the purpose of distributed file systems in general, but I don't
>> understand it in the context of Spark, since many people say you can run a
>> Spark distributed cluster in standalone mode, but I am not sure what
>> its pros/cons are if we do it that way. In a Hadoop world I understand that
>> one of the reasons HDFS is there is for replication; in other words, if we
>> write some data to HDFS it will store that block across different nodes
>> such that if one of the nodes goes down it can still retrieve that block
>> from other nodes. In the context of Spark I am not really sure, because 1)
>> I am new and 2) the Spark paper says it doesn't replicate data; instead it
>> stores the lineage (all the transformations) such that it can reconstruct it.
>>
>>
>>
>>
>>
>>
>> On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com
>> wrote:
>>
>>> You can use Spark on Oracle as a query tool.
>>>
>>> It all depends on the mode of the operation.
>>>
>>> If you are running Spark with yarn-client/cluster then you will need Yarn.
>>> It comes as part of Hadoop core (HDFS, Map-reduce and Yarn).
>>>
>>> I have not gone and installed Yarn without installing Hadoop.
>>>
>>> What is the overriding reason to have the Spark on its own?
>>>
>>>  You can use Spark in Local or Standalone mode if you do not want Hadoop
>>> core.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 24 August 2016 at 21:54, kant kodali  wrote:
>>>
>>> What do I lose if I run Spark without using HDFS or Zookeeper? Which of
>>> them is almost a must in practice?
>>>
>>>
>>>
>


Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Mark Hamstra
That's not going to happen on the user list, since that is against ASF
policy (http://www.apache.org/dev/release.html):

During the process of developing software and preparing a release, various
> packages are made available to the developer community for testing
> purposes. Do not include any links on the project website that might
> encourage non-developers to download and use nightly builds, snapshots,
> release candidates, or any other similar package. The only people who are
> supposed to know about such packages are the people following the dev list
> (or searching its archives) and thus aware of the conditions placed on the
> package. If you find that the general public are downloading such test
> packages, then remove them.
>

On Tue, Aug 9, 2016 at 11:32 AM, Chris Fregly <ch...@fregly.com> wrote:

> this is a valid question.  there are many people building products and
> tooling on top of spark and would like access to the latest snapshots and
> such.  today's ink is yesterday's news to these people - including myself.
>
> what is the best way to get snapshot releases including nightly and
> specially-blessed "preview" releases so that we, too, can say "try the
> latest release in our product"?
>
> there was a lot of chatter during the 2.0.0/2.0.1 release that i largely
> ignored because of conflicting/confusing/changing responses.  and i'd
> rather not dig through jenkins builds to figure this out as i'll likely get
> it wrong.
>
> please provide the relevant snapshot/preview/nightly/whatever repos (or
> equivalent) that we need to include in our builds to have access to the
> absolute latest build assets for every major and minor release.
>
> thanks!
>
> -chris
>
>
> On Tue, Aug 9, 2016 at 10:00 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> LOL
>>
>> Ink has not dried on Spark 2 yet so to speak :)
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 9 August 2016 at 17:56, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>>> What are you expecting to find?  There currently are no releases beyond
>>> Spark 2.0.0.
>>>
>>> On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma <jestinwith.a...@gmail.com>
>>> wrote:
>>>
>>>> If we want to use versions of Spark beyond the official 2.0.0 release,
>>>> specifically on Maven + Java, what steps should we take to upgrade? I can't
>>>> find the newer versions on Maven central.
>>>>
>>>> Thank you!
>>>> Jestin
>>>>
>>>
>>>
>>
>
>
> --
> *Chris Fregly*
> Research Scientist @ PipelineIO
> San Francisco, CA
> pipeline.io
> advancedspark.com
>
>


Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Mark Hamstra
What are you expecting to find?  There currently are no releases beyond
Spark 2.0.0.

On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma  wrote:

> If we want to use versions of Spark beyond the official 2.0.0 release,
> specifically on Maven + Java, what steps should we take to upgrade? I can't
> find the newer versions on Maven central.
>
> Thank you!
> Jestin
>


Re: how to order data in descending order in spark dataset

2016-07-30 Thread Mark Wusinich
> ts.groupBy("b").count().orderBy(col("count"), ascending=False)

Sent from my iPhone

> On Jul 30, 2016, at 2:54 PM, Don Drake  wrote:
> 
> Try:
> 
> ts.groupBy("b").count().orderBy(col("count").desc());
> 
> -Don
> 
>> On Sat, Jul 30, 2016 at 1:30 PM, Tony Lane  wrote:
>> just to clarify I am trying to do this in Java
>> 
>> ts.groupBy("b").count().orderBy("count");
>> 
>> 
>> 
>>> On Sun, Jul 31, 2016 at 12:00 AM, Tony Lane  wrote:
>>> ts.groupBy("b").count().orderBy("count");
>>> 
>>> how can I order this data in descending order of count
>>> Any suggestions
>>> 
>>> -Tony 
> 
> 
> 
> -- 
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake
> 800-733-2143
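
For the record, the "ascending=False" keyword used above is the Python (PySpark)
form; in the Scala and Java DataFrame API the same ordering is expressed with
desc, as in Don's reply. A quick sketch (Scala, assuming a DataFrame ts with a
column "b" as in this thread):

import org.apache.spark.sql.functions.{col, desc}

ts.groupBy("b").count().orderBy(col("count").desc)
// or, equivalently:
ts.groupBy("b").count().orderBy(desc("count"))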


Re: SPARK Exception thrown in awaitResult

2016-07-28 Thread Mark Hamstra
Don't use Spark 2.0.0-preview.  That was a preview release with known
issues, and was intended to be used only for early, pre-release testing
purpose.  Spark 2.0.0 is now released, and you should be using that.

On Thu, Jul 28, 2016 at 3:48 AM, Carlo.Allocca 
wrote:

> and, of course I am using
>
>  <dependency>
>    <groupId>org.apache.spark</groupId>
>    <artifactId>spark-core_2.11</artifactId>
>    <version>2.0.0-preview</version>
>  </dependency>
>
>
>  <dependency>
>    <groupId>org.apache.spark</groupId>
>    <artifactId>spark-sql_2.11</artifactId>
>    <version>2.0.0-preview</version>
>    <type>jar</type>
>  </dependency>
>
>
> Is the below problem/issue related to the experimental version of SPARK
> 2.0.0.
>
> Many Thanks for your help and support.
>
> Best Regards,
> carlo
>
> On 28 Jul 2016, at 11:14, Carlo.Allocca  wrote:
>
> I have also found the following two related links:
>
> 1)
> https://github.com/apache/spark/commit/947b9020b0d621bc97661a0a056297e6889936d3
> 2) https://github.com/apache/spark/pull/12433
>
> which both explain why it happens but nothing about what to do to solve
> it.
>
> Do you have any suggestion/recommendation?
>
> Many thanks.
> Carlo
>
> On 28 Jul 2016, at 11:06, carlo allocca  wrote:
>
> Hi Rui,
>
> Thanks for the promptly reply.
> No, I am not using Mesos.
>
> Ok. I am writing a code to build a suitable dataset for my needs as in the
> following:
>
> == Session configuration:
>
>  SparkSession spark = SparkSession
> .builder()
> .master("local[6]") //
> .appName("DatasetForCaseNew")
> .config("spark.executor.memory", "4g")
> .config("spark.shuffle.blockTransferService", "nio")
> .getOrCreate();
>
>
> public Dataset<Row> buildDataset(){
> ...
>
> // STEP A
> // Join prdDS with cmpDS
> Dataset<Row> prdDS_Join_cmpDS
> = res1
>   .join(res2,
> (res1.col("PRD_asin#100")).equalTo(res2.col("CMP_asin")), "inner");
>
> prdDS_Join_cmpDS.take(1);
>
> // STEP B
> // Join prdDS with cmpDS
> Dataset<Row> prdDS_Join_cmpDS_Join
> = prdDS_Join_cmpDS
>   .join(res3,
> prdDS_Join_cmpDS.col("PRD_asin#100").equalTo(res3.col("ORD_asin")),
> "inner");
> prdDS_Join_cmpDS_Join.take(1);
> prdDS_Join_cmpDS_Join.show();
>
> }
>
>
> The exception is thrown when the computation reach the STEP B, until STEP
> A is fine.
>
> Is there anything wrong or missing?
>
> Thanks for your help in advance.
>
> Best Regards,
> Carlo
>
>
>
>
>
> === STACK TRACE
>
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 422.102
> sec <<< FAILURE!
> testBuildDataset(org.mksmart.amaretto.ml.DatasetPerHourVerOneTest)  Time
> elapsed: 421.994 sec  <<< ERROR!
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
> at
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:102)
> at
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
> at
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
> at
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
> at
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
> at
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
> at
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.consume(SortMergeJoinExec.scala:35)
> at
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.doProduce(SortMergeJoinExec.scala:565)
> at
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
> at
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
> at
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.produce(SortMergeJoinExec.scala:35)
> at
> 

Re: standalone mode only supports FIFO scheduler across applications ? still in spark 2.0 time ?

2016-07-15 Thread Mark Hamstra
Nothing has changed in that regard, nor is there likely to be "progress",
since more sophisticated or capable resource scheduling at the Application
level is really beyond the design goals for standalone mode.  If you want
more in the way of multi-Application resource scheduling, then you should
be looking at Yarn or Mesos.  Is there some reason why neither of those
options can work for you?

On Fri, Jul 15, 2016 at 9:15 AM, Teng Qiu  wrote:

> Hi,
>
>
> http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/spark-standalone.html#resource-scheduling
> The standalone cluster mode currently only supports a simple FIFO
> scheduler across applications.
>
> is this sentence still true? any progress on this? it will really
> helpful. some roadmap?
>
> Thanks
>
> Teng
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Mark Vervuurt
Thanks Mich,

we have got it working using the example below ;)

Mark

> On 11 Jul 2016, at 09:45, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Hi Mark,
> 
> Hm. It should work. This is Spark 1.6.1 on Oracle 12c
>  
>  
> scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> HiveContext: org.apache.spark.sql.hive.HiveContext = 
> org.apache.spark.sql.hive.HiveContext@70f446c
>  
> scala> var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
> _ORACLEserver: String = jdbc:oracle:thin:@rhes564:1521:mydb12
>  
> scala> var _username : String = "sh"
> _username: String = sh
>  
> scala> var _password : String = ""
> _password: String = sh
>  
> scala> val c = HiveContext.load("jdbc",
>  | Map("url" -> _ORACLEserver,
>  | "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC 
> FROM sh.channels)",
>  | "user" -> _username,
>  | "password" -> _password))
> warning: there were 1 deprecation warning(s); re-run with -deprecation for 
> details
> c: org.apache.spark.sql.DataFrame = [CHANNEL_ID: string, CHANNEL_DESC: string]
>  
> scala> c.registerTempTable("t_c")
>  
> scala> c.count
> res2: Long = 5
>  
> scala> HiveContext.sql("select * from t_c").collect.foreach(println)
> [3,Direct Sales]
> [9,Tele Sales]
> [5,Catalog]
> [4,Internet]
> [2,Partners]
>  
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 11 July 2016 at 08:25, Mark Vervuurt <m.a.vervu...@gmail.com 
> <mailto:m.a.vervu...@gmail.com>> wrote:
> Hi Mich,
> 
> sorry for bothering did you manage to solve your problem? We have a similar 
> problem with Spark 1.5.2 using a JDBC connection with a DataFrame to an 
> Oracle Database.
> 
> Thanks,
> Mark
> 
>> On 12 Feb 2016, at 11:45, Mich Talebzadeh <m...@peridale.co.uk 
>> <mailto:m...@peridale.co.uk>> wrote:
>> 
>> Hi,
>>  
>> I use the following to connect to Oracle DB from Spark shell 1.5.2
>>  
>> spark-shell --master spark://50.140.197.217:7077 --driver-class-path 
>> /home/hduser/jars/ojdbc6.jar
>>  
>> in Scala I do
>>  
>> scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> sqlContext: org.apache.spark.sql.SQLContext = 
>> org.apache.spark.sql.SQLContext@f9d4387
>>  
>> scala> val channels = sqlContext.read.format("jdbc").options(
>>  |  Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>>  |  "dbtable" -> "(select * from sh.channels where channel_id = 14)",
>>  |  "user" -> "sh",
>>  |   "password" -> "xxx")).load
>> channels: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(0,-127), 
>> CHANNEL_DESC: string, CHANNEL_CLASS: string, CHANNEL_CLASS_ID: 
>> decimal(0,-127), CHANNEL_TOTAL: string, CHANNEL_TOTAL_ID: decimal(0,-127)]
>>  
>> scala> channels.count()
>>  
>> But the latter command keeps hanging?
>>  
>> Any ideas appreciated
>>  
>> Thanks,
>>  
>> Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> NOTE: The information in this email is proprietary and confidential. This 
>> message is for the designated recipient only, if you are not the intended 
>> recipient, you should destroy it immediately. Any information in this 
>> message shall not be understood as given or endorsed by Peridale Technology 
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
>> the responsibility of the recipient to ensure that this email is virus free, 
>> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
>> employees accept any responsibility.
> 
> Met vriendelijke groet | Best regards,
> ___
> 
> Ir. Mark Vervuurt
> Senior Big Data Scientist | Insights & Data
> 
> Capgemini Nederland | Utrecht
> Tel.: +31 30 6890978 <tel:%2B31%2030%206890978> – Mob.: +31653670390 
> <tel:%2B31653670390>
> www.capgemini.com <http://www.capgemini.com/>
> 
>  People matter, results count.
> __
> 
> 
> 
> 



Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Mark Vervuurt
Hi Mich,

sorry for bothering did you manage to solve your problem? We have a similar 
problem with Spark 1.5.2 using a JDBC connection with a DataFrame to an Oracle 
Database.

Thanks,
Mark

> On 12 Feb 2016, at 11:45, Mich Talebzadeh <m...@peridale.co.uk 
> <mailto:m...@peridale.co.uk>> wrote:
> 
> Hi,
>  
> I use the following to connect to Oracle DB from Spark shell 1.5.2
>  
> spark-shell --master spark://50.140.197.217:7077 
>  --driver-class-path /home/hduser/jars/ojdbc6.jar
>  
> in Scala I do
>  
> scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext: org.apache.spark.sql.SQLContext = 
> org.apache.spark.sql.SQLContext@f9d4387
>  
> scala> val channels = sqlContext.read.format("jdbc").options(
>  |  Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>  |  "dbtable" -> "(select * from sh.channels where channel_id = 14)",
>  |  "user" -> "sh",
>  |   "password" -> "xxx")).load
> channels: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(0,-127), 
> CHANNEL_DESC: string, CHANNEL_CLASS: string, CHANNEL_CLASS_ID: 
> decimal(0,-127), CHANNEL_TOTAL: string, CHANNEL_TOTAL_ID: decimal(0,-127)]
>  
> scala> channels.count()
>  
> But the latter command keeps hanging?
>  
> Any ideas appreciated
>  
> Thanks,
>  
> Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Peridale Technology Ltd, its 
> subsidiaries or their employees, unless expressly so stated. It is the 
> responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
> employees accept any responsibility.

Met vriendelijke groet | Best regards,
___

Ir. Mark Vervuurt
Senior Big Data Scientist | Insights & Data

Capgemini Nederland | Utrecht
Tel.: +31 30 6890978 – Mob.: +31653670390
www.capgemini.com <http://www.capgemini.com/>
 People matter, results count.
__





Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Mark Hamstra
I mean only that hardware-level threads and the processor's scheduling of
those threads are only one segment of the total space of threads and thread
scheduling, and that saying things like "cores have threads" or "only the
core schedules threads" can be more confusing than helpful.

On Thu, Jun 16, 2016 at 11:33 AM, Mich Talebzadeh <mich.talebza...@gmail.com
> wrote:

> Well LOL
>
> Given a set of parameters one can argue from any angle.
>
> It is not obvious what you are trying to sate here? "It is not strictly
> true"  yeah OK
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 16 June 2016 at 19:07, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> In addition, it is the core (not the OS) that determines when the thread
>>> is executed.
>>
>>
>> That's also not strictly true.  "Thread" is a concept that can exist at
>> multiple levels -- even concurrently at multiple levels for a single
>> running program.  Different entities will be responsible for scheduling the
>> execution of threads at these different levels, and the CPU is only in
>> direct control at the lowest level, that of so-called hardware threads.  Of
>> course, higher-level threads eventually need to be run as lower-level
>> hardware tasks, and the mappings between various types of application-level
>> threads and OS- and/or hardware-level threads can be complicated, but it is
>> still not helpful to think of the CPU as being the only entity responsible
>> for the scheduling of threads.
>>
>> On Thu, Jun 16, 2016 at 7:45 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks all.
>>>
>>> I think we are diverging but IMO it is a worthwhile discussion
>>>
>>> Actually, threads are a hardware implementation - hence the whole notion
>>> of “multi-threaded cores”.   What happens is that the cores often have
>>> duplicate registers, etc. for holding execution state.   While it is
>>> correct that only a single process is executing at a time, a single core
>>> will have execution states of multiple processes preserved in these
>>> registers. In addition, it is the core (not the OS) that determines when
>>> the thread is executed. The approach often varies according to the CPU
>>> manufacturer, but the most simple approach is when one thread of execution
>>> executes a multi-cycle operation (e.g. a fetch from main memory, etc.), the
>>> core simply stops processing that thread saves the execution state to a set
>>> of registers, loads instructions from the other set of registers and goes
>>> on.  On the Oracle SPARC chips, it will actually check the next thread to
>>> see if the reason it was ‘parked’ has completed and if not, skip it for the
>>> subsequent thread. The OS is only aware of what are cores and what are
>>> logical processors - and dispatches accordingly.  *Execution is up to
>>> the cores*. .
>>>
>>> Cheers
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 16 June 2016 at 13:02, Robin East <robin.e...@xense.co.uk> wrote:
>>>
>>>> Mich
>>>>
>>>> >> A core may have one or more threads
>>>> It would be more accurate to say that a core could *run* one or more
>>>> threads scheduled for execution. Threads are a software/OS concept that
>>>> represent executable code that is scheduled to run by the OS; A CPU, core
>>>> or virtual core/virtual processor execute that code. Threads are not CPUs
>>>> or cores whether physical or logical - any Spark documentation that implies
>>>> this is mistaken. I’ve looked at the documentation you mention and I don’t
>>>> read it to mean that threads are logical processors.
>>>>
>>>> To go back to your original question, if you set local[6] and you have
>>>> 12 logical processors then you are likely to have half your CPU resources
>>>> unused by Spark.
>&

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Mark Hamstra
>
> In addition, it is the core (not the OS) that determines when the thread
> is executed.


That's also not strictly true.  "Thread" is a concept that can exist at
multiple levels -- even concurrently at multiple levels for a single
running program.  Different entities will be responsible for scheduling the
execution of threads at these different levels, and the CPU is only in
direct control at the lowest level, that of so-called hardware threads.  Of
course, higher-level threads eventually need to be run as lower-level
hardware tasks, and the mappings between various types of application-level
threads and OS- and/or hardware-level threads can be complicated, but it is
still not helpful to think of the CPU as being the only entity responsible
for the scheduling of threads.

On Thu, Jun 16, 2016 at 7:45 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks all.
>
> I think we are diverging but IMO it is a worthwhile discussion
>
> Actually, threads are a hardware implementation - hence the whole notion
> of “multi-threaded cores”.   What happens is that the cores often have
> duplicate registers, etc. for holding execution state.   While it is
> correct that only a single process is executing at a time, a single core
> will have execution states of multiple processes preserved in these
> registers. In addition, it is the core (not the OS) that determines when
> the thread is executed. The approach often varies according to the CPU
> manufacturer, but the most simple approach is when one thread of execution
> executes a multi-cycle operation (e.g. a fetch from main memory, etc.), the
> core simply stops processing that thread saves the execution state to a set
> of registers, loads instructions from the other set of registers and goes
> on.  On the Oracle SPARC chips, it will actually check the next thread to
> see if the reason it was ‘parked’ has completed and if not, skip it for the
> subsequent thread. The OS is only aware of what are cores and what are
> logical processors - and dispatches accordingly.  *Execution is up to the
> cores*. .
>
> Cheers
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 16 June 2016 at 13:02, Robin East <robin.e...@xense.co.uk> wrote:
>
>> Mich
>>
>> >> A core may have one or more threads
>> It would be more accurate to say that a core could *run* one or more
>> threads scheduled for execution. Threads are a software/OS concept that
>> represent executable code that is scheduled to run by the OS; A CPU, core
>> or virtual core/virtual processor execute that code. Threads are not CPUs
>> or cores whether physical or logical - any Spark documentation that implies
>> this is mistaken. I’ve looked at the documentation you mention and I don’t
>> read it to mean that threads are logical processors.
>>
>> To go back to your original question, if you set local[6] and you have 12
>> logical processors then you are likely to have half your CPU resources
>> unused by Spark.
>>
>>
>> On 15 Jun 2016, at 23:08, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> I think it is slightly more than that.
>>
>> These days  software is licensed by core (generally speaking).   That is
>> the physical processor.   * A core may have one or more threads - or
>> logical processors*. Virtualization adds some fun to the mix.
>> Generally what they present is ‘virtual processors’.   What that equates to
>> depends on the virtualization layer itself.   In some simpler VM’s - it is
>> virtual=logical.   In others, virtual=logical but they are constrained to
>> be from the same cores - e.g. if you get 6 virtual processors, it really is
>> 3 full cores with 2 threads each.   Rational is due to the way OS
>> dispatching works on ‘logical’ processors vs. cores and POSIX threaded
>> applications.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 13 June 2016 at 18:17, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>>> I don't know what documentation you were referring to, but this is
>>> clearly an erroneous statement: "Threads are virtual cores

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Mark Hamstra
>
> Actually, threads are a hardware implementation - hence the whole notion
> of “multi-threaded cores”.


No, a multi-threaded core is a core that supports multiple concurrent
threads of execution, not a core that has multiple threads.  The
terminology and marketing around multi-core processors, hyper threading and
virtualization are confusing enough without taking the further step of
misapplying software-specific terms to hardware components.

On Thu, Jun 16, 2016 at 7:45 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks all.
>
> I think we are diverging but IMO it is a worthwhile discussion
>
> Actually, threads are a hardware implementation - hence the whole notion
> of “multi-threaded cores”.   What happens is that the cores often have
> duplicate registers, etc. for holding execution state.   While it is
> correct that only a single process is executing at a time, a single core
> will have execution states of multiple processes preserved in these
> registers. In addition, it is the core (not the OS) that determines when
> the thread is executed. The approach often varies according to the CPU
> manufacturer, but the most simple approach is when one thread of execution
> executes a multi-cycle operation (e.g. a fetch from main memory, etc.), the
> core simply stops processing that thread saves the execution state to a set
> of registers, loads instructions from the other set of registers and goes
> on.  On the Oracle SPARC chips, it will actually check the next thread to
> see if the reason it was ‘parked’ has completed and if not, skip it for the
> subsequent thread. The OS is only aware of what are cores and what are
> logical processors - and dispatches accordingly.  *Execution is up to the
> cores*. .
>
> Cheers
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 16 June 2016 at 13:02, Robin East <robin.e...@xense.co.uk> wrote:
>
>> Mich
>>
>> >> A core may have one or more threads
>> It would be more accurate to say that a core could *run* one or more
>> threads scheduled for execution. Threads are a software/OS concept that
>> represent executable code that is scheduled to run by the OS; A CPU, core
>> or virtual core/virtual processor execute that code. Threads are not CPUs
>> or cores whether physical or logical - any Spark documentation that implies
>> this is mistaken. I’ve looked at the documentation you mention and I don’t
>> read it to mean that threads are logical processors.
>>
>> To go back to your original question, if you set local[6] and you have 12
>> logical processors then you are likely to have half your CPU resources
>> unused by Spark.
>>
>>
>> On 15 Jun 2016, at 23:08, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> I think it is slightly more than that.
>>
>> These days  software is licensed by core (generally speaking).   That is
>> the physical processor.   * A core may have one or more threads - or
>> logical processors*. Virtualization adds some fun to the mix.
>> Generally what they present is ‘virtual processors’.   What that equates to
>> depends on the virtualization layer itself.   In some simpler VM’s - it is
>> virtual=logical.   In others, virtual=logical but they are constrained to
>> be from the same cores - e.g. if you get 6 virtual processors, it really is
>> 3 full cores with 2 threads each.   Rational is due to the way OS
>> dispatching works on ‘logical’ processors vs. cores and POSIX threaded
>> applications.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 13 June 2016 at 18:17, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>>> I don't know what documentation you were referring to, but this is
>>> clearly an erroneous statement: "Threads are virtual cores."  At best it is
>>> terminology abuse by a hardware manufacturer.  Regardless, Spark can't get
>>> too concerned about how any particular hardware vendor wants to refer to
>>> the specific components of their CPU architecture.  For us, a core is a
>>> logical execution unit, something on which a thread of ex

Re: What is the interpretation of Cores in Spark doc

2016-06-13 Thread Mark Hamstra
I don't know what documentation you were referring to, but this is clearly
an erroneous statement: "Threads are virtual cores."  At best it is
terminology abuse by a hardware manufacturer.  Regardless, Spark can't get
too concerned about how any particular hardware vendor wants to refer to
the specific components of their CPU architecture.  For us, a core is a
logical execution unit, something on which a thread of execution can run.
That can map in different ways to different physical or virtual hardware.

On Mon, Jun 13, 2016 at 12:02 AM, Mich Talebzadeh  wrote:

> Hi,
>
> It is not the issue of testing anything. I was referring to documentation
> that clearly use the term "threads". As I said and showed before, one line
> is using the term "thread" and the next one "logical cores".
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 12 June 2016 at 23:57, Daniel Darabos  > wrote:
>
>> Spark is a software product. In software a "core" is something that a
>> process can run on. So it's a "virtual core". (Do not call these "threads".
>> A "thread" is not something a process can run on.)
>>
>> local[*] uses java.lang.Runtime.availableProcessors()
>> .
>> Since Java is software, this also returns the number of virtual cores. (You
>> can test this easily.)
>>
>>
>> On Sun, Jun 12, 2016 at 9:23 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>> Hi,
>>>
>>> I was writing some docs on Spark P and came across this.
>>>
>>> It is about the terminology or interpretation of that in Spark doc.
>>>
>>> This is my understanding of cores and threads.
>>>
>>>  Cores are physical cores. Threads are virtual cores. Cores with 2
>>> threads is called hyper threading technology so 2 threads per core makes
>>> the core work on two loads at same time. In other words, every thread takes
>>> care of one load.
>>>
>>> Core has its own memory. So if you have a dual core with hyper
>>> threading, the core works with 2 loads each at same time because of the 2
>>> threads per core, but this 2 threads will share memory in that core.
>>>
>>> Some vendors as I am sure most of you aware charge licensing per core.
>>>
>>> For example on the same host that I have Spark, I have a SAP product
>>> that checks the licensing and shuts the application down if the license
>>> does not agree with the cores speced.
>>>
>>> This is what it says
>>>
>>> ./cpuinfo
>>> License hostid:00e04c69159a 0050b60fd1e7
>>> Detected 12 logical processor(s), 6 core(s), in 1 chip(s)
>>>
>>> So here I have 12 logical processors  and 6 cores and 1 chip. I call
>>> logical processors as threads so I have 12 threads?
>>>
>>> Now if I go and start worker process ${SPARK_HOME}/sbin/start-slaves.sh,
>>> I see this in GUI page
>>>
>>> [image: Inline images 1]
>>>
>>> it says 12 cores but I gather it is threads?
>>>
>>>
>>> Spark document
>>> 
>>> states and I quote
>>>
>>>
>>> [image: Inline images 2]
>>>
>>>
>>>
>>> OK the line local[k] adds  ..  *set this to the number of cores on your
>>> machine*
>>>
>>>
>>> But I know that it means threads. Because if I went and set that to 6,
>>> it would be only 6 threads as opposed to 12 threads.
>>>
>>>
>>> the next line local[*] seems to indicate it correctly as it refers to
>>> "logical cores" that in my understanding it is threads.
>>>
>>>
>>> I trust that I am not nitpicking here!
>>>
>>>
>>> Cheers,
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>
>>
>
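
For what it's worth, the quickest way to see what Spark treats as "cores" in
local mode is to compare what the JVM reports with the resulting default
parallelism. A small sketch (Spark 1.x local mode; the app name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// local[*] asks for one task slot per logical processor reported by the JVM;
// local[6] on a 12-logical-processor machine would use only half of them.
val logical = Runtime.getRuntime.availableProcessors()

val sc = new SparkContext(new SparkConf().setAppName("cores-check").setMaster("local[*]"))
println(s"availableProcessors = $logical, defaultParallelism = ${sc.defaultParallelism}")
sc.stop()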


Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Mark Hamstra
But when you talk about optimizing the DAG, it really doesn't make sense to
also talk about transformation steps as separate entities.  The
DAGScheduler knows about Jobs, Stages, TaskSets and Tasks.  The
TaskScheduler knows about TaskSets and Tasks.  Neither of them understands
the transformation steps that you used to define your RDD -- at least not
as separable, distinct steps.  To give the kind of
transformation-step-oriented information that you want would require parts
of Spark that don't currently concern themselves at all with RDD
transformation steps to start tracking them and how they map to Jobs,
Stages, TaskSets and Tasks -- and when you start talking about Datasets and
Spark SQL, you then need to start talking about tracking and mapping
concepts like Plans, Schemas and Queries.  It would introduce significant
new complexity.

On Wed, May 25, 2016 at 6:59 PM, Nirav Patel <npa...@xactlycorp.com> wrote:

> Hi Mark,
>
> I might have said stage instead of step in my last statement "UI just
> says Collect failed but in fact it could be any stage in that lazy chain of
> evaluation."
>
> Anyways even you agree that this visibility of underlaying steps wont't be
> available. which does pose difficulties in terms of troubleshooting as well
> as optimizations at step level. I think users will have hard time without
> this. Its great that spark community working on different levels of
> internal optimizations but its also important to give enough visibility
> to users to enable them to debug issues and resolve bottleneck.
> There is also no visibility into how spark utilizes shuffle memory space
> vs user memory space vs cache space. It's a separate topic though. If
> everything is working magically as a black box then it's fine but when you
> have large number of people on this site complaining about  OOM and shuffle
> error all the time you need to start providing some transparency to
> address that.
>
> Thanks
>
>
> On Wed, May 25, 2016 at 6:41 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> You appear to be misunderstanding the nature of a Stage.  Individual
>> transformation steps such as `map` do not define the boundaries of Stages.
>> Rather, a sequence of transformations in which there is only a
>> NarrowDependency between each of the transformations will be pipelined into
>> a single Stage.  It is only when there is a ShuffleDependency that a new
>> Stage will be defined -- i.e. shuffle boundaries define Stage boundaries.
>> With whole stage code gen in Spark 2.0, there will be even less opportunity
>> to treat individual transformations within a sequence of narrow
>> dependencies as though they were discrete, separable entities.  The Failed
>> Stages portion of the Web UI will tell you which Stage in a Job failed, and
>> the accompanying error log message will generally also give you some idea
>> of which Tasks failed and why.  Tracing the error back further and at a
>> different level of abstraction to lay blame on a particular transformation
>> wouldn't be particularly easy.
>>
>> On Wed, May 25, 2016 at 5:28 PM, Nirav Patel <npa...@xactlycorp.com>
>> wrote:
>>
>>> It's great that spark scheduler does optimized DAG processing and only
>>> does lazy eval when some action is performed or shuffle dependency is
>>> encountered. Sometime it goes further after shuffle dep before executing
>>> anything. e.g. if there are map steps after shuffle then it doesn't stop at
>>> shuffle to execute anything but goes to that next map steps until it finds
>>> a reason(spark action) to execute. As a result stage that spark is running
>>> can be internally series of (map -> shuffle -> map -> map -> collect) and
>>> spark UI just shows its currently running 'collect' stage. SO  if job fails
>>> at that point spark UI just says Collect failed but in fact it could be any
>>> stage in that lazy chain of evaluation. Looking at executor logs gives some
>>> insights but that's not always straightforward.
>>> Correct me if I am wrong here but I think we need more visibility into
>>> what's happening underneath so we can easily troubleshoot as well as
>>> optimize our DAG.
>>>
>>> THanks
>>>
>>>
>>>
>>> [image: What's New with Xactly] <http://www.xactlycorp.com/email-click/>
>>>
>>> <https://www.nyse.com/quote/XNYS:XTLY>  [image: LinkedIn]
>>> <https://www.linkedin.com/company/xactly-corporation>  [image: Twitter]
>>> <https://twitter.com/Xactly>  [image: Facebook]
>>> <https://www.facebook.com/XactlyCorp>  [image: YouTube]
>>> <http://www.youtube.com/xactlycorporation>
>>
>>
>>
>
>
>
> [image: What's New with Xactly] <http://www.xactlycorp.com/email-click/>
>
> <https://www.nyse.com/quote/XNYS:XTLY>  [image: LinkedIn]
> <https://www.linkedin.com/company/xactly-corporation>  [image: Twitter]
> <https://twitter.com/Xactly>  [image: Facebook]
> <https://www.facebook.com/XactlyCorp>  [image: YouTube]
> <http://www.youtube.com/xactlycorporation>
>


Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Mark Hamstra
You appear to be misunderstanding the nature of a Stage.  Individual
transformation steps such as `map` do not define the boundaries of Stages.
Rather, a sequence of transformations in which there is only a
NarrowDependency between each of the transformations will be pipelined into
a single Stage.  It is only when there is a ShuffleDependency that a new
Stage will be defined -- i.e. shuffle boundaries define Stage boundaries.
With whole stage code gen in Spark 2.0, there will be even less opportunity
to treat individual transformations within a sequence of narrow
dependencies as though they were discrete, separable entities.  The Failed
Stages portion of the Web UI will tell you which Stage in a Job failed, and
the accompanying error log message will generally also give you some idea
of which Tasks failed and why.  Tracing the error back further and at a
different level of abstraction to lay blame on a particular transformation
wouldn't be particularly easy.
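
A small illustration of those boundaries (RDD API, spark-shell style; the input
path is hypothetical): the narrow flatMap/map steps are pipelined into a single
Stage, the shuffle introduced by reduceByKey starts a new one, and toDebugString
makes the split visible.

val counts = sc.textFile("hdfs:///some/input")   // hypothetical path
  .flatMap(_.split("\\s+"))                      // narrow dependency
  .map(w => (w, 1))                              // narrow dependency -> same Stage
  .reduceByKey(_ + _)                            // ShuffleDependency -> new Stage

println(counts.toDebugString)                    // indentation marks the Stage boundary at the shuffle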

On Wed, May 25, 2016 at 5:28 PM, Nirav Patel  wrote:

> It's great that spark scheduler does optimized DAG processing and only
> does lazy eval when some action is performed or shuffle dependency is
> encountered. Sometime it goes further after shuffle dep before executing
> anything. e.g. if there are map steps after shuffle then it doesn't stop at
> shuffle to execute anything but goes to that next map steps until it finds
> a reason(spark action) to execute. As a result stage that spark is running
> can be internally series of (map -> shuffle -> map -> map -> collect) and
> spark UI just shows its currently running 'collect' stage. SO  if job fails
> at that point spark UI just says Collect failed but in fact it could be any
> stage in that lazy chain of evaluation. Looking at executor logs gives some
> insights but that's not always straightforward.
> Correct me if I am wrong here but I think we need more visibility into
> what's happening underneath so we can easily troubleshoot as well as
> optimize our DAG.
>
> THanks
>
>
>
> [image: What's New with Xactly] 
>
>   [image: LinkedIn]
>   [image: Twitter]
>   [image: Facebook]
>   [image: YouTube]
> 


Re: DeepSpark: where to start

2016-05-05 Thread Mark Vervuurt
Well, you got me fooled as well ;)
Had it on my to-do list to dive into this new component...

Mark

> Op 5 mei 2016 om 07:06 heeft Derek Chan <derek...@gmail.com> het volgende 
> geschreven:
> 
> The blog post is a April Fool's joke. Read the last line in the post:
> 
> https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-spark.html
> 
> 
> 
>> On Thursday, May 05, 2016 10:42 AM, Joice Joy wrote:
>> I am trying to find info on deepspark. I read the article on databricks blog 
>> which doesn't mention a git repo but does say it's open source.
>> Help me find the git repo for this. I found two and not sure which one is
>> the databricks deepspark:
>> https://github.com/deepspark/deepspark
>> https://github.com/nearbydelta/deepspark
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



EMR Spark Custom Metrics

2016-04-21 Thread Mark Kelly
Hi,

So I would like some custom metrics.

The environment we use is AWS EMR 4.5.0 with spark 1.6.1 and Ganglia.

The code snippet below shows how we register custom metrics (this worked in EMR
4.2.0 with Spark 1.5.2).

package org.apache.spark.metrics.source

import com.codahale.metrics._
import org.apache.spark.{Logging, SparkEnv}

class SparkInstrumentation(val prefix: String) extends Serializable with Logging {

  class InstrumentationSource(val prefix: String) extends Source {

    log.info(s"Starting spark instrumentation with prefix $prefix")

    override val sourceName = prefix
    override val metricRegistry = new MetricRegistry()

    def registerCounter(name: String): Counter = {
      metricRegistry.counter(MetricRegistry.name(name))
    }

    def registerTimer(name: String): Timer = {
      metricRegistry.timer(MetricRegistry.name(name))
    }
  }

  val source = new InstrumentationSource(prefix)

  def register() {
    SparkEnv.get.metricsSystem.registerSource(source)
  }

}
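
For reference, a minimal usage sketch of the class above (the "myapp" prefix and
the metric names are arbitrary; it has to run where SparkEnv.get returns a live
environment, e.g. on the driver after the SparkContext is created):

val instrumentation = new org.apache.spark.metrics.source.SparkInstrumentation("myapp")
instrumentation.register()                               // registers the source with Spark's metrics system

val requestCounter = instrumentation.source.registerCounter("requests")
val requestTimer   = instrumentation.source.registerTimer("request-latency")

requestCounter.inc()                                     // codahale Counter
val timing = requestTimer.time()                         // codahale Timer.Context
// ... work being measured ...
timing.stop()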

This unfortunately no longer works.

How is it possible to create custom metrics?

Thanks



Re: Apache Flink

2016-04-17 Thread Mark Hamstra
To be fair, the Stratosphere project from which Flink springs was started
as a collaborative university research project in Germany about the same
time that Spark was first released as Open Source, so they are near
contemporaries rather than Flink having been started only well after Spark
was an established and widely-used Apache project.

On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh 
wrote:

> Also it always amazes me why they are so many tangential projects in Big
> Data space? Would not it be easier if efforts were spent on adding to Spark
> functionality rather than creating a new product like Flink?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 21:08, Mich Talebzadeh 
> wrote:
>
>> Thanks Corey for the useful info.
>>
>> I have used Sybase Aleri and StreamBase as commercial CEPs engines.
>> However, there does not seem to be anything close to these products in
>> Hadoop Ecosystem. So I guess there is nothing there?
>>
>> Regards.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 April 2016 at 20:43, Corey Nolet  wrote:
>>
>>> i have not been intrigued at all by the microbatching concept in Spark.
>>> I am used to CEP in real streams processing environments like Infosphere
>>> Streams & Storm where the granularity of processing is at the level of each
>>> individual tuple and processing units (workers) can react immediately to
>>> events being received and processed. The closest Spark streaming comes to
>>> this concept is the notion of "state" that that can be updated via the
>>> "updateStateBykey()" functions which are only able to be run in a
>>> microbatch. Looking at the expected design changes to Spark Streaming in
>>> Spark 2.0.0, it also does not look like tuple-at-a-time processing is on
>>> the radar for Spark, though I have seen articles stating that more effort
>>> is going to go into the Spark SQL layer in Spark streaming which may make
>>> it more reminiscent of Esper.
>>>
>>> For these reasons, I have not even tried to implement CEP in Spark. I
>>> feel it's a waste of time without immediate tuple-at-a-time processing.
>>> Without this, they avoid the whole problem of "back pressure" (though keep
>>> in mind, it is still very possible to overload the Spark streaming layer
>>> with stages that will continue to pile up and never get worked off) but
>>> they lose the granular control that you get in CEP environments by allowing
>>> the rules & processors to react with the receipt of each tuple, right away.
>>>
>>> Awhile back, I did attempt to implement an InfoSphere Streams-like API
>>> [1] on top of Apache Storm as an example of what such a design may look
>>> like. It looks like Storm is going to be replaced in the not so distant
>>> future by Twitter's new design called Heron. IIRC, Heron does not have an
>>> open source implementation as of yet.
>>>
>>> [1] https://github.com/calrissian/flowmix
>>>
>>> On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi Corey,

 Can you please point me to docs on using Spark for CEP? Do we have a
 set of CEP libraries somewhere. I am keen on getting hold of adaptor
 libraries for Spark something like below



 ​
 Thanks


 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 17 April 2016 at 16:07, Corey Nolet  wrote:

> One thing I've noticed about Flink in my following of the project has
> been that it has established, in a few cases, some novel ideas and
> improvements over Spark. The problem with it, however, is that both the
> development team and the community around it are very small and many of
> those novel improvements have been rolled directly into Spark in 
> subsequent
> versions. I was considering changing over my architecture to Flink at one
> point to get better, more real-time CEP streaming support, but in the end 
> I
> decided to stick with Spark and just watch Flink continue to pressure it
> into improvement.
>
> On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers 
> wrote:
>
>> i never found much info that flink was actually designed to be fault

Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Mark Hamstra
That's also available in standalone.
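
For anyone looking for the knobs: dynamic allocation is driven by the standard
configuration properties below, and on standalone the external shuffle service
just needs to be enabled on the workers. A rough sketch (the min/max values are
arbitrary examples):

val conf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")          // external shuffle service on each worker
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")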

On Thu, Apr 14, 2016 at 12:47 PM, Alexander Pivovarov 
wrote:

> Spark on Yarn supports dynamic resource allocation
>
> So, you can run several spark-shells / spark-submits / spark-jobserver /
> zeppelin on one cluster without defining upfront how many executors /
> memory you want to allocate to each app
>
> Great feature for regular users who just want to run Spark / Spark SQL
>
>
> On Thu, Apr 14, 2016 at 12:05 PM, Sean Owen  wrote:
>
>> I don't think usage is the differentiating factor. YARN and standalone
>> are pretty well supported. If you are only running a Spark cluster by
>> itself with nothing else, standalone is probably simpler than setting
>> up YARN just for Spark. However if you're running on a cluster that
>> will host other applications, you'll need to integrate with a shared
>> resource manager and its security model, and for anything
>> Hadoop-related that's YARN. Standalone wouldn't make as much sense.
>>
>> On Thu, Apr 14, 2016 at 6:46 PM, Alexander Pivovarov
>>  wrote:
>> > AWS EMR includes Spark on Yarn
>> > Hortonworks and Cloudera platforms include Spark on Yarn as well
>> >
>> >
>> > On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz <
>> arkadiusz.b...@gmail.com>
>> > wrote:
>> >>
>> >> Hello,
>> >>
>> >> Is there any statistics regarding YARN vs Standalone Spark Usage in
>> >> production ?
>> >>
>> >> I would like to choose most supported and used technology in
>> >> production for our project.
>> >>
>> >>
>> >> BR,
>> >>
>> >> Arkadiusz Bicz
>> >>
>> >> -
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org
>> >>
>> >
>>
>
>


Re: Spark GUI, Workers and Executors

2016-04-09 Thread Mark Hamstra
https://spark.apache.org/docs/latest/cluster-overview.html

On Sat, Apr 9, 2016 at 12:28 AM, Ashok Kumar 
wrote:

> On Spark GUI I can see the list of Workers.
>
> I always understood that workers are used by executors.
>
> What is the relationship between workers and executors please. Is it one
> to one?
>
> Thanks
>


Re: Executor shutdown hooks?

2016-04-06 Thread Mark Hamstra
Why would the Executors shut down when the Job is terminated?  Executors are
bound to Applications, not Jobs.  Furthermore,
unless spark.job.interruptOnCancel is set to true, canceling the Job at the
Application and DAGScheduler level won't actually interrupt the Tasks
running on the Executors.  If you do have interruptOnCancel set, then you
can catch the interrupt exception within the Task.
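
A rough sketch of the interruptOnCancel route (doLongRunningWork and cleanUp are
hypothetical placeholders): the job group is created with interruptOnCancel =
true, and the task body catches the interrupt to do its cleanup before the task
dies.

sc.setJobGroup("long-job", "cancellable work", interruptOnCancel = true)

val result = sc.parallelize(1 to 1000).map { i =>
  try {
    doLongRunningWork(i)              // hypothetical user function
  } catch {
    case e: InterruptedException =>
      cleanUp()                       // hypothetical cleanup before the task dies
      throw e
  }
}
result.count()                        // some action to actually run the job

// From another thread on the driver, cancellation would be:
// sc.cancelJobGroup("long-job")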

On Wed, Apr 6, 2016 at 12:24 PM, Sung Hwan Chung  wrote:

> Hi,
>
> I'm looking for ways to add shutdown hooks to executors : i.e., when a Job
> is forcefully terminated before it finishes.
>
> The scenario goes like this: executors are running a long-running job
> within a 'map' function. The user decides to terminate the job, then the
> mappers should perform some cleanups before going offline.
>
> What would be the best way to do this?
>


Re: Spark and N-tier architecture

2016-03-29 Thread Mark Hamstra
Our difference is mostly over whether n-tier means what it meant long ago,
or whether it is a malleable concept that can be stretched without breaking
to cover newer architectures.  As I said before, if n-tier helps you think
about Spark, then use it; if it doesn't, don't force it.

On Tue, Mar 29, 2016 at 5:44 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Mark,
>
> I beg I agree to differ on the interpretation of N-tier architecture.
> Agreed that 3-tier and by extrapolation N-tier have been around since days
> of client-server architecture. However, they are as valid today as 20 years
> ago. I believe the main recent expansion of n-tier has been on horizontal
> scaling and Spark by means of its clustering capability contributes to this
> model.
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 30 March 2016 at 00:22, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> Yes and no.  The idea of n-tier architecture is about 20 years older than
>> Spark and doesn't really apply to Spark as n-tier was original conceived.
>> If the n-tier model helps you make sense of some things related to Spark,
>> then use it; but don't get hung up on trying to force a Spark architecture
>> into an outdated model.
>>
>> On Tue, Mar 29, 2016 at 5:02 PM, Ashok Kumar <
>> ashok34...@yahoo.com.invalid> wrote:
>>
>>> Thank you both.
>>>
>>> So am I correct that Spark fits in within the application tier in N-tier
>>> architecture?
>>>
>>>
>>> On Tuesday, 29 March 2016, 23:50, Alexander Pivovarov <
>>> apivova...@gmail.com> wrote:
>>>
>>>
>>> Spark is a distributed data processing engine plus distributed in-memory
>>> / disk data cache
>>>
>>> spark-jobserver provides REST API to your spark applications. It allows
>>> you to submit jobs to spark and get results in sync or async mode
>>>
>>> It also can create long running Spark context to cache RDDs in memory
>>> with some name (namedRDD) and then use it to serve requests from multiple
>>> users. Because RDD is in memory response should be super fast (seconds)
>>>
>>> https://github.com/spark-jobserver/spark-jobserver
>>>
>>>
>>> On Tue, Mar 29, 2016 at 2:50 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>> Interesting question.
>>>
>>> The most widely used application of N-tier is the traditional three-tier
>>> architecture that has been the backbone of Client-server architecture by
>>> having presentation layer, application layer and data layer. This is
>>> primarily for performance, scalability and maintenance. The most profound
>>> changes that Big data space has introduced to N-tier architecture is the
>>> concept of horizontal scaling as opposed to the previous tiers that relied
>>> on vertical scaling. HDFS is an example of horizontal scaling at the data
>>> tier by adding more JBODS to storage. Similarly adding more nodes to Spark
>>> cluster should result in better performance.
>>>
>>> Bear in mind that these tiers are at logical levels, which means that
>>> there may or may not be as many physical layers. For example, multiple
>>> virtual servers can be hosted on the same physical server.
>>>
>>> With regard to Spark, it is effectively a powerful query tool that sits
>>> in between the presentation layer (say Tableau) and the HDFS or Hive as you
>>> alluded. In that sense you can think of Spark as part of the application
>>> layer that communicates with the backend via a number of protocols
>>> including the standard JDBC. There is rather a blurred vision here whether
>>> Spark is a database or query tool. IMO it is a query tool in a sense that
>>> Spark by itself does not have its own storage concept or metastore. Thus it
>>> relies on others to provide that service.
>>>
>>> HTH
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> On 29 March 2016 at 22:07, Ashok Kumar <ashok34...@yahoo.com.invalid>
>>> wrote:
>>>
>>> Experts,
>>>
>>> One of terms used and I hear is N-tier architecture within Big Data used
>>> for availability, performance etc. I also hear that Spark by means of its
>>> query engine and in-memory caching fits into middle tier (application
>>> layer) with HDFS and Hive may be providing the data tier.  Can someone
>>> elaborate the role of Spark here. For example, a Scala program that we write
>>> uses JDBC to talk to databases so in that sense is Spark a middle tier
>>> application?
>>>
>>> I hope that someone can clarify this and if so what would the best
>>> practice in using Spark as middle tier and within Big data.
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>
>


Re: Spark and N-tier architecture

2016-03-29 Thread Mark Hamstra
Yes and no.  The idea of n-tier architecture is about 20 years older than
Spark and doesn't really apply to Spark as n-tier was originally conceived.
If the n-tier model helps you make sense of some things related to Spark,
then use it; but don't get hung up on trying to force a Spark architecture
into an outdated model.

On Tue, Mar 29, 2016 at 5:02 PM, Ashok Kumar 
wrote:

> Thank you both.
>
> So am I correct that Spark fits in within the application tier in N-tier
> architecture?
>
>
> On Tuesday, 29 March 2016, 23:50, Alexander Pivovarov <
> apivova...@gmail.com> wrote:
>
>
> Spark is a distributed data processing engine plus distributed in-memory /
> disk data cache
>
> spark-jobserver provides REST API to your spark applications. It allows
> you to submit jobs to spark and get results in sync or async mode
>
> It also can create long running Spark context to cache RDDs in memory with
> some name (namedRDD) and then use it to serve requests from multiple users.
> Because RDD is in memory response should be super fast (seconds)
>
> https://github.com/spark-jobserver/spark-jobserver
>
>
> On Tue, Mar 29, 2016 at 2:50 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> Interesting question.
>
> The most widely used application of N-tier is the traditional three-tier
> architecture that has been the backbone of Client-server architecture by
> having presentation layer, application layer and data layer. This is
> primarily for performance, scalability and maintenance. The most profound
> changes that Big data space has introduced to N-tier architecture is the
> concept of horizontal scaling as opposed to the previous tiers that relied
> on vertical scaling. HDFS is an example of horizontal scaling at the data
> tier by adding more JBODS to storage. Similarly adding more nodes to Spark
> cluster should result in better performance.
>
> Bear in mind that these tiers are at logical levels, which means that there
> may or may not be as many physical layers. For example, multiple virtual
> servers can be hosted on the same physical server.
>
> With regard to Spark, it is effectively a powerful query tool that sits
> in between the presentation layer (say Tableau) and the HDFS or Hive as you
> alluded. In that sense you can think of Spark as part of the application
> layer that communicates with the backend via a number of protocols
> including the standard JDBC. There is rather a blurred vision here whether
> Spark is a database or query tool. IMO it is a query tool in a sense that
> Spark by itself does not have its own storage concept or metastore. Thus it
> relies on others to provide that service.
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
>
> On 29 March 2016 at 22:07, Ashok Kumar 
> wrote:
>
> Experts,
>
> One of terms used and I hear is N-tier architecture within Big Data used
> for availability, performance etc. I also hear that Spark by means of its
> query engine and in-memory caching fits into middle tier (application
> layer) with HDFS and Hive may be providing the data tier.  Can someone
> elaborate the role of Spark here. For example, a Scala program that we write
> uses JDBC to talk to databases so in that sense is Spark a middle tier
> application?
>
> I hope that someone can clarify this and if so what would the best
> practice in using Spark as middle tier and within Big data.
>
> Thanks
>
>
>
>
>
>


Re: No active SparkContext

2016-03-24 Thread Mark Hamstra
You seem to be confusing the concepts of Job and Application.  A Spark
Application has a SparkContext.  A Spark Application is capable of running
multiple Jobs, each with its own ID, visible in the webUI.
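
For illustration, a sketch of one Application (one SparkContext) running several Jobs,
each labelled so it shows up separately in the web UI -- assuming a spark-shell session;
the group names and descriptions are illustrative:

val data = sc.parallelize(1 to 1000000, 100)

sc.setJobGroup("batch-1", "count records for batch 1")
data.count()                 // one Job, listed in the UI under the description above

sc.setJobGroup("batch-2", "sum records for batch 2")
data.map(_.toDouble).sum()   // another Job in the same Application, new description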

On Thu, Mar 24, 2016 at 6:11 AM, Max Schmidt  wrote:

> On 24.03.2016 at 10:34, Simon Hafner wrote:
>
> 2016-03-24 9:54 GMT+01:00 Max Schmidt :
> > we're using with the java-api (1.6.0) a ScheduledExecutor that
> continuously
> > executes a SparkJob to a standalone cluster.
> I'd recommend Scala.
>
> Why should I use scala?
>
>
> > After each job we close the JavaSparkContext and create a new one.
> Why do that? You can happily reuse it. Pretty sure that also causes
> the other problems, because you have a race condition on waiting for
> the job to finish and stopping the Context.
>
> I do that because it is a very common pattern to create an object for a
> specific "job" and release its resources when it's done.
>
> The first problem that came to mind was that the appName is immutable
> once the JavaSparkContext is created, so it is, to me, not possible to
> reuse the JavaSparkContext for jobs with different IDs (that we want to see
> in the webUI).
>
> And of course it is possible to wait for closing the JavaSparkContext
> gracefully, except when there is some asynchronous action in the background?
>
> --
> *Max Schmidt, Senior Java Developer* | m...@datapath.io |
> LinkedIn 
> [image: Datapath.io]
>
> Decreasing AWS latency.
> Your traffic optimized.
>
> Datapath.io GmbH
> Mainz | HRB Nr. 46222
> Sebastian Spies, CEO
>


Re: What's the benefit of RDD checkpoint against RDD save

2016-03-23 Thread Mark Hamstra
Yes, the terminology is being used sloppily/non-standardly in this thread
-- "the last RDD" after a series of transformation is the RDD at the
beginning of the chain, just now with an attached chain of "to be done"
transformations when an action is eventually run.  If the saveXXX action is
the only action being performed on the RDD, the rest of the chain being
purely transformations, then checkpointing instead of saving still wouldn't
execute any action on the RDD -- it would just mark the point at which
checkpointing should be done when an action is eventually run.
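
A minimal sketch of the difference, assuming a spark-shell session (the paths are
illustrative):

sc.setCheckpointDir("/tmp/spark-checkpoints")

val transformed = sc.textFile("/tmp/input.txt")
  .map(_.toUpperCase)
  .filter(_.nonEmpty)

transformed.checkpoint()     // only marks the RDD; nothing is computed yet
transformed.cache()          // recommended, so the chain isn't computed twice
transformed.count()          // first action: runs the chain and writes the checkpoint

// An explicit save, by contrast, is itself an action and runs immediately:
transformed.saveAsTextFile("/tmp/output")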

On Wed, Mar 23, 2016 at 7:38 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. when I get the last RDD
> If I read Todd's first email correctly, the computation has been done.
> I could be wrong.
>
> On Wed, Mar 23, 2016 at 7:34 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> Neither of you is making any sense to me.  If you just have an RDD for
>> which you have specified a series of transformations but you haven't run
>> any actions, then neither checkpointing nor saving makes sense -- you
>> haven't computed anything yet, you've only written out the recipe for how
>> the computation should be done when it is needed.  Neither does the "called
>> before any job" comment pose any restriction in this case since no jobs
>> have yet been executed on the RDD.
>>
>> On Wed, Mar 23, 2016 at 7:18 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> See the doc for checkpoint:
>>>
>>>* Mark this RDD for checkpointing. It will be saved to a file inside
>>> the checkpoint
>>>* directory set with `SparkContext#setCheckpointDir` and all
>>> references to its parent
>>>* RDDs will be removed. *This function must be called before any job
>>> has been*
>>> *   * executed on this RDD*. It is strongly recommended that this RDD
>>> is persisted in
>>>* memory, otherwise saving it on a file will require recomputation.
>>>
>>> From the above description, you should not call it at the end of
>>> transformations.
>>>
>>> Cheers
>>>
>>> On Wed, Mar 23, 2016 at 7:14 PM, Todd <bit1...@163.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a long computing chain. When I get the last RDD after a series
>>>> of transformations, I have two choices for what to do with this last RDD:
>>>>
>>>> 1. Call checkpoint on RDD to materialize it to disk
>>>> 2. Call RDD.saveXXX to save it to HDFS, and read it back for further
>>>> processing
>>>>
>>>> I would ask which choice is better? It looks to me that there is not much
>>>> difference between the two choices.
>>>> Thanks!
>>>>
>>>>
>>>>
>>>
>>
>


Re: What's the benefit of RDD checkpoint against RDD save

2016-03-23 Thread Mark Hamstra
Neither of you is making any sense to me.  If you just have an RDD for
which you have specified a series of transformations but you haven't run
any actions, then neither checkpointing nor saving makes sense -- you
haven't computed anything yet, you've only written out the recipe for how
the computation should be done when it is needed.  Neither does the "called
before any job" comment pose any restriction in this case since no jobs
have yet been executed on the RDD.

On Wed, Mar 23, 2016 at 7:18 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> See the doc for checkpoint:
>
>* Mark this RDD for checkpointing. It will be saved to a file inside
> the checkpoint
>* directory set with `SparkContext#setCheckpointDir` and all references
> to its parent
>* RDDs will be removed. *This function must be called before any job
> has been*
> *   * executed on this RDD*. It is strongly recommended that this RDD is
> persisted in
>* memory, otherwise saving it on a file will require recomputation.
>
> From the above description, you should not call it at the end of
> transformations.
>
> Cheers
>
> On Wed, Mar 23, 2016 at 7:14 PM, Todd <bit1...@163.com> wrote:
>
>> Hi,
>>
>> I have a long computing chain. When I get the last RDD after a series of
>> transformations, I have two choices for what to do with this last RDD:
>>
>> 1. Call checkpoint on RDD to materialize it to disk
>> 2. Call RDD.saveXXX to save it to HDFS, and read it back for further
>> processing
>>
>> I would ask which choice is better? It looks to me that there is not much
>> difference between the two choices.
>> Thanks!
>>
>>
>>
>


Re: Error using collectAsMap() in scala

2016-03-20 Thread Mark Hamstra
You're not getting what Ted is telling you.  Your `dict` is an RDD[String]
 -- i.e. it is a collection of a single value type, String.  But
`collectAsMap` is only defined for PairRDDs that have key-value pairs for
their data elements.  Both a key and a value are needed to collect into a
Map[K, V].
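
For illustration, a sketch of the fix -- assuming each line of the file holds a key and
a value separated by the first comma (that column layout is an assumption, not something
stated in the original post):

val dict = sc.textFile("path/to/csv/file")

// collectAsMap is defined only on RDDs of pairs (via PairRDDFunctions),
// so turn each line into a (key, value) tuple first.
val pairs = dict.map { line =>
  val cols = line.split(",", 2)   // assumes at least one comma per line
  (cols(0), cols(1))
}

val dictBroadcast = sc.broadcast(pairs.collectAsMap())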

On Sun, Mar 20, 2016 at 8:19 PM, Shishir Anshuman  wrote:

> yes I have included that class in my code.
> I guess it's something to do with the RDD format. Not able to figure out
> the exact reason.
>
> On Fri, Mar 18, 2016 at 9:27 AM, Ted Yu  wrote:
>
>> It is defined in:
>> core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
>>
>> On Thu, Mar 17, 2016 at 8:55 PM, Shishir Anshuman <
>> shishiranshu...@gmail.com> wrote:
>>
>>> I am using following code snippet in scala:
>>>
>>>
>>> *val dict: RDD[String] = sc.textFile("path/to/csv/file")*
>>> *val dict_broadcast=sc.broadcast(dict.collectAsMap())*
>>>
>>> On compiling It generates this error:
>>>
>>> *scala:42: value collectAsMap is not a member of
>>> org.apache.spark.rdd.RDD[String]*
>>>
>>>
>>> *val dict_broadcast=sc.broadcast(dict.collectAsMap())
>>>   ^*
>>>
>>
>>
>


Re: Spark UI Completed Jobs

2016-03-15 Thread Mark Hamstra
It's not just if the RDD is explicitly cached, but also if the map outputs
for stages have been materialized into shuffle files and are still
accessible through the map output tracker.  Because of that, explicitly
caching RDDs often gains you little or nothing, since even without a
call to cache() or persist() the prior computation will largely be reused
and stages will show up as skipped -- i.e. no need to recompute that stage.
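
A small sketch that reproduces the behaviour, assuming a spark-shell session and an
illustrative input path:

val counts = sc.textFile("/tmp/input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // introduces a shuffle; note there is no cache() anywhere

counts.count()     // job 1: both stages run
counts.collect()   // job 2: the shuffle-map stage is reported as skipped in the UI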

On Tue, Mar 15, 2016 at 5:50 PM, Jeff Zhang  wrote:

> If RDD is cached, this RDD is only computed once and the stages for
> computing this RDD in the following jobs are skipped.
>
>
> On Wed, Mar 16, 2016 at 8:14 AM, Prabhu Joseph  > wrote:
>
>> Hi All,
>>
>>
>> Spark UI Completed Jobs section shows below information, what is the
>> skipped value shown for Stages and Tasks below.
>>
>> Job_ID | Description | Submitted           | Duration | Stages (Succeeded/Total) | Tasks (for all stages): Succeeded/Total
>> 11     | count       | 2016/03/14 15:35:32 | 1.4 min  | 164/164 (163 skipped)    | 19841/19788 (41405 skipped)
>>
>> Thanks,
>> Prabhu Joseph
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: Can we use spark inside a web service?

2016-03-10 Thread Mark Hamstra
The fact that a typical Job requires multiple Tasks is not a problem, but
rather an opportunity for the Scheduler to interleave the workloads of
multiple concurrent Jobs across the available cores.

I work every day with such a production architecture with Spark on the user
request/response hot path.
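
For illustration, a sketch of several concurrent Jobs sharing one SparkContext, which is
the pattern such an architecture relies on (assuming a spark-shell session; the workload
is illustrative):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val data = sc.parallelize(1 to 10000000, 200)

// SparkContext is thread-safe for job submission: each Future below submits its own
// Job, and the scheduler apportions their Tasks across the available executor cores.
val jobs = (1 to 4).map { i =>
  Future { data.map(_ * i).sum() }
}

val results = Await.result(Future.sequence(jobs), 10.minutes)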

On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly <ch...@fregly.com> wrote:

> you are correct, mark.  i misspoke.  apologies for the confusion.
>
> so the problem is even worse given that a typical job requires multiple
> tasks/cores.
>
> i have yet to see this particular architecture work in production.  i
> would love for someone to prove otherwise.
>
> On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> For example, if you're looking to scale out to 1000 concurrent requests,
>>> this is 1000 concurrent Spark jobs.  This would require a cluster with 1000
>>> cores.
>>
>>
>> This doesn't make sense.  A Spark Job is a driver/DAGScheduler concept
>> without any 1:1 correspondence between Worker cores and Jobs.  Cores are
>> used to run Tasks, not Jobs.  So, yes, a 1000 core cluster can run at most
>> 1000 simultaneous Tasks, but that doesn't really tell you anything about
>> how many Jobs are or can be concurrently tracked by the DAGScheduler, which
>> will be apportioning the Tasks from those concurrent Jobs across the
>> available Executor cores.
>>
>> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly <ch...@fregly.com> wrote:
>>
>>> Good stuff, Evan.  Looks like this is utilizing the in-memory
>>> capabilities of FiloDB which is pretty cool.  looking forward to the
>>> webcast as I don't know much about FiloDB.
>>>
>>> My personal thoughts here are to remove Spark from the user
>>> request/response hot path.
>>>
>>> I can't tell you how many times i've had to unroll that architecture at
>>> clients - and replace with a real database like Cassandra, ElasticSearch,
>>> HBase, MySql.
>>>
>>> Unfortunately, Spark - and Spark Streaming, especially - lead you to
>>> believe that Spark could be used as an application server.  This is not a
>>> good use case for Spark.
>>>
>>> Remember that every job that is launched by Spark requires 1 CPU core,
>>> some memory, and an available Executor JVM to provide the CPU and memory.
>>>
>>> Yes, you can horizontally scale this because of the distributed nature
>>> of Spark, however it is not an efficient scaling strategy.
>>>
>>> For example, if you're looking to scale out to 1000 concurrent requests,
>>> this is 1000 concurrent Spark jobs.  This would require a cluster with 1000
>>> cores.  this is just not cost effective.
>>>
>>> Use Spark for what it's good for - ad-hoc, interactive, and iterative
>>> (machine learning, graph) analytics.  Use an application server for what
>>> it's good - managing a large amount of concurrent requests.  And use a
>>> database for what it's good for - storing/retrieving data.
>>>
>>> And any serious production deployment will need failover, throttling,
>>> back pressure, auto-scaling, and service discovery.
>>>
>>> While Spark supports these to varying levels of production-readiness,
>>> Spark is a batch-oriented system and not meant to be put on the user
>>> request/response hot path.
>>>
>>> For the failover, throttling, back pressure, autoscaling that i
>>> mentioned above, it's worth checking out the suite of Netflix OSS -
>>> particularly Hystrix, Eureka, Zuul, Karyon, etc:
>>> http://netflix.github.io/
>>>
>>> Here's my github project that incorporates a lot of these:
>>> https://github.com/cfregly/fluxcapacitor
>>>
>>> Here's a netflix Skunkworks github project that packages these up in
>>> Docker images:  https://github.com/Netflix-Skunkworks/zerotodocker
>>>
>>>
>>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github <velvia.git...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I just wrote a blog post which might be really useful to you -- I have
>>>> just
>>>> benchmarked being able to achieve 700 queries per second in Spark.  So,
>>>> yes,
>>>> web speed SQL queries are definitely possible.   Read my new blog post:
>>>>
>>>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>>>>
>>>> and feel free to email me (at vel...@gmail.com) if you would like to
>>>> follow up.

Re: Can we use spark inside a web service?

2016-03-10 Thread Mark Hamstra
>
> For example, if you're looking to scale out to 1000 concurrent requests,
> this is 1000 concurrent Spark jobs.  This would require a cluster with 1000
> cores.


This doesn't make sense.  A Spark Job is a driver/DAGScheduler concept
without any 1:1 correspondence between Worker cores and Jobs.  Cores are
used to run Tasks, not Jobs.  So, yes, a 1000 core cluster can run at most
1000 simultaneous Tasks, but that doesn't really tell you anything about
how many Jobs are or can be concurrently tracked by the DAGScheduler, which
will be apportioning the Tasks from those concurrent Jobs across the
available Executor cores.

On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly  wrote:

> Good stuff, Evan.  Looks like this is utilizing the in-memory capabilities
> of FiloDB which is pretty cool.  looking forward to the webcast as I don't
> know much about FiloDB.
>
> My personal thoughts here are to remove Spark from the user
> request/response hot path.
>
> I can't tell you how many times i've had to unroll that architecture at
> clients - and replace with a real database like Cassandra, ElasticSearch,
> HBase, MySql.
>
> Unfortunately, Spark - and Spark Streaming, especially - lead you to
> believe that Spark could be used as an application server.  This is not a
> good use case for Spark.
>
> Remember that every job that is launched by Spark requires 1 CPU core,
> some memory, and an available Executor JVM to provide the CPU and memory.
>
> Yes, you can horizontally scale this because of the distributed nature of
> Spark, however it is not an efficient scaling strategy.
>
> For example, if you're looking to scale out to 1000 concurrent requests,
> this is 1000 concurrent Spark jobs.  This would require a cluster with 1000
> cores.  this is just not cost effective.
>
> Use Spark for what it's good for - ad-hoc, interactive, and iterative
> (machine learning, graph) analytics.  Use an application server for what
> it's good - managing a large amount of concurrent requests.  And use a
> database for what it's good for - storing/retrieving data.
>
> And any serious production deployment will need failover, throttling, back
> pressure, auto-scaling, and service discovery.
>
> While Spark supports these to varying levels of production-readiness,
> Spark is a batch-oriented system and not meant to be put on the user
> request/response hot path.
>
> For the failover, throttling, back pressure, autoscaling that i mentioned
> above, it's worth checking out the suite of Netflix OSS - particularly
> Hystrix, Eureka, Zuul, Karyon, etc:  http://netflix.github.io/
>
> Here's my github project that incorporates a lot of these:
> https://github.com/cfregly/fluxcapacitor
>
> Here's a netflix Skunkworks github project that packages these up in
> Docker images:  https://github.com/Netflix-Skunkworks/zerotodocker
>
>
> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github 
> wrote:
>
>> Hi,
>>
>> I just wrote a blog post which might be really useful to you -- I have
>> just
>> benchmarked being able to achieve 700 queries per second in Spark.  So,
>> yes,
>> web speed SQL queries are definitely possible.   Read my new blog post:
>>
>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>>
>> and feel free to email me (at vel...@gmail.com) if you would like to
>> follow
>> up.
>>
>> -Evan
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
>


Re: Spark on RAID

2016-03-08 Thread Mark Hamstra
One issue is that RAID levels providing data replication are not necessary
since HDFS already replicates blocks on multiple nodes.

On Tue, Mar 8, 2016 at 8:45 AM, Alex Kozlov  wrote:

> Parallel disk IO?  But the effect should be less noticeable compared to
> Hadoop which reads/writes a lot.  Much depends on how often Spark persists
> on disk.  Depends on the specifics of the RAID controller as well.
>
> If you write to HDFS as opposed to local file system this may be a big
> factor as well.
>
> On Tue, Mar 8, 2016 at 8:34 AM, Eddie Esquivel  > wrote:
>
>> Hello All,
>> In the Spark documentation under "Hardware Requirements" it very clearly
>> states:
>>
>> We recommend having *4-8 disks* per node, configured *without* RAID
>> (just as separate mount points)
>>
>> My question is: why not RAID? What is the argument/reason for not using
>> RAID?
>>
>> Thanks!
>> -Eddie
>>
>
> --
> Alex Kozlov
>


Re: Understanding the Web_UI 4040

2016-03-07 Thread Mark Hamstra
There's probably nothing wrong other than a glitch in the reporting of
Executor state transitions to the UI -- one of those low-priority items
I've been meaning to look at for a while.

On Mon, Mar 7, 2016 at 12:15 AM, Sonal Goyal  wrote:

> Maybe check the worker logs to see what's going wrong with it?
> On Mar 7, 2016 9:10 AM, "Angel Angel"  wrote:
>
>> Hello Sir/Madam,
>>
>>
>> I am running the spark-sql application on the cluster.
>> In my cluster there are 3 slaves and one Master.
>>
>> When I check the progress of my application in the web UI at hadoopm0:8080,
>>
>> I find that one of my slave nodes is always in *LOADING* mode.
>>
>> Can you tell me what that means?
>>
>> Also I am unable to see the DAG graph (when I click on the DAG graph it hangs
>> the screen for some time).
>>
>> [image: Inline image 1]
>>
>


Re: Fair scheduler pool details

2016-03-02 Thread Mark Hamstra
If I'm understanding you correctly, then you are correct that the fair
scheduler doesn't currently do everything that you want to achieve.  Fair
scheduler pools currently can be configured with a minimum number of cores
that they will need before accepting Tasks, but there isn't a way to
restrict a pool to use no more than a certain number of cores.  That means
that a lower-priority pool can grab all of the cores as long as there is no
demand on high-priority pools, and then the higher-priority pools will have
to wait for the lower-priority pool to complete Tasks before the
higher-priority pools will be able to run Tasks.  That means that fair
scheduling pools really aren't a sufficient means to satisfy multi-tenancy
requirements or other scenarios where you want a guarantee that there will
always be some cores available to run a high-priority job.  There is a JIRA
issue and a PR out there to address some of this issue, and I've been
starting to come around to the notion that we should support a max cores
configuration for fair scheduler pools, but there is nothing like that
available right now.  Neither is there a way at the application level in a
standalone-mode cluster for one application to pre-empt another in order to
acquire its cores or other resources.  YARN does provide some support for
that, and Mesos may as well, so that is the closest option that I think
currently exists to satisfy your requirement.
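
For reference, a sketch of what is configurable today -- a guaranteed minimum share per
pool, but no maximum -- with an assumed allocation file path and illustrative pool names:

// conf/fairscheduler.xml (shown here as a comment):
//   <allocations>
//     <pool name="high-priority">
//       <schedulingMode>FAIR</schedulingMode>
//       <weight>4</weight>
//       <minShare>8</minShare>   <!-- guaranteed minimum cores; no maximum exists -->
//     </pool>
//     <pool name="low-priority">
//       <weight>1</weight>
//       <minShare>0</minShare>
//     </pool>
//   </allocations>

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")   // illustrative
  .setAppName("fair-pool-sketch")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "conf/fairscheduler.xml")
val sc = new SparkContext(conf)

// Jobs submitted from this thread land in the high-priority pool:
sc.setLocalProperty("spark.scheduler.pool", "high-priority")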

On Wed, Mar 2, 2016 at 6:20 PM, Eugene Morozov <evgeny.a.moro...@gmail.com>
wrote:

> Mark,
>
> I'm trying to configure spark cluster to share resources between two pools.
>
> I can do that by assigning minimal shares (it works fine), but that means a
> specific number of cores is going to be wasted by just being ready to run
> anything. While that's better than nothing, I'd like to specify a percentage
> of cores instead of a specific number of cores, as the cluster might change in
> size either up or down. Is there such an option?
>
> Also I haven't found anything about any sort of preemptive scheduler for
> standalone deployment (it is slightly mentioned in SPARK-9882, but it seems
> to be abandoned). Do you know if there is any such activity?
>
> --
> Be well!
> Jean Morozov
>
> On Sun, Feb 21, 2016 at 4:32 AM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> It's 2 -- and it's pretty hard to point to a line of code, a method, or
>> even a class since the scheduling of Tasks involves a pretty complex
>> interaction of several Spark components -- mostly the DAGScheduler,
>> TaskScheduler/TaskSchedulerImpl, TaskSetManager, Schedulable and Pool, as
>> well as the SchedulerBackend (CoarseGrainedSchedulerBackend in this case.)
>>  The key thing to understand, though, is the comment at the top of
>> SchedulerBackend.scala: "A backend interface for scheduling systems that
>> allows plugging in different ones under TaskSchedulerImpl. We assume a
>> Mesos-like model where the application gets resource offers as machines
>> become available and can launch tasks on them."  In other words, the whole
>> scheduling system is built on a model that starts with offers made by
>> workers when resources are available to run Tasks.  Other than the big
>> hammer of canceling a Job while interruptOnCancel is true, there isn't
>> really any facility for stopping or rescheduling Tasks that are already
>> started, so that rules out your option 1.  Similarly, option 3 is out
>> because the scheduler doesn't know when Tasks will complete; it just knows
>> when a new offer comes in and it is time to send more Tasks to be run on
>> the machine making the offer.
>>
>> What actually happens is that the Pool with which a Job is associated
>> maintains a queue of TaskSets needing to be scheduled.  When in
>> resourceOffers the TaskSchedulerImpl needs sortedTaskSets, the Pool
>> supplies those from its scheduling queue after first sorting it according
>> to the Pool's taskSetSchedulingAlgorithm.  In other words, what Spark's
>> fair scheduling does in essence is, in response to worker resource offers,
>> to send new Tasks to be run; those Tasks are taken in sets from the queue
>> of waiting TaskSets, sorted according to a scheduling algorithm.  There is
>> no pre-emption or rescheduling of Tasks that the scheduler has already sent
>> to the workers, nor is there any attempt to anticipate when already running
>> Tasks will complete.
>>
>>
>> On Sat, Feb 20, 2016 at 4:14 PM, Eugene Morozov <
>> evgeny.a.moro...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to understand how this thing works underneath. Let's say I
>>> have two types of jobs - highly important ones that might use a small number of
>>>

Re: Fair scheduler pool details

2016-02-20 Thread Mark Hamstra
It's 2 -- and it's pretty hard to point to a line of code, a method, or
even a class since the scheduling of Tasks involves a pretty complex
interaction of several Spark components -- mostly the DAGScheduler,
TaskScheduler/TaskSchedulerImpl, TaskSetManager, Schedulable and Pool, as
well as the SchedulerBackend (CoarseGrainedSchedulerBackend in this case.)
 The key thing to understand, though, is the comment at the top of
SchedulerBackend.scala: "A backend interface for scheduling systems that
allows plugging in different ones under TaskSchedulerImpl. We assume a
Mesos-like model where the application gets resource offers as machines
become available and can launch tasks on them."  In other words, the whole
scheduling system is built on a model that starts with offers made by
workers when resources are available to run Tasks.  Other than the big
hammer of canceling a Job while interruptOnCancel is true, there isn't
really any facility for stopping or rescheduling Tasks that are already
started, so that rules out your option 1.  Similarly, option 3 is out
because the scheduler doesn't know when Tasks will complete; it just knows
when a new offer comes in and it is time to send more Tasks to be run on
the machine making the offer.

What actually happens is that the Pool with which a Job is associated
maintains a queue of TaskSets needing to be scheduled.  When in
resourceOffers the TaskSchedulerImpl needs sortedTaskSets, the Pool
supplies those from its scheduling queue after first sorting it according
to the Pool's taskSetSchedulingAlgorithm.  In other words, what Spark's
fair scheduling does in essence is, in response to worker resource offers,
to send new Tasks to be run; those Tasks are taken in sets from the queue
of waiting TaskSets, sorted according to a scheduling algorithm.  There is
no pre-emption or rescheduling of Tasks that the scheduler has already sent
to the workers, nor is there any attempt to anticipate when already running
Tasks will complete.


On Sat, Feb 20, 2016 at 4:14 PM, Eugene Morozov 
wrote:

> Hi,
>
> I'm trying to understand how this thing works underneath. Let's say I have
> two types of jobs - highly important ones that might use a small number of
> cores and have to run pretty fast, and less important but greedy ones that
> use as many cores as are available. So, the idea is to use two corresponding
> pools.
>
> The thing I'm trying to understand is the following.
> I use a standalone Spark deployment (no YARN, no Mesos).
> Let's say that the less important job took all the cores and then someone
> runs a highly important job. Then I see three possibilities:
> 1. Spark kills some executors that are currently running less important
> partitions and assigns them to the highly important job.
> 2. Spark waits until some partitions of the less important job have been
> completely processed, and then the first executors that become free are
> assigned to process the highly important job.
> 3. Spark figures out the specific time when particular stages of partitions
> of the less important job are done, and instead of continuing with this job,
> those executors are reassigned to the highly important one.
>
> Which one is it? Could you please point me to a class / method / line of
> code?
> --
> Be well!
> Jean Morozov
>


Re: Creating HiveContext in Spark-Shell fails

2016-02-15 Thread Mark Hamstra
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/



Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_72)

Type in expressions to have them evaluated.

Type :help for more information.


scala> sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext]

res0: Boolean = true



On Mon, Feb 15, 2016 at 8:51 PM, Prabhu Joseph 
wrote:

> Hi All,
>
> Creating a HiveContext in spark-shell fails with
>
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted
> the database /SPARK/metastore_db.
>
> Spark-Shell already has created metastore_db for SqlContext.
>
> Spark context available as sc.
> SQL context available as sqlContext.
>
> But without HiveContext, I am able to query the data using SqlContext.
>
> scala>  var df =
> sqlContext.read.format("com.databricks.spark.csv").option("header",
> "true").option("inferSchema", "true").load("/SPARK/abc")
> df: org.apache.spark.sql.DataFrame = [Prabhu: string, Joseph: string]
>
> So is there any real need for HiveContext inside the Spark shell? Is
> everything that can be done with HiveContext achievable with SqlContext
> inside the Spark shell?
>
>
>
> Thanks,
> Prabhu Joseph
>
>
>
>
>

