Possible "split brain" situation

2017-11-12 Thread Gimantha Bandara
Hi all,

We are using embedded Spark 1.6.2 in our analytics platform[1]. For
cluster communication we use Hazelcast's clustering capabilities. On the
Hazelcast side we set the following properties to configure the
heartbeat behaviour.

hazelcast.max.no.heartbeat.seconds=30
hazelcast.max.no.master.confirmation.seconds=45

So, if there is a network failure lasting more than 30 seconds, Hazelcast
treats it as a network outage, and each node sees the other node leave
the cluster (in a 2-node cluster). As a result, the currently active node
stays active and the passive master also becomes active. The following are
the respective messages printed by Hazelcast.

*node 1:*
TID: [-1] [] [2017-11-05 08:59:14,071]  INFO
{org.wso2.carbon.core.clustering.hazelcast.wka.WKABasedMembershipScheme} -
Member left [ff9788ad-8570-465c-846a-82393c636ae5]: /10.36.239.70:4000
{org.wso2.carbon.core.clustering.hazelcast.wka.WKABasedMembershipScheme}

*node 2:*
TID: [-1] [] [2017-11-05 09:00:40,357]  INFO
{org.wso2.carbon.core.clustering.hazelcast.wka.WKABasedMembershipScheme} -
Member left [d02e49f5-78ab-4bfb-ba2c-df24f7e5c058]: /10.36.239.67:4000
{org.wso2.carbon.core.clustering.hazelcast.wka.WKABasedMembershipScheme}
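
To make the failure mode above concrete, here is a minimal sketch in plain Python of heartbeat-timeout detection combined with a majority (quorum) check. It is not Hazelcast code; the `MemberView` class and both function names are hypothetical, and only the 30-second timeout is taken from the configuration above. It shows why a 2-node cluster cannot tell "the peer died" apart from "the network is partitioned": after the timeout each side sees only itself, and neither side holds a strict majority.

```python
from dataclasses import dataclass


@dataclass
class MemberView:
    """What one node currently believes about the cluster (hypothetical model)."""
    cluster_size: int      # configured number of members
    last_heartbeat: dict   # peer address -> time of last heartbeat, in seconds


def reachable_peers(view, now, max_no_heartbeat_seconds=30):
    """Peers whose last heartbeat is still within the timeout window."""
    return [peer for peer, seen in view.last_heartbeat.items()
            if now - seen <= max_no_heartbeat_seconds]


def has_quorum(view, now, max_no_heartbeat_seconds=30):
    """True if this node plus its reachable peers form a strict majority."""
    visible = 1 + len(reachable_peers(view, now, max_no_heartbeat_seconds))
    return visible > view.cluster_size // 2


# Node 1's view of a 2-node cluster; the peer last answered at t=100s.
view = MemberView(cluster_size=2, last_heartbeat={"10.36.239.70:4000": 100.0})
print(has_quorum(view, now=120.0))  # peer still within the 30s window
print(has_quorum(view, now=131.0))  # peer timed out: this node is alone
```

With three or more members, the side of a partition that still holds a majority could stay active while the minority side steps down. With two members, no side can ever hold a majority on its own, so without some extra tie-breaker both nodes end up active.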


When the network recovers, we see the following error in Spark:

ERROR {org.apache.spark.deploy.worker.Worker} -  Worker registration
failed: Duplicate worker ID {org.apache.spark.deploy.worker.Worker}
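
One plausible reading of this error (an assumption, not something confirmed in this thread) is that the master side still holds a registry entry for the worker from before the outage, so a re-registration with the same worker ID is rejected. The toy model below illustrates that mechanism in plain Python. `ToyMaster` and its methods are hypothetical names, and this is not Spark's actual code.

```python
class ToyMaster:
    """Toy stand-in for a master that tracks workers by ID (hypothetical)."""

    def __init__(self):
        self.workers = {}  # worker_id -> address

    def register(self, worker_id, address):
        # A second registration under an ID that is still known is refused.
        if worker_id in self.workers:
            return False, "Duplicate worker ID"
        self.workers[worker_id] = address
        return True, "registered"

    def remove(self, worker_id):
        # Forget a worker, e.g. once the master has decided it is gone.
        self.workers.pop(worker_id, None)


master = ToyMaster()
print(master.register("worker-1", "10.36.239.70:4000"))
# The worker comes back after the outage and registers again with the same ID,
# but the master never removed the old entry:
print(master.register("worker-1", "10.36.239.70:4000"))
```

In this model the second registration only succeeds if the stale entry is removed first, or if the returning worker uses a fresh ID.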


This also causes our server to shut down (please note that we spawn a Spark
JVM from our server). Please see the following shutdown message.

INFO {org.wso2.carbon.core.init.CarbonServerManager} -  Shutdown hook
triggered {org.wso2.carbon.core.init.CarbonServerManager}

Do you have any idea what happens to a Spark cluster when there
is a network outage, and how it handles split-brain situations? Is there
a way to restore the previous cluster state once both nodes join the
cluster again? Also, I am curious to know why the shutdown hook is triggered.

Thanks in advance!
Gimantha


Re: Spark based Data Warehouse

2017-11-12 Thread Patrick Alwell
Alcon,

You can most certainly do this. I’ve done benchmarking with Spark SQL and the 
TPC-DS queries using S3 as the filesystem.

Zeppelin and Livy server work well for the dashboarding and concurrent query 
issues: https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/

Livy Server will allow you to create multiple spark contexts via REST: 
https://livy.incubator.apache.org/
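
As a rough sketch of what "multiple Spark contexts via REST" looks like from the client side, the snippet below only builds the HTTP requests for creating a Livy session and submitting a statement to it, using the Python standard library. The base URL `http://localhost:8998` is an assumption (it is Livy's default port, but your host will differ), and the helper names are hypothetical.

```python
import json
from urllib.request import Request

LIVY_URL = "http://localhost:8998"  # assumption: a Livy server on its default port


def new_session_request(kind="spark", base_url=LIVY_URL):
    """Build the POST /sessions request that asks Livy for a new session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return Request(f"{base_url}/sessions", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")


def statement_request(session_id, code, base_url=LIVY_URL):
    """Build the POST /sessions/{id}/statements request that runs code."""
    body = json.dumps({"code": code}).encode("utf-8")
    return Request(f"{base_url}/sessions/{session_id}/statements", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")


req = new_session_request()
print(req.get_method(), req.full_url)
```

Each request can be sent with `urllib.request.urlopen(req)` against a running server. Every session gets its own Spark context, so one user's queries run separately from another's.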

If you are looking for broad SQL functionality I’d recommend instantiating a 
Hive context. Spark is also able to spill to disk: 
https://spark.apache.org/faq.html

There are multiple companies running spark within their data warehouse 
solutions: 
https://ibmdatawarehousing.wordpress.com/2016/10/12/steinbach_dashdb_local_spark/

Edmunds used Spark to allow business analysts to point Spark to files in S3 and 
infer schema: https://www.youtube.com/watch?v=gsR1ljgZLq0

I recommend running some benchmarks and testing query scenarios for your end 
users, but it sounds like you’ll be using it for exploratory analysis. Spark is 
great for this ☺

-Pat


From: Vadim Semenov
Date: Sunday, November 12, 2017 at 1:06 PM
To: Gourav Sengupta
Cc: Phillip Henry, ashish rawat, Jörn Franke, Deepak Sharma, spark users
Subject: Re: Spark based Data Warehouse

It's actually quite simple to answer

> 1. Is Spark SQL and UDF, able to handle all the workloads?
Yes

> 2. What user interface did you provide for data scientist, data engineers and 
> analysts
Home-grown platform, EMR, Zeppelin

> What are the challenges in running concurrent queries, by many users, over 
> Spark SQL? Considering Spark still does not provide spill to disk, in many 
> scenarios, are there frequent query failures when executing concurrent queries
You can run separate Spark Contexts, so jobs will be isolated

> Are there any open source implementations, which provide something similar?
Yes, many.


On Sun, Nov 12, 2017 at 1:47 PM, Gourav Sengupta wrote:
Dear Ashish,
what you are asking for involves at least a few weeks of dedicated 
understanding of your use case and then it takes at least 3 to 4 months to 
even propose a solution. You can even build a fantastic data warehouse just 
using C++. The matter depends on lots of conditions. I just think that your 
approach and question needs a lot of modification.

Regards,
Gourav

On Sun, Nov 12, 2017 at 6:19 PM, Phillip Henry wrote:
Hi, Ashish.
You are correct in saying that not *all* functionality of Spark is 
spill-to-disk but I am not sure how this pertains to a "concurrent user 
scenario". Each executor will run in its own JVM and is therefore isolated from 
others. That is, if the JVM of one user dies, this should not affect another 
user who is running their own jobs in their own JVMs. The amount of resources 
used by a user can be controlled by the resource manager.
AFAIK, you configure something like YARN to limit the number of cores and the 
amount of memory in the cluster a certain user or group is allowed to use for 
their job. This is obviously quite a coarse-grained approach as (to my 
knowledge) IO is not throttled. I believe people generally use something like 
Apache Ambari to keep an eye on network and disk usage to mitigate problems in 
a shared cluster.

If the user has badly designed their query, it may very well fail with OOMEs 
but this can happen irrespective of whether one user or many is using the 
cluster at a given moment in time.

Does this help?
Regards,
Phillip

On Sun, Nov 12, 2017 at 5:50 PM, ashish rawat wrote:
Thanks Jorn and Phillip. My question was specifically to anyone who has tried 
creating a system using spark SQL, as Data Warehouse. I was trying to check, if 
someone has tried it and they can help with the kind of workloads which worked 
and the ones, which have problems.

Regarding spill to disk, I might be wrong but not all functionality of spark is 
spill to disk. So it still doesn't provide DB like reliability in execution. In 
case of DBs, queries get slow but they don't fail or go out of memory, 
specifically in concurrent user scenarios.

Regards,
Ashish

On Nov 12, 2017 3:02 PM, "Phillip Henry" wrote:
Agree with Jorn. The answer is: it depends.

In the past, I've worked with data scientists who are happy to use the Spark 
CLI. Again, the answer is "it depends" (in this case, on the skills of your 
customers).
Regarding sharing resources, different teams were limited to their own queue so 
they could not hog all the resources. However, people within a team had to do 
some horse trading if they had a particularly intensive job to run. I did feel 
that this was an area that could be improved. It may be by now, I've just not 
looked into it for a while.
BTW I'm not sure what you mean by "Spark still does not provide spill to disk" 
as the FAQ says "Spark's operators spill data to disk if it does not fit in 
memory" (http://spark.ap

Re: Spark based Data Warehouse

2017-11-12 Thread Vadim Semenov
It's actually quite simple to answer

> 1. Is Spark SQL and UDF, able to handle all the workloads?
Yes

> 2. What user interface did you provide for data scientist, data engineers
and analysts
Home-grown platform, EMR, Zeppelin

> What are the challenges in running concurrent queries, by many users,
over Spark SQL? Considering Spark still does not provide spill to disk, in
many scenarios, are there frequent query failures when executing concurrent
queries
You can run separate Spark Contexts, so jobs will be isolated

> Are there any open source implementations, which provide something
similar?
Yes, many.


On Sun, Nov 12, 2017 at 1:47 PM, Gourav Sengupta 
wrote:

> Dear Ashish,
> what you are asking for involves at least a few weeks of dedicated
> understanding of your use case and then it takes at least 3 to 4 months to
> even propose a solution. You can even build a fantastic data warehouse just
> using C++. The matter depends on lots of conditions. I just think that your
> approach and question needs a lot of modification.
>
> Regards,
> Gourav
>
> On Sun, Nov 12, 2017 at 6:19 PM, Phillip Henry 
> wrote:
>
>> Hi, Ashish.
>>
>> You are correct in saying that not *all* functionality of Spark is
>> spill-to-disk but I am not sure how this pertains to a "concurrent user
>> scenario". Each executor will run in its own JVM and is therefore isolated
>> from others. That is, if the JVM of one user dies, this should not affect
>> another user who is running their own jobs in their own JVMs. The amount of
>> resources used by a user can be controlled by the resource manager.
>>
>> AFAIK, you configure something like YARN to limit the number of cores and
>> the amount of memory in the cluster a certain user or group is allowed to
>> use for their job. This is obviously quite a coarse-grained approach as (to
>> my knowledge) IO is not throttled. I believe people generally use something
>> like Apache Ambari to keep an eye on network and disk usage to mitigate
>> problems in a shared cluster.
>>
>> If the user has badly designed their query, it may very well fail with
>> OOMEs but this can happen irrespective of whether one user or many is using
>> the cluster at a given moment in time.
>>
>> Does this help?
>>
>> Regards,
>>
>> Phillip
>>
>>
>> On Sun, Nov 12, 2017 at 5:50 PM, ashish rawat 
>> wrote:
>>
>>> Thanks Jorn and Phillip. My question was specifically to anyone who has
>>> tried creating a system using spark SQL, as Data Warehouse. I was trying to
>>> check, if someone has tried it and they can help with the kind of workloads
>>> which worked and the ones, which have problems.
>>>
>>> Regarding spill to disk, I might be wrong but not all functionality of
>>> spark is spill to disk. So it still doesn't provide DB like reliability in
>>> execution. In case of DBs, queries get slow but they don't fail or go out
>>> of memory, specifically in concurrent user scenarios.
>>>
>>> Regards,
>>> Ashish
>>>
>>> On Nov 12, 2017 3:02 PM, "Phillip Henry" 
>>> wrote:
>>>
>>> Agree with Jorn. The answer is: it depends.
>>>
>>> In the past, I've worked with data scientists who are happy to use the
>>> Spark CLI. Again, the answer is "it depends" (in this case, on the skills
>>> of your customers).
>>>
>>> Regarding sharing resources, different teams were limited to their own
>>> queue so they could not hog all the resources. However, people within a
>>> team had to do some horse trading if they had a particularly intensive job
>>> to run. I did feel that this was an area that could be improved. It may be
>>> by now, I've just not looked into it for a while.
>>>
>>> BTW I'm not sure what you mean by "Spark still does not provide spill to
>>> disk" as the FAQ says "Spark's operators spill data to disk if it does not
>>> fit in memory" (http://spark.apache.org/faq.html). So, your data will
>>> not normally cause OutOfMemoryErrors (certain terms and conditions may
>>> apply).
>>>
>>> My 2 cents.
>>>
>>> Phillip
>>>
>>>
>>>
>>> On Sun, Nov 12, 2017 at 9:14 AM, Jörn Franke 
>>> wrote:
>>>
>>>> What do you mean all possible workloads?
>>>> You cannot prepare any system to do all possible processing.
>>>>
>>>> We do not know the requirements of your data scientists now or in the
>>>> future so it is difficult to say. How do they work currently without the
>>>> new solution? Do they all work on the same data? I bet you will receive on
>>>> your email a lot of private messages trying to sell their solution that
>>>> solves everything - with the information you provided this is impossible to
>>>> say.
>>>>
>>>> Then with every system: have incremental releases but have them in
>>>> short time frames - do not engineer a big system that you will deliver in 2
>>>> years. In the cloud you have the perfect possibility to scale feature but
>>>> also infrastructure wise.
>>>>
>>>> Challenges with concurrent queries is the right definition of the
>>>> scheduler (eg fairscheduler) that not one query take all the resources or
>>>> that long 

Re: Spark based Data Warehouse

2017-11-12 Thread Gourav Sengupta
Dear Ashish,
what you are asking for involves at least a few weeks of dedicated
study of your use case, and then it takes at least 3 to 4 months to
even propose a solution. You can even build a fantastic data warehouse just
using C++. The matter depends on a lot of conditions. I just think that your
approach and question need a lot of modification.

Regards,
Gourav

On Sun, Nov 12, 2017 at 6:19 PM, Phillip Henry 
wrote:

> Hi, Ashish.
>
> You are correct in saying that not *all* functionality of Spark is
> spill-to-disk but I am not sure how this pertains to a "concurrent user
> scenario". Each executor will run in its own JVM and is therefore isolated
> from others. That is, if the JVM of one user dies, this should not affect
> another user who is running their own jobs in their own JVMs. The amount of
> resources used by a user can be controlled by the resource manager.
>
> AFAIK, you configure something like YARN to limit the number of cores and
> the amount of memory in the cluster a certain user or group is allowed to
> use for their job. This is obviously quite a coarse-grained approach as (to
> my knowledge) IO is not throttled. I believe people generally use something
> like Apache Ambari to keep an eye on network and disk usage to mitigate
> problems in a shared cluster.
>
> If the user has badly designed their query, it may very well fail with
> OOMEs but this can happen irrespective of whether one user or many is using
> the cluster at a given moment in time.
>
> Does this help?
>
> Regards,
>
> Phillip
>
>
> On Sun, Nov 12, 2017 at 5:50 PM, ashish rawat  wrote:
>
>> Thanks Jorn and Phillip. My question was specifically to anyone who has
>> tried creating a system using spark SQL, as Data Warehouse. I was trying to
>> check, if someone has tried it and they can help with the kind of workloads
>> which worked and the ones, which have problems.
>>
>> Regarding spill to disk, I might be wrong but not all functionality of
>> spark is spill to disk. So it still doesn't provide DB like reliability in
>> execution. In case of DBs, queries get slow but they don't fail or go out
>> of memory, specifically in concurrent user scenarios.
>>
>> Regards,
>> Ashish
>>
>> On Nov 12, 2017 3:02 PM, "Phillip Henry"  wrote:
>>
>> Agree with Jorn. The answer is: it depends.
>>
>> In the past, I've worked with data scientists who are happy to use the
>> Spark CLI. Again, the answer is "it depends" (in this case, on the skills
>> of your customers).
>>
>> Regarding sharing resources, different teams were limited to their own
>> queue so they could not hog all the resources. However, people within a
>> team had to do some horse trading if they had a particularly intensive job
>> to run. I did feel that this was an area that could be improved. It may be
>> by now, I've just not looked into it for a while.
>>
>> BTW I'm not sure what you mean by "Spark still does not provide spill to
>> disk" as the FAQ says "Spark's operators spill data to disk if it does not
>> fit in memory" (http://spark.apache.org/faq.html). So, your data will
>> not normally cause OutOfMemoryErrors (certain terms and conditions may
>> apply).
>>
>> My 2 cents.
>>
>> Phillip
>>
>>
>>
>> On Sun, Nov 12, 2017 at 9:14 AM, Jörn Franke 
>> wrote:
>>
>>> What do you mean all possible workloads?
>>> You cannot prepare any system to do all possible processing.
>>>
>>> We do not know the requirements of your data scientists now or in the
>>> future so it is difficult to say. How do they work currently without the
>>> new solution? Do they all work on the same data? I bet you will receive on
>>> your email a lot of private messages trying to sell their solution that
>>> solves everything - with the information you provided this is impossible to
>>> say.
>>>
>>> Then with every system: have incremental releases but have them in short
>>> time frames - do not engineer a big system that you will deliver in 2
>>> years. In the cloud you have the perfect possibility to scale feature but
>>> also infrastructure wise.
>>>
>>> Challenges with concurrent queries is the right definition of the
>>> scheduler (eg fairscheduler) that not one query take all the resources or
>>> that long running queries starve.
>>>
>>> User interfaces: what could help are notebooks (Jupyter etc) but you may
>>> need to train your data scientists. Some may know or prefer other tools.
>>>
>>> On 12. Nov 2017, at 08:32, Deepak Sharma  wrote:
>>>
>>> I am looking for similar solution more aligned to data scientist group.
>>> The concern i have is about supporting complex aggregations at runtime .
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Nov 12, 2017 12:51, "ashish rawat"  wrote:
>>>
>>>> Hello Everyone,
>>>>
>>>> I was trying to understand if anyone here has tried a data warehouse
>>>> solution using S3 and Spark SQL. Out of multiple possible options
>>>> (redshift, presto, hive etc), we were planning to go with Spark SQL, for
>>>> our aggregates and processing requirements.
>>>>
>>>> If 

Re: Spark based Data Warehouse

2017-11-12 Thread Phillip Henry
Hi, Ashish.

You are correct in saying that not *all* functionality of Spark is
spill-to-disk but I am not sure how this pertains to a "concurrent user
scenario". Each executor will run in its own JVM and is therefore isolated
from others. That is, if the JVM of one user dies, this should not affect
another user who is running their own jobs in their own JVMs. The amount of
resources used by a user can be controlled by the resource manager.

AFAIK, you configure something like YARN to limit the number of cores and
the amount of memory in the cluster a certain user or group is allowed to
use for their job. This is obviously quite a coarse-grained approach as (to
my knowledge) IO is not throttled. I believe people generally use something
like Apache Ambari to keep an eye on network and disk usage to mitigate
problems in a shared cluster.
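
As a rough illustration of the job side of this, the sketch below assembles a `spark-submit` command line that targets a YARN queue and asks for a fixed number of executors, cores and memory. The queue name `analysts`, the script name `report.py` and the helper function are hypothetical. The per-queue caps themselves live in the YARN scheduler configuration maintained by the cluster administrators and are not shown here.

```python
import shlex


def spark_submit_command(app, queue, num_executors, executor_cores,
                         executor_memory):
    """Assemble a spark-submit argument list for a job on a given YARN queue."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--queue", queue,                        # YARN queue the job is charged to
        "--num-executors", str(num_executors),   # executors requested
        "--executor-cores", str(executor_cores), # cores per executor
        "--executor-memory", executor_memory,    # heap per executor, e.g. "4g"
        app,
    ]


cmd = spark_submit_command("report.py", "analysts", 4, 2, "4g")
print(shlex.join(cmd))
```

The request a job makes here is still bounded by whatever the queue allows, which is what keeps one team from taking the whole cluster.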

If the user has badly designed their query, it may very well fail with
OOMEs but this can happen irrespective of whether one user or many is using
the cluster at a given moment in time.

Does this help?

Regards,

Phillip


On Sun, Nov 12, 2017 at 5:50 PM, ashish rawat  wrote:

> Thanks Jorn and Phillip. My question was specifically to anyone who has
> tried creating a system using spark SQL, as Data Warehouse. I was trying to
> check, if someone has tried it and they can help with the kind of workloads
> which worked and the ones, which have problems.
>
> Regarding spill to disk, I might be wrong but not all functionality of
> spark is spill to disk. So it still doesn't provide DB like reliability in
> execution. In case of DBs, queries get slow but they don't fail or go out
> of memory, specifically in concurrent user scenarios.
>
> Regards,
> Ashish
>
> On Nov 12, 2017 3:02 PM, "Phillip Henry"  wrote:
>
> Agree with Jorn. The answer is: it depends.
>
> In the past, I've worked with data scientists who are happy to use the
> Spark CLI. Again, the answer is "it depends" (in this case, on the skills
> of your customers).
>
> Regarding sharing resources, different teams were limited to their own
> queue so they could not hog all the resources. However, people within a
> team had to do some horse trading if they had a particularly intensive job
> to run. I did feel that this was an area that could be improved. It may be
> by now, I've just not looked into it for a while.
>
> BTW I'm not sure what you mean by "Spark still does not provide spill to
> disk" as the FAQ says "Spark's operators spill data to disk if it does not
> fit in memory" (http://spark.apache.org/faq.html). So, your data will not
> normally cause OutOfMemoryErrors (certain terms and conditions may apply).
>
> My 2 cents.
>
> Phillip
>
>
>
> On Sun, Nov 12, 2017 at 9:14 AM, Jörn Franke  wrote:
>
>> What do you mean all possible workloads?
>> You cannot prepare any system to do all possible processing.
>>
>> We do not know the requirements of your data scientists now or in the
>> future so it is difficult to say. How do they work currently without the
>> new solution? Do they all work on the same data? I bet you will receive on
>> your email a lot of private messages trying to sell their solution that
>> solves everything - with the information you provided this is impossible to
>> say.
>>
>> Then with every system: have incremental releases but have them in short
>> time frames - do not engineer a big system that you will deliver in 2
>> years. In the cloud you have the perfect possibility to scale feature but
>> also infrastructure wise.
>>
>> Challenges with concurrent queries is the right definition of the
>> scheduler (eg fairscheduler) that not one query take all the resources or
>> that long running queries starve.
>>
>> User interfaces: what could help are notebooks (Jupyter etc) but you may
>> need to train your data scientists. Some may know or prefer other tools.
>>
>> On 12. Nov 2017, at 08:32, Deepak Sharma  wrote:
>>
>> I am looking for similar solution more aligned to data scientist group.
>> The concern i have is about supporting complex aggregations at runtime .
>>
>> Thanks
>> Deepak
>>
>> On Nov 12, 2017 12:51, "ashish rawat"  wrote:
>>
>>> Hello Everyone,
>>>
>>> I was trying to understand if anyone here has tried a data warehouse
>>> solution using S3 and Spark SQL. Out of multiple possible options
>>> (redshift, presto, hive etc), we were planning to go with Spark SQL, for
>>> our aggregates and processing requirements.
>>>
>>> If anyone has tried it out, would like to understand the following:
>>>
>>>1. Is Spark SQL and UDF, able to handle all the workloads?
>>>2. What user interface did you provide for data scientist, data
>>>engineers and analysts
>>>3. What are the challenges in running concurrent queries, by many
>>>users, over Spark SQL? Considering Spark still does not provide spill to
>>>disk, in many scenarios, are there frequent query failures when executing
>>>concurrent queries
>>>4. Are there any open source implementations, which provide
>>>  

Re: Spark based Data Warehouse

2017-11-12 Thread ashish rawat
Thanks Jorn and Phillip. My question was specifically to anyone who has
tried building a data warehouse with Spark SQL. I wanted to check whether
someone has tried it and can share which kinds of workloads worked and
which ones had problems.

Regarding spill to disk, I might be wrong, but not all functionality of
Spark spills to disk. So it still doesn't provide DB-like reliability in
execution. With DBs, queries get slow but they don't fail or go out
of memory, specifically in concurrent user scenarios.

Regards,
Ashish

On Nov 12, 2017 3:02 PM, "Phillip Henry"  wrote:

Agree with Jorn. The answer is: it depends.

In the past, I've worked with data scientists who are happy to use the
Spark CLI. Again, the answer is "it depends" (in this case, on the skills
of your customers).

Regarding sharing resources, different teams were limited to their own
queue so they could not hog all the resources. However, people within a
team had to do some horse trading if they had a particularly intensive job
to run. I did feel that this was an area that could be improved. It may be
by now, I've just not looked into it for a while.

BTW I'm not sure what you mean by "Spark still does not provide spill to
disk" as the FAQ says "Spark's operators spill data to disk if it does not
fit in memory" (http://spark.apache.org/faq.html). So, your data will not
normally cause OutOfMemoryErrors (certain terms and conditions may apply).

My 2 cents.

Phillip



On Sun, Nov 12, 2017 at 9:14 AM, Jörn Franke  wrote:

> What do you mean all possible workloads?
> You cannot prepare any system to do all possible processing.
>
> We do not know the requirements of your data scientists now or in the
> future so it is difficult to say. How do they work currently without the
> new solution? Do they all work on the same data? I bet you will receive on
> your email a lot of private messages trying to sell their solution that
> solves everything - with the information you provided this is impossible to
> say.
>
> Then with every system: have incremental releases but have them in short
> time frames - do not engineer a big system that you will deliver in 2
> years. In the cloud you have the perfect possibility to scale feature but
> also infrastructure wise.
>
> Challenges with concurrent queries is the right definition of the
> scheduler (eg fairscheduler) that not one query take all the resources or
> that long running queries starve.
>
> User interfaces: what could help are notebooks (Jupyter etc) but you may
> need to train your data scientists. Some may know or prefer other tools.
>
> On 12. Nov 2017, at 08:32, Deepak Sharma  wrote:
>
> I am looking for similar solution more aligned to data scientist group.
> The concern i have is about supporting complex aggregations at runtime .
>
> Thanks
> Deepak
>
> On Nov 12, 2017 12:51, "ashish rawat"  wrote:
>
>> Hello Everyone,
>>
>> I was trying to understand if anyone here has tried a data warehouse
>> solution using S3 and Spark SQL. Out of multiple possible options
>> (redshift, presto, hive etc), we were planning to go with Spark SQL, for
>> our aggregates and processing requirements.
>>
>> If anyone has tried it out, would like to understand the following:
>>
>>1. Is Spark SQL and UDF, able to handle all the workloads?
>>2. What user interface did you provide for data scientist, data
>>engineers and analysts
>>3. What are the challenges in running concurrent queries, by many
>>users, over Spark SQL? Considering Spark still does not provide spill to
>>disk, in many scenarios, are there frequent query failures when executing
>>concurrent queries
>>4. Are there any open source implementations, which provide something
>>similar?
>>
>>
>> Regards,
>> Ashish
>>
>


Re: Spark based Data Warehouse

2017-11-12 Thread Phillip Henry
Agree with Jorn. The answer is: it depends.

In the past, I've worked with data scientists who are happy to use the
Spark CLI. Again, the answer is "it depends" (in this case, on the skills
of your customers).

Regarding sharing resources, different teams were limited to their own
queue so they could not hog all the resources. However, people within a
team had to do some horse trading if they had a particularly intensive job
to run. I did feel that this was an area that could be improved. It may
have been by now; I've just not looked into it for a while.

BTW I'm not sure what you mean by "Spark still does not provide spill to
disk" as the FAQ says "Spark's operators spill data to disk if it does not
fit in memory" (http://spark.apache.org/faq.html). So, your data will not
normally cause OutOfMemoryErrors (certain terms and conditions may apply).
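
To illustrate the general idea of an operator that spills to disk, here is a toy external sort in plain Python. It is only a sketch of the technique, not Spark's implementation; the function names and the `max_in_memory` threshold are hypothetical. Whenever the in-memory buffer fills up, it is sorted and written to a temporary file, and at the end all spilled runs are merged lazily.

```python
import heapq
import pickle
import tempfile


def _spill(sorted_chunk):
    """Write one sorted chunk to a temporary file and return the open handle."""
    f = tempfile.TemporaryFile()
    for item in sorted_chunk:
        pickle.dump(item, f)
    f.seek(0)
    return f


def _read(f):
    """Yield the items of a spilled chunk back, in the order they were written."""
    try:
        while True:
            yield pickle.load(f)
    except EOFError:
        f.close()


def external_sort(items, max_in_memory=1000):
    """Sort an iterable while buffering roughly max_in_memory items at a time."""
    runs, buffer = [], []
    for item in items:
        buffer.append(item)
        if len(buffer) >= max_in_memory:
            runs.append(_spill(sorted(buffer)))  # buffer full: spill to disk
            buffer = []
    # Merge the spilled runs with whatever is still in memory.
    return heapq.merge(sorted(buffer), *(_read(f) for f in runs))


print(list(external_sort([5, 3, 9, 1, 4, 8, 2], max_in_memory=3)))
```

The job gets slower as more runs go to disk, but memory use stays bounded. Operations that have no such fallback are where the "terms and conditions" come in.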

My 2 cents.

Phillip



On Sun, Nov 12, 2017 at 9:14 AM, Jörn Franke  wrote:

> What do you mean all possible workloads?
> You cannot prepare any system to do all possible processing.
>
> We do not know the requirements of your data scientists now or in the
> future so it is difficult to say. How do they work currently without the
> new solution? Do they all work on the same data? I bet you will receive on
> your email a lot of private messages trying to sell their solution that
> solves everything - with the information you provided this is impossible to
> say.
>
> Then with every system: have incremental releases but have them in short
> time frames - do not engineer a big system that you will deliver in 2
> years. In the cloud you have the perfect possibility to scale feature but
> also infrastructure wise.
>
> Challenges with concurrent queries is the right definition of the
> scheduler (eg fairscheduler) that not one query take all the resources or
> that long running queries starve.
>
> User interfaces: what could help are notebooks (Jupyter etc) but you may
> need to train your data scientists. Some may know or prefer other tools.
>
> On 12. Nov 2017, at 08:32, Deepak Sharma  wrote:
>
> I am looking for similar solution more aligned to data scientist group.
> The concern i have is about supporting complex aggregations at runtime .
>
> Thanks
> Deepak
>
> On Nov 12, 2017 12:51, "ashish rawat"  wrote:
>
>> Hello Everyone,
>>
>> I was trying to understand if anyone here has tried a data warehouse
>> solution using S3 and Spark SQL. Out of multiple possible options
>> (redshift, presto, hive etc), we were planning to go with Spark SQL, for
>> our aggregates and processing requirements.
>>
>> If anyone has tried it out, would like to understand the following:
>>
>>1. Is Spark SQL and UDF, able to handle all the workloads?
>>2. What user interface did you provide for data scientist, data
>>engineers and analysts
>>3. What are the challenges in running concurrent queries, by many
>>users, over Spark SQL? Considering Spark still does not provide spill to
>>disk, in many scenarios, are there frequent query failures when executing
>>concurrent queries
>>4. Are there any open source implementations, which provide something
>>similar?
>>
>>
>> Regards,
>> Ashish
>>
>


Re: Spark based Data Warehouse

2017-11-12 Thread Jörn Franke
What do you mean by all possible workloads?
You cannot prepare any system to do all possible processing.

We do not know the requirements of your data scientists now or in the future, 
so it is difficult to say. How do they work currently without the new 
solution? Do they all work on the same data? I bet you will receive a lot of 
private emails trying to sell you a solution that solves everything. With the 
information you provided, this is impossible to say.

As with every system: have incremental releases, but have them in short time 
frames. Do not engineer a big system that you will deliver in 2 years. In the 
cloud you have the perfect opportunity to scale feature-wise but also 
infrastructure-wise.

The challenge with concurrent queries is defining the scheduler correctly 
(e.g. the fair scheduler) so that no single query takes all the resources and 
long-running queries do not starve.
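
As a sketch of what such a scheduler definition can look like, assuming Spark's own in-application fair scheduler is meant, the snippet below generates a pool allocation file with the Python standard library. The pool names, weights and minimum shares are made-up examples and the helper function is hypothetical.

```python
import xml.etree.ElementTree as ET


def fair_scheduler_xml(pools):
    """Build a fair-scheduler allocation file from {name: (weight, min_share)}."""
    root = ET.Element("allocations")
    for name, (weight, min_share) in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        ET.SubElement(pool, "schedulingMode").text = "FAIR"
        ET.SubElement(pool, "weight").text = str(weight)        # relative share
        ET.SubElement(pool, "minShare").text = str(min_share)   # guaranteed cores
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools: ad-hoc queries get twice the weight of batch reports.
print(fair_scheduler_xml({"adhoc": (2, 2), "reports": (1, 1)}))
```

The resulting file is referenced through `spark.scheduler.allocation.file` with `spark.scheduler.mode` set to `FAIR`, and a job picks its pool by setting the `spark.scheduler.pool` local property on the Spark context.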

User interfaces: what could help are notebooks (Jupyter etc) but you may need 
to train your data scientists. Some may know or prefer other tools.

> On 12. Nov 2017, at 08:32, Deepak Sharma  wrote:
> 
> I am looking for similar solution more aligned to data scientist group.
> The concern i have is about supporting complex aggregations at runtime .
> 
> Thanks
> Deepak
> 
>> On Nov 12, 2017 12:51, "ashish rawat"  wrote:
>> Hello Everyone,
>> 
>> I was trying to understand if anyone here has tried a data warehouse 
>> solution using S3 and Spark SQL. Out of multiple possible options (redshift, 
>> presto, hive etc), we were planning to go with Spark SQL, for our aggregates 
>> and processing requirements.
>> 
>> If anyone has tried it out, would like to understand the following:
>> 1. Is Spark SQL and UDF, able to handle all the workloads?
>> 2. What user interface did you provide for data scientist, data engineers and 
>>    analysts
>> 3. What are the challenges in running concurrent queries, by many users, over 
>>    Spark SQL? Considering Spark still does not provide spill to disk, in many 
>>    scenarios, are there frequent query failures when executing concurrent 
>>    queries
>> 4. Are there any open source implementations, which provide something similar?
>> 
>> Regards,
>> Ashish