Re: help from other committers on getting started

2016-09-02 Thread Dayne Sorvisto
Thank you, Michael! I didn't know Apache was such a deep website on the clear net :P
But I didn't expect anything less lol, very cool.

On Friday, September 2, 2016 6:04 PM, Michael Allman  
wrote:
 

 Hi Dayne,
Have a look at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark. I 
think you'll find answers to most of your questions there.
Cheers,
Michael


On Sep 2, 2016, at 8:53 AM, Dayne Sorvisto  
wrote:
Hi,
I'd like to request help from committers/contributors to work on some trivial 
bug fixes or documentation for the Spark project. I'm very interested in the 
machine learning side of things as I have a math background. I recently passed 
the databricks cert and feel I have a decent understanding of the key concepts 
I need to get started as a beginner contributor. My github is DayneSorvisto 
(Dayne ) and I've signed up for a Jira account.



Thank you,
Dayne Sorvisto



   

Re: help from other committers on getting started

2016-09-02 Thread Michael Allman
Hi Dayne,

Have a look at
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark. I
think you'll find answers to most of your questions there.

Cheers,

Michael


> On Sep 2, 2016, at 8:53 AM, Dayne Sorvisto  
> wrote:
> 
> Hi,
> 
> I'd like to request help from committers/contributors to work on some trivial 
> bug fixes or documentation for the Spark project. I'm very interested in the 
> machine learning side of things as I have a math background. I recently 
> passed the databricks cert and feel I have a decent understanding of the key 
> concepts I need to get started as a beginner contributor. My github is 
> DayneSorvisto (Dayne )  and I've signed up 
> for a Jira account.
> 
> 
> Thank you,
> Dayne Sorvisto



Re: critical bugs to be fixed in Spark 2.0.1?

2016-09-02 Thread Miao Wang

I am trying to reproduce it on my cluster based on your instructions.



From:   tomerk11 
To: dev@spark.apache.org
Date:   09/02/2016 12:32 PM
Subject:Re: critical bugs to be fixed in Spark 2.0.1?



We are regularly hitting the issue described in SPARK-17110
(https://issues.apache.org/jira/browse/SPARK-17110) and this is blocking us
from upgrading from 1.6 to 2.0.0.

It would be great if this could be fixed for 2.0.1



--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/critical-bugs-to-be-fixed-in-Spark-2-0-1-tp18686p18838.html

Sent from the Apache Spark Developers List mailing list archive at
Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





Committing Kafka offsets when using DirectKafkaInputDStream

2016-09-02 Thread vonnagy
I have upgraded to Spark 2.0 and am experimenting with Kafka 0.10.0. I have a
stream from which I extract the data, and I would like to update the Kafka
offsets as each partition is handled. With Spark 1.6 or Spark 2.0 and Kafka
0.8.2 I was able to update the offsets, but now there seems to be no way to do
so. Here is an example:

import org.apache.spark.TaskContext
// HasOffsetRanges comes from the Kafka integration package
// (org.apache.spark.streaming.kafka010 for 0.10, org.apache.spark.streaming.kafka for 0.8).
import org.apache.spark.streaming.kafka010.HasOffsetRanges

val stream = getStream

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { events =>
    val partId = TaskContext.get.partitionId
    val offsets = offsetRanges(partId)

    // Do something with the data

    // Update the offsets for the partition so that, at most, only this
    // partition's data would be duplicated
  }
}

With the new stream, I could call `commitAsync` with the offsets, but the
drawback is that it would only update the offsets after the entire RDD has been
handled. This can be a real issue for near-"exactly once" semantics.
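
That pattern looks roughly like this with the spark-streaming-kafka-0-10 API (a
sketch using the same `stream` as above):

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the whole RDD first ...

  // Commits the offsets for every partition of this batch at once, asynchronously,
  // which is exactly the granularity problem described above.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}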

With the new logic, each partition has a Kafka consumer associated with it;
however, there is no access to it. I have looked at the CachedKafkaConsumer
classes, and there is no way to get at the cache either, so I cannot call a
commit on the offsets.

Beyond that, I have tried to use the new Kafka 0.10 APIs directly, but I
always run into errors because they require one to subscribe to the topic and
get assigned partitions. I only want to update the offsets in Kafka.
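
One workaround that should be possible with the plain Kafka 0.10 consumer is to
use assign() instead of subscribe(), so no subscription or rebalance is
involved. A rough sketch (broker address, topic, and group id are placeholders,
and creating a consumer per partition per batch is heavyweight):

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition

def commitPartitionOffset(topic: String, partition: Int, offset: Long): Unit = {
  val props = new Properties()
  props.put("bootstrap.servers", "broker:9092")  // placeholder
  props.put("group.id", "my-group")              // placeholder
  props.put("key.deserializer",
    "org.apache.kafka.common.serialization.ByteArrayDeserializer")
  props.put("value.deserializer",
    "org.apache.kafka.common.serialization.ByteArrayDeserializer")
  props.put("enable.auto.commit", "false")

  val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
  try {
    val tp = new TopicPartition(topic, partition)
    consumer.assign(Collections.singletonList(tp))  // no subscribe, no rebalance
    consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(offset)))
  } finally {
    consumer.close()
  }
}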

Any ideas on how I might work with the Kafka API to set the offsets, or on
getting Spark to add logic that allows committing offsets on a per-partition
basis, would be helpful.

Thanks,

Ivan



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Committing-Kafka-offsets-when-using-DirectKafkaInputDStream-tp18840.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: help getting started

2016-09-02 Thread Jakob Odersky
Hi Dayne,
you can look at this page for some starter issues:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened).
Also check out this guide on how to contribute to Spark
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

regards,
--Jakob

On Fri, Sep 2, 2016 at 11:56 AM, dayne sorvisto  wrote:
> Hi,
>
> I'd like to request help from committers/contributors to work on some
> trivial bug fixes or documentation for the Spark project. I'm very
> interested in the machine learning side of things as I have a math
> background. I recently passed the databricks cert and feel I have a decent
> understanding of the key concepts I need to get started as a beginner
> contributor.  and I've signed up for a Jira account.
>
> Thank you
>
> On Fri, Sep 2, 2016 at 12:54 PM, dayne sorvisto 
> wrote:
>>
>> Hi,
>>
>> I'd like to request help from committers/contributors to work on some
>> trivial bug fixes or documentation for the Spark project. I'm very
>> interested in the machine learning side of things as I have a math
>> background. I recently passed the databricks cert and feel I have a decent
>> understanding of the key concepts I need to get started as a beginner
>> contributor.  and I've signed up for a Jira account.
>>
>> Thank you
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: critical bugs to be fixed in Spark 2.0.1?

2016-09-02 Thread tomerk11
We are regularly hitting the issue described in SPARK-17110
(https://issues.apache.org/jira/browse/SPARK-17110) and this is blocking us
from upgrading from 1.6 to 2.0.0.

It would be great if this could be fixed for 2.0.1 



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/critical-bugs-to-be-fixed-in-Spark-2-0-1-tp18686p18838.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: help getting started

2016-09-02 Thread dayne sorvisto
Hi,

I'd like to request help from committers/contributors to work on some
trivial bug fixes or documentation for the Spark project. I'm very
interested in the machine learning side of things as I have a math
background. I recently passed the databricks cert and feel I have a decent
understanding of the key concepts I need to get started as a beginner
contributor, and I've signed up for a Jira account.

Thank you

On Fri, Sep 2, 2016 at 12:54 PM, dayne sorvisto 
wrote:

> Hi,
>
> I'd like to request help from committers/contributors to work on some
> trivial bug fixes or documentation for the Spark project. I'm very
> interested in the machine learning side of things as I have a math
> background. I recently passed the databricks cert and feel I have a decent
> understanding of the key concepts I need to get started as a beginner
> contributor.  and I've signed up for a Jira account.
>
> Thank you
>


Re: Is Spark's KMeans unable to handle bigdata?

2016-09-02 Thread Georgios Samaras
I am not using the "runs" parameter anyway, but I see your point. If you
could point out any modifications in the minimal example I posted, I would
be more than interested to try them!

On Fri, Sep 2, 2016 at 10:43 AM, Sean Owen  wrote:

> Eh... more specifically, since Spark 2.0 the "runs" parameter in the
> KMeans mllib implementation has been ignored and is always 1. This
> means a lot of code that wraps this stuff up in arrays could be
> simplified quite a lot. I'll take a shot at optimizing this code and
> see if I can measure an effect.
>
> On Fri, Sep 2, 2016 at 6:33 PM, Sean Owen  wrote:
> > Yes it works fine, though each iteration of the parallel init step is
> > slow indeed -- about 5 minutes on my cluster. Given your question I
> > think you are actually 'hanging' because resources are being killed.
> >
> > I think this init may need some love and optimization. For example, I
> > think treeAggregate might work better. An Array[Float] may be just
> > fine and cut down memory usage, etc.
> >
> > On Fri, Sep 2, 2016 at 5:47 PM, Georgios Samaras
> >  wrote:
> >> So you were able to execute the minimal example I posted?
> >>
> >> I mean that the application doesn't progresses, it hangs (I would be OK
> if
> >> it was just slower). It doesn't seem to me a configuration issue.
> >>
> >> On Fri, Sep 2, 2016 at 1:07 AM, Sean Owen  wrote:
> >>>
> >>> Hm, what do you mean? k-means|| init is certainly slower because it's
> >>> making passes over the data in order to pick better initial centroids.
> >>> The idea is that you might then spend fewer iterations converging
> >>> later, and converge to a better clustering.
> >>>
> >>> Your problem doesn't seem to be related to scale. You aren't even
> >>> running out of memory it seems. Your memory settings are causing YARN
> >>> to kill the executors for using more memory than they advertise. That
> >>> could mean it never proceeds if this happens a lot.
> >>>
> >>> I don't have any problems with it.
> >>>
> >>> On Thu, Sep 1, 2016 at 11:35 PM, Georgios Samaras
> >>>  wrote:
> >>> > Dear all,
> >>> >
> >>> >   the random initialization works well, but the default
> initialization
> >>> > is
> >>> > k-means|| and has made me struggle. Also, I had heard people one year
> >>> > ago
> >>> > struggling with it too, and everybody would just skip it and use
> random,
> >>> > but
> >>> > I cannot keep it inside me!
> >>> >
> >>> >   I have posted a minimal example here..
> >>> >
> >>> > Please advice,
> >>> > George Samaras
> >>
> >>
>


Re: Is Spark's KMeans unable to handle bigdata?

2016-09-02 Thread Sean Owen
Eh... more specifically, since Spark 2.0 the "runs" parameter in the
KMeans mllib implementation has been ignored and is always 1. This
means a lot of code that wraps this stuff up in arrays could be
simplified quite a lot. I'll take a shot at optimizing this code and
see if I can measure an effect.
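
To illustrate the user-facing side of that (a sketch, not the MLlib internals):

import org.apache.spark.mllib.clustering.KMeans

val km = new KMeans()
  .setK(10)     // arbitrary k
  .setRuns(5)   // has no effect since Spark 2.0; internally treated as a single run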

On Fri, Sep 2, 2016 at 6:33 PM, Sean Owen  wrote:
> Yes it works fine, though each iteration of the parallel init step is
> slow indeed -- about 5 minutes on my cluster. Given your question I
> think you are actually 'hanging' because resources are being killed.
>
> I think this init may need some love and optimization. For example, I
> think treeAggregate might work better. An Array[Float] may be just
> fine and cut down memory usage, etc.
>
> On Fri, Sep 2, 2016 at 5:47 PM, Georgios Samaras
>  wrote:
>> So you were able to execute the minimal example I posted?
>>
>> I mean that the application doesn't progresses, it hangs (I would be OK if
>> it was just slower). It doesn't seem to me a configuration issue.
>>
>> On Fri, Sep 2, 2016 at 1:07 AM, Sean Owen  wrote:
>>>
>>> Hm, what do you mean? k-means|| init is certainly slower because it's
>>> making passes over the data in order to pick better initial centroids.
>>> The idea is that you might then spend fewer iterations converging
>>> later, and converge to a better clustering.
>>>
>>> Your problem doesn't seem to be related to scale. You aren't even
>>> running out of memory it seems. Your memory settings are causing YARN
>>> to kill the executors for using more memory than they advertise. That
>>> could mean it never proceeds if this happens a lot.
>>>
>>> I don't have any problems with it.
>>>
>>> On Thu, Sep 1, 2016 at 11:35 PM, Georgios Samaras
>>>  wrote:
>>> > Dear all,
>>> >
>>> >   the random initialization works well, but the default initialization
>>> > is
>>> > k-means|| and has made me struggle. Also, I had heard people one year
>>> > ago
>>> > struggling with it too, and everybody would just skip it and use random,
>>> > but
>>> > I cannot keep it inside me!
>>> >
>>> >   I have posted a minimal example here..
>>> >
>>> > Please advice,
>>> > George Samaras
>>
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Support for Hive 2.x

2016-09-02 Thread Dongjoon Hyun
Hi, Rostyslav,

After your email, I also tried searching this morning, but I didn't find a
suitable one.

The last related issue is SPARK-8064, `Upgrade Hive to 1.2`

https://issues.apache.org/jira/browse/SPARK-8064

If you want, you can file a JIRA issue describing your pain points; then you
can track progress through it.

I guess you have more reasons to do that than just the compilation issue.

Bests,
Dongjoon.



On Fri, Sep 2, 2016 at 12:51 AM, Rostyslav Sotnychenko <
r.sotnyche...@gmail.com> wrote:

> Hello!
>
> I tried compiling Spark 2.0 with Hive 2.0, but as expected this failed.
>
> So I am wondering if there is any talks going on about adding support of
> Hive 2.x to Spark? I was unable to find any JIRA about this.
>
>
> Thanks,
> Rostyslav
>
>


Re: Is Spark's KMeans unable to handle bigdata?

2016-09-02 Thread Sean Owen
Yes it works fine, though each iteration of the parallel init step is
slow indeed -- about 5 minutes on my cluster. Given your question I
think you are actually 'hanging' because resources are being killed.

I think this init may need some love and optimization. For example, I
think treeAggregate might work better. An Array[Float] may be just
fine and cut down memory usage, etc.
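
To make the treeAggregate idea concrete, a rough sketch (not the actual MLlib
code) of accumulating the clustering cost with a tree reduction instead of a
flat aggregate:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Sum of squared distances from each point to its nearest center, combined
// tree-wise on the executors so the driver merges far fewer partial results.
def clusteringCost(data: RDD[Vector], centers: Array[Vector]): Double =
  data.treeAggregate(0.0)(
    (acc, p) => acc + centers.map(c => Vectors.sqdist(p, c)).min, // seqOp
    _ + _                                                         // combOp
  )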

On Fri, Sep 2, 2016 at 5:47 PM, Georgios Samaras
 wrote:
> So you were able to execute the minimal example I posted?
>
> I mean that the application doesn't progresses, it hangs (I would be OK if
> it was just slower). It doesn't seem to me a configuration issue.
>
> On Fri, Sep 2, 2016 at 1:07 AM, Sean Owen  wrote:
>>
>> Hm, what do you mean? k-means|| init is certainly slower because it's
>> making passes over the data in order to pick better initial centroids.
>> The idea is that you might then spend fewer iterations converging
>> later, and converge to a better clustering.
>>
>> Your problem doesn't seem to be related to scale. You aren't even
>> running out of memory it seems. Your memory settings are causing YARN
>> to kill the executors for using more memory than they advertise. That
>> could mean it never proceeds if this happens a lot.
>>
>> I don't have any problems with it.
>>
>> On Thu, Sep 1, 2016 at 11:35 PM, Georgios Samaras
>>  wrote:
>> > Dear all,
>> >
>> >   the random initialization works well, but the default initialization
>> > is
>> > k-means|| and has made me struggle. Also, I had heard people one year
>> > ago
>> > struggling with it too, and everybody would just skip it and use random,
>> > but
>> > I cannot keep it inside me!
>> >
>> >   I have posted a minimal example here..
>> >
>> > Please advice,
>> > George Samaras
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Is Spark's KMeans unable to handle bigdata?

2016-09-02 Thread Georgios Samaras
So you were able to execute the minimal example I posted?

I mean that the application doesn't progress; it hangs (I would be OK if it
were just slower). It doesn't seem like a configuration issue to me.

On Fri, Sep 2, 2016 at 1:07 AM, Sean Owen  wrote:

> Hm, what do you mean? k-means|| init is certainly slower because it's
> making passes over the data in order to pick better initial centroids.
> The idea is that you might then spend fewer iterations converging
> later, and converge to a better clustering.
>
> Your problem doesn't seem to be related to scale. You aren't even
> running out of memory it seems. Your memory settings are causing YARN
> to kill the executors for using more memory than they advertise. That
> could mean it never proceeds if this happens a lot.
>
> I don't have any problems with it.
>
> On Thu, Sep 1, 2016 at 11:35 PM, Georgios Samaras
>  wrote:
> > Dear all,
> >
> >   the random initialization works well, but the default initialization is
> > k-means|| and has made me struggle. Also, I had heard people one year ago
> > struggling with it too, and everybody would just skip it and use random,
> but
> > I cannot keep it inside me!
> >
> >   I have posted a minimal example here..
> >
> > Please advice,
> > George Samaras
>


Re: sparkR array type not supported

2016-09-02 Thread Shivaram Venkataraman
I think it needs a type for the elements in the array. For example

f <- structField("x", "array<string>")
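
For comparison, a sketch of the analogous schema in the Scala API, where the
element type is explicit as well (string is just an arbitrary choice here):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("x", ArrayType(StringType, containsNull = true))
))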

Thanks
Shivaram

On Fri, Sep 2, 2016 at 8:26 AM, Paul R  wrote:
> Hi there,
>
> I’ve noticed the following command in sparkR
>
 field = structField(“x”, “array”)
>
> Throws this error
>
 Error in checkType(type) : Unsupported type for SparkDataframe: array
>
> Was wondering if this is a bug as the documentation says “array” should be 
> implemented
>
> Thanks
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



help from other committers on getting started

2016-09-02 Thread Dayne Sorvisto
Hi,
I'd like to request help from committers/contributors to work on some trivial 
bug fixes or documentation for the Spark project. I'm very interested in the 
machine learning side of things as I have a math background. I recently passed 
the databricks cert and feel I have a decent understanding of the key concepts 
I need to get started as a beginner contributor. My github is DayneSorvisto 
(Dayne ) and I've signed up for a Jira account.



Thank you,
Dayne Sorvisto

sparkR array type not supported

2016-09-02 Thread Paul R
Hi there,

I’ve noticed the following command in sparkR 

>>> field = structField(“x”, “array”)

Throws this error

>>> Error in checkType(type) : Unsupported type for SparkDataframe: array

I was wondering if this is a bug, as the documentation says “array” should be
implemented.

Thanks 
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: shuffle files not deleted after executor restarted

2016-09-02 Thread Artur Sukhenko
I believe in your case it will help, as the executors' shuffle files will be
managed by the external service.
It is described in the Spark docs under graceful-decommission-of-executors.
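
A minimal sketch of the application-side setting (in standalone mode the same
property also needs to be set for the workers, e.g. in their spark-defaults.conf,
so each Worker launches the shuffle service that outlives individual executors):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("long-running-app")               // placeholder name
  .set("spark.shuffle.service.enabled", "true") // executors register shuffle output with the service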



Artur



On Fri, Sep 2, 2016 at 1:01 PM 汪洋  wrote:

> On Sep 2, 2016, at 5:58 PM, 汪洋 wrote:
>
> Yeah, using external shuffle service is a reasonable choice but I think we
> will still face the same problems. We use SSDs to store shuffle files for
> performance considerations. If the shuffle files are not going to be used
> anymore, we want them to be deleted instead of taking up valuable SSD space.
>
> Not very familiar with external shuffle service though. Is it going to
> help in this case? -:)
>
> On Sep 2, 2016, at 5:40 PM, Artur Sukhenko wrote:
>
> Hi Yang,
>
> Isn't external shuffle service better for long running applications?
> "It runs as a standalone application and manages shuffle output files so
> they are available for executors at all time"
>
> It is described here:
>
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-ExternalShuffleService.html
>
> ---
> Artur
>
> On Fri, Sep 2, 2016 at 12:30 PM 汪洋  wrote:
>
>> Thank you for you response.
>>
>> We are using spark-1.6.2 on standalone deploy mode with dynamic
>> allocation disabled.
>>
>> I have traced the code. IMHO, it seems this cleanup is not handled by
>> shutdown hooks directly. The shutdown hooks only send a
>> “ExecutorStateChanged” message to the worker and if the worker see the
>> message, it will cleanup the directory *only when this application is
>> finished*. In our case, the application is not finished (long running).
>> The executor exits due to some unknown error and it is restarted by worker
>> right away. In this scenario, those old directories are not going to be
>> deleted.
>>
>> If the application is still running, is it safe to delete the old
>> “blockmgr” directory and leaving only the newest one?
>>
>> Our temporary solution is to restart our application regularly and we are
>> seeking a more elegant way.
>>
>> Thanks.
>>
>> Yang
>>
>>
>> On Sep 2, 2016, at 4:11 PM, Sun Rui wrote:
>>
>> Hi,
>> Could you give more information about your Spark environment? cluster
>> manager, spark version, using dynamic allocation or not, etc..
>>
>> Generally, executors will delete temporary directories for shuffle files
>> on exit because JVM shutdown hooks are registered. Unless they are brutally
>> killed.
>>
>> You can safely delete the directories when you are sure that the spark
>> applications related to them have finished. A crontab task may be used for
>> automatic clean up.
>>
>> On Sep 2, 2016, at 12:18, 汪洋  wrote:
>>
>> Hi all,
>>
>> I discovered that sometimes executor exits unexpectedly and when it is
>> restarted, it will create another blockmgr directory without deleting the
>> old ones. Thus, for a long running application, some shuffle files will
>> never be cleaned up. Sometimes those files could take up the whole disk.
>>
>> Is there a way to clean up those unused file automatically? Or is it safe
>> to delete the old directory manually only leaving the newest one?
>>
>> Here is the executor’s local directory.
>> 
>>
>> Any advice on this?
>>
>> Thanks.
>>
>> Yang
>>
>>
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>> --
> --
> Artur Sukhenko
>
>
> --
--
Artur Sukhenko


Re: shuffle files not deleted after executor restarted

2016-09-02 Thread 汪洋

> On Sep 2, 2016, at 5:58 PM, 汪洋 wrote:
> 
> Yeah, using external shuffle service is a reasonable choice but I think we 
> will still face the same problems. We use SSDs to store shuffle files for 
> performance considerations. If the shuffle files are not going to be used 
> anymore, we want them to be deleted instead of taking up valuable SSD space.
> 
Not very familiar with external shuffle service though. Is it going to help in 
this case? -:)
>> On Sep 2, 2016, at 5:40 PM, Artur Sukhenko wrote:
>> 
>> Hi Yang,
>> 
>> Isn't external shuffle service better for long running applications? 
>> "It runs as a standalone application and manages shuffle output files so 
>> they are available for executors at all time"
>> 
>> It is described here:
>> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-ExternalShuffleService.html
>>  
>> 
>> 
>> ---
>> Artur
>> 
>> On Fri, Sep 2, 2016 at 12:30 PM 汪洋 > > wrote:
>> Thank you for you response. 
>> 
>> We are using spark-1.6.2 on standalone deploy mode with dynamic allocation 
>> disabled.
>> 
>> I have traced the code. IMHO, it seems this cleanup is not handled by 
>> shutdown hooks directly. The shutdown hooks only send a 
>> “ExecutorStateChanged” message to the worker and if the worker see the 
>> message, it will cleanup the directory only when this application is 
>> finished. In our case, the application is not finished (long running). The 
>> executor exits due to some unknown error and it is restarted by worker right 
>> away. In this scenario, those old directories are not going to be deleted. 
>> 
>> If the application is still running, is it safe to delete the old “blockmgr” 
>> directory and leaving only the newest one?
>> 
>> Our temporary solution is to restart our application regularly and we are 
>> seeking a more elegant way. 
>> 
>> Thanks.
>> 
>> Yang
>> 
>> 
>>> On Sep 2, 2016, at 4:11 PM, Sun Rui wrote:
>>> 
>>> Hi,
>>> Could you give more information about your Spark environment? cluster 
>>> manager, spark version, using dynamic allocation or not, etc..
>>> 
>>> Generally, executors will delete temporary directories for shuffle files on 
>>> exit because JVM shutdown hooks are registered. Unless they are brutally 
>>> killed.
>>> 
>>> You can safely delete the directories when you are sure that the spark 
>>> applications related to them have finished. A crontab task may be used for 
>>> automatic clean up.
>>> 
 On Sep 2, 2016, at 12:18, 汪洋 >>> > wrote:
 
 Hi all,
 
 I discovered that sometimes executor exits unexpectedly and when it is 
 restarted, it will create another blockmgr directory without deleting the 
 old ones. Thus, for a long running application, some shuffle files will 
 never be cleaned up. Sometimes those files could take up the whole disk. 
 
 Is there a way to clean up those unused file automatically? Or is it safe 
 to delete the old directory manually only leaving the newest one?
 
 Here is the executor’s local directory.
 
 
 Any advice on this?
 
 Thanks.
 
 Yang
>>> 
>>> 
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>>> 
>>> 
>> 
>> -- 
>> --
>> Artur Sukhenko
> 



Re: shuffle files not deleted after executor restarted

2016-09-02 Thread 汪洋
Yeah, using external shuffle service is a reasonable choice but I think we will 
still face the same problems. We use SSDs to store shuffle files for 
performance considerations. If the shuffle files are not going to be used 
anymore, we want them to be deleted instead of taking up valuable SSD space.

> On Sep 2, 2016, at 5:40 PM, Artur Sukhenko wrote:
> 
> Hi Yang,
> 
> Isn't external shuffle service better for long running applications? 
> "It runs as a standalone application and manages shuffle output files so they 
> are available for executors at all time"
> 
> It is described here:
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-ExternalShuffleService.html
>  
> 
> 
> ---
> Artur
> 
> On Fri, Sep 2, 2016 at 12:30 PM 汪洋  > wrote:
> Thank you for you response. 
> 
> We are using spark-1.6.2 on standalone deploy mode with dynamic allocation 
> disabled.
> 
> I have traced the code. IMHO, it seems this cleanup is not handled by 
> shutdown hooks directly. The shutdown hooks only send a 
> “ExecutorStateChanged” message to the worker and if the worker see the 
> message, it will cleanup the directory only when this application is 
> finished. In our case, the application is not finished (long running). The 
> executor exits due to some unknown error and it is restarted by worker right 
> away. In this scenario, those old directories are not going to be deleted. 
> 
> If the application is still running, is it safe to delete the old “blockmgr” 
> directory and leaving only the newest one?
> 
> Our temporary solution is to restart our application regularly and we are 
> seeking a more elegant way. 
> 
> Thanks.
> 
> Yang
> 
> 
>> On Sep 2, 2016, at 4:11 PM, Sun Rui wrote:
>> 
>> Hi,
>> Could you give more information about your Spark environment? cluster 
>> manager, spark version, using dynamic allocation or not, etc..
>> 
>> Generally, executors will delete temporary directories for shuffle files on 
>> exit because JVM shutdown hooks are registered. Unless they are brutally 
>> killed.
>> 
>> You can safely delete the directories when you are sure that the spark 
>> applications related to them have finished. A crontab task may be used for 
>> automatic clean up.
>> 
>>> On Sep 2, 2016, at 12:18, 汪洋 >> > wrote:
>>> 
>>> Hi all,
>>> 
>>> I discovered that sometimes executor exits unexpectedly and when it is 
>>> restarted, it will create another blockmgr directory without deleting the 
>>> old ones. Thus, for a long running application, some shuffle files will 
>>> never be cleaned up. Sometimes those files could take up the whole disk. 
>>> 
>>> Is there a way to clean up those unused file automatically? Or is it safe 
>>> to delete the old directory manually only leaving the newest one?
>>> 
>>> Here is the executor’s local directory.
>>> 
>>> 
>>> Any advice on this?
>>> 
>>> Thanks.
>>> 
>>> Yang
>> 
>> 
>> 
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>> 
>> 
> 
> -- 
> --
> Artur Sukhenko



Re: shuffle files not deleted after executor restarted

2016-09-02 Thread Artur Sukhenko
Hi Yang,

Isn't external shuffle service better for long running applications?
"It runs as a standalone application and manages shuffle output files so
they are available for executors at all time"

It is described here:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-ExternalShuffleService.html

---
Artur

On Fri, Sep 2, 2016 at 12:30 PM 汪洋  wrote:

> Thank you for you response.
>
> We are using spark-1.6.2 on standalone deploy mode with dynamic allocation
> disabled.
>
> I have traced the code. IMHO, it seems this cleanup is not handled by
> shutdown hooks directly. The shutdown hooks only send a
> “ExecutorStateChanged” message to the worker and if the worker see the
> message, it will cleanup the directory *only when this application is
> finished*. In our case, the application is not finished (long running).
> The executor exits due to some unknown error and it is restarted by worker
> right away. In this scenario, those old directories are not going to be
> deleted.
>
> If the application is still running, is it safe to delete the old
> “blockmgr” directory and leaving only the newest one?
>
> Our temporary solution is to restart our application regularly and we are
> seeking a more elegant way.
>
> Thanks.
>
> Yang
>
>
> On Sep 2, 2016, at 4:11 PM, Sun Rui wrote:
>
> Hi,
> Could you give more information about your Spark environment? cluster
> manager, spark version, using dynamic allocation or not, etc..
>
> Generally, executors will delete temporary directories for shuffle files
> on exit because JVM shutdown hooks are registered. Unless they are brutally
> killed.
>
> You can safely delete the directories when you are sure that the spark
> applications related to them have finished. A crontab task may be used for
> automatic clean up.
>
> On Sep 2, 2016, at 12:18, 汪洋  wrote:
>
> Hi all,
>
> I discovered that sometimes executor exits unexpectedly and when it is
> restarted, it will create another blockmgr directory without deleting the
> old ones. Thus, for a long running application, some shuffle files will
> never be cleaned up. Sometimes those files could take up the whole disk.
>
> Is there a way to clean up those unused file automatically? Or is it safe
> to delete the old directory manually only leaving the newest one?
>
> Here is the executor’s local directory.
> 
>
> Any advice on this?
>
> Thanks.
>
> Yang
>
>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
> --
--
Artur Sukhenko


Re: shuffle files not deleted after executor restarted

2016-09-02 Thread 汪洋
Thank you for your response.

We are using spark-1.6.2 on standalone deploy mode with dynamic allocation 
disabled.

I have traced the code. IMHO, it seems this cleanup is not handled by shutdown
hooks directly. The shutdown hooks only send an “ExecutorStateChanged” message to
the worker, and when the worker sees the message, it will clean up the directory
only when the application is finished. In our case, the application is not
finished (long running). The executor exits due to some unknown error and is
restarted by the worker right away. In this scenario, those old directories are
not going to be deleted.

If the application is still running, is it safe to delete the old “blockmgr” 
directory and leaving only the newest one?

Our temporary solution is to restart our application regularly and we are 
seeking a more elegant way. 

Thanks.

Yang


> On Sep 2, 2016, at 4:11 PM, Sun Rui wrote:
> 
> Hi,
> Could you give more information about your Spark environment? cluster 
> manager, spark version, using dynamic allocation or not, etc..
> 
> Generally, executors will delete temporary directories for shuffle files on 
> exit because JVM shutdown hooks are registered. Unless they are brutally 
> killed.
> 
> You can safely delete the directories when you are sure that the spark 
> applications related to them have finished. A crontab task may be used for 
> automatic clean up.
> 
>> On Sep 2, 2016, at 12:18, 汪洋  wrote:
>> 
>> Hi all,
>> 
>> I discovered that sometimes executor exits unexpectedly and when it is 
>> restarted, it will create another blockmgr directory without deleting the 
>> old ones. Thus, for a long running application, some shuffle files will 
>> never be cleaned up. Sometimes those files could take up the whole disk. 
>> 
>> Is there a way to clean up those unused file automatically? Or is it safe to 
>> delete the old directory manually only leaving the newest one?
>> 
>> Here is the executor’s local directory.
>> 
>> 
>> Any advice on this?
>> 
>> Thanks.
>> 
>> Yang
> 
> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 



Re: shuffle files not deleted after executor restarted

2016-09-02 Thread Sun Rui
Hi,
Could you give more information about your Spark environment? Cluster manager,
Spark version, using dynamic allocation or not, etc.

Generally, executors will delete temporary directories for shuffle files on 
exit because JVM shutdown hooks are registered. Unless they are brutally killed.

You can safely delete the directories when you are sure that the spark 
applications related to them have finished. A crontab task may be used for 
automatic clean up.

> On Sep 2, 2016, at 12:18, 汪洋  wrote:
> 
> Hi all,
> 
> I discovered that sometimes executor exits unexpectedly and when it is 
> restarted, it will create another blockmgr directory without deleting the old 
> ones. Thus, for a long running application, some shuffle files will never be 
> cleaned up. Sometimes those files could take up the whole disk. 
> 
> Is there a way to clean up those unused file automatically? Or is it safe to 
> delete the old directory manually only leaving the newest one?
> 
> Here is the executor’s local directory.
> 
> 
> Any advice on this?
> 
> Thanks.
> 
> Yang



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Is Spark's KMeans unable to handle bigdata?

2016-09-02 Thread Sean Owen
Hm, what do you mean? k-means|| init is certainly slower because it's
making passes over the data in order to pick better initial centroids.
The idea is that you might then spend fewer iterations converging
later, and converge to a better clustering.
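
For reference, the init mode is just a parameter on the MLlib KMeans builder,
so comparing the two is cheap (a rough sketch, assuming `data` is an
RDD[Vector] of features):

import org.apache.spark.mllib.clustering.KMeans

val model = new KMeans()
  .setK(10)                         // arbitrary k
  .setMaxIterations(20)
  .setInitializationMode("random")  // the default is "k-means||"
  .run(data)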

Your problem doesn't seem to be related to scale. You aren't even
running out of memory it seems. Your memory settings are causing YARN
to kill the executors for using more memory than they advertise. That
could mean it never proceeds if this happens a lot.

I don't have any problems with it.

On Thu, Sep 1, 2016 at 11:35 PM, Georgios Samaras
 wrote:
> Dear all,
>
>   the random initialization works well, but the default initialization is
> k-means|| and has made me struggle. Also, I had heard people one year ago
> struggling with it too, and everybody would just skip it and use random, but
> I cannot keep it inside me!
>
>   I have posted a minimal example here..
>
> Please advice,
> George Samaras

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Support for Hive 2.x

2016-09-02 Thread Rostyslav Sotnychenko
Hello!

I tried compiling Spark 2.0 with Hive 2.0, but as expected this failed.

So I am wondering: are there any talks going on about adding support for Hive
2.x in Spark? I was unable to find any JIRA about this.


Thanks,
Rostyslav