Support for Hive 2.x

2016-09-02 Thread Rostyslav Sotnychenko
Hello! I tried compiling Spark 2.0 with Hive 2.0, but as expected this failed. So I am wondering whether there are any discussions going on about adding support for Hive 2.x to Spark? I was unable to find any JIRA about this. Thanks, Rostyslav

Re: Is Spark's KMeans unable to handle bigdata?

2016-09-02 Thread Sean Owen
Hm, what do you mean? k-means|| init is certainly slower because it's making passes over the data in order to pick better initial centroids. The idea is that you might then spend fewer iterations converging later, and converge to a better clustering. Your problem doesn't seem to be related to scal
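A minimal sketch of how one might isolate the init cost Sean describes, using the spark.mllib KMeans API (the data and sizes here are illustrative, not from the thread):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

object KMeansInitCompare {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-init-compare"))
    // Toy data; substitute the real RDD[Vector] being clustered.
    val data = sc.parallelize(Seq.fill(100000)(
      Vectors.dense(math.random, math.random))).cache()
    data.count() // materialize the cache before timing

    // Time "k-means||" init against plain "random" init.
    for (mode <- Seq(KMeans.K_MEANS_PARALLEL, KMeans.RANDOM)) {
      val t0 = System.nanoTime()
      new KMeans()
        .setK(10)
        .setMaxIterations(10)
        .setInitializationMode(mode)
        .run(data)
      println(f"init=$mode%s took ${(System.nanoTime() - t0) / 1e9}%.1fs")
    }
    sc.stop()
  }
}
```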

Re: shuffle files not deleted after executor restarted

2016-09-02 Thread Sun Rui
Hi, Could you give more information about your Spark environment? Cluster manager, Spark version, using dynamic allocation or not, etc. Generally, executors will delete temporary directories for shuffle files on exit because JVM shutdown hooks are registered, unless they are brutally killed. Y
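A hypothetical sketch of the mechanism being described (Spark's real cleanup code is more involved): a JVM shutdown hook removes the temp directory on normal exit, but a SIGKILL bypasses shutdown hooks entirely, leaving the files behind.

```scala
import java.io.File

object ShutdownCleanupSketch {
  // Recursively delete a directory tree.
  private def deleteRecursively(f: File): Unit = {
    Option(f.listFiles).foreach(_.foreach(deleteRecursively))
    f.delete()
  }

  def main(args: Array[String]): Unit = {
    val shuffleDir = new File("/tmp/shuffle-sketch") // hypothetical path
    shuffleDir.mkdirs()
    // Runs on normal JVM exit, but not when the process is killed with -9.
    sys.addShutdownHook(deleteRecursively(shuffleDir))
    // ... work that writes shuffle files would happen here ...
  }
}
```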

Re: shuffle files not deleted after executor restarted

2016-09-02 Thread 汪洋
Thank you for your response. We are using spark-1.6.2 on standalone deploy mode with dynamic allocation disabled. I have traced the code. IMHO, it seems this cleanup is not handled by shutdown hooks directly. The shutdown hooks only send an “ExecutorStateChanged” message to the worker and if th

Re: shuffle files not deleted after executor restarted

2016-09-02 Thread Artur Sukhenko
Hi Yang, Isn't the external shuffle service better for long-running applications? "It runs as a standalone application and manages shuffle output files so they are available for executors at all time" It is described here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-Extern
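For reference, the documented settings involved (shown here as SparkConf calls; the key names are from the Spark configuration docs, not this thread):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Serve shuffle files from a long-lived service on each worker,
  // so they outlive individual executors.
  .set("spark.shuffle.service.enabled", "true")
  // Optional: dynamic allocation requires the external shuffle service.
  .set("spark.dynamicAllocation.enabled", "true")
```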

Re: shuffle files not deleted after executor restarted

2016-09-02 Thread 汪洋
Yeah, using the external shuffle service is a reasonable choice, but I think we will still face the same problems. We use SSDs to store shuffle files for performance reasons. If the shuffle files are not going to be used anymore, we want them to be deleted instead of taking up valuable SSD spa

Re: shuffle files not deleted after executor restarted

2016-09-02 Thread 汪洋
> On September 2, 2016, at 5:58 PM, 汪洋 wrote: > > Yeah, using the external shuffle service is a reasonable choice, but I think we > will still face the same problems. We use SSDs to store shuffle files for > performance reasons. If the shuffle files are not going to be used > anymore, we want them to be dele

Re: shuffle files not deleted after executor restarted

2016-09-02 Thread Artur Sukhenko
I believe in your case it will help, as the executor's shuffle files will be managed by the external service. It is described in the Spark docs: graceful-decommission-of-executors Artur On Fri, Sep 2, 2016 at 1:01

sparkR array type not supported

2016-09-02 Thread Paul R
Hi there, I've noticed the following command in SparkR >>> field = structField("x", "array") Throws this error >>> Error in checkType(type) : Unsupported type for SparkDataframe: array Was wondering if this is a bug, as the documentation says "array" should be supported. Thanks

help from other committers on getting started

2016-09-02 Thread Dayne Sorvisto
Hi, I'd like to request help from committers/contributors to work on some trivial bug fixes or documentation for the Spark project. I'm very interested in the machine learning side of things as I have a math background. I recently passed the Databricks cert and feel I have a decent understanding

Re: sparkR array type not supported

2016-09-02 Thread Shivaram Venkataraman
I think it needs a type for the elements in the array. For example f <- structField("x", "array<string>") Thanks Shivaram On Fri, Sep 2, 2016 at 8:26 AM, Paul R wrote: > Hi there, > > I've noticed the following command in SparkR > field = structField("x", "array") > > Throws this error > Erro
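The same requirement in the Scala API, for comparison (illustrative, not from the thread): an array column must declare its element type.

```scala
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// ArrayType takes the element type explicitly; a bare "array" is not enough.
val schema = StructType(Seq(
  StructField("x", ArrayType(StringType), nullable = true)
))
```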

Re: Is Spark's KMeans unable to handle bigdata?

2016-09-02 Thread Georgios Samaras
So you were able to execute the minimal example I posted? I mean that the application doesn't progress; it hangs (I would be OK if it were just slower). It doesn't seem like a configuration issue to me. On Fri, Sep 2, 2016 at 1:07 AM, Sean Owen wrote: > Hm, what do you mean? k-means|| init is certa

Re: Is Spark's KMeans unable to handle bigdata?

2016-09-02 Thread Sean Owen
Yes it works fine, though each iteration of the parallel init step is slow indeed -- about 5 minutes on my cluster. Given your question I think you are actually 'hanging' because resources are being killed. I think this init may need some love and optimization. For example, I think treeAggregate m
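Until that init path gets the optimization Sean mentions, one hedged workaround (assuming the spark.mllib API) is to cut the number of k-means|| init steps, since each step is another pass over the data:

```scala
import org.apache.spark.mllib.clustering.KMeans

val km = new KMeans()
  .setK(10)
  .setInitializationSteps(2) // fewer passes over the data than the default of 5
```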

Re: Support for Hive 2.x

2016-09-02 Thread Dongjoon Hyun
Hi, Rostyslav, After your email, I also searched this morning, but I didn't find a relevant one. The last related issue is SPARK-8064, `Upgrade Hive to 1.2` https://issues.apache.org/jira/browse/SPARK-8064 If you want, you can file a JIRA issue including your pain points, then you can

Re: Is Spark's KMeans unable to handle bigdata?

2016-09-02 Thread Sean Owen
Eh... more specifically, since Spark 2.0 the "runs" parameter in the KMeans mllib implementation has been ignored and is always 1. This means a lot of code that wraps this stuff up in arrays could be simplified quite a lot. I'll take a shot at optimizing this code and see if I can measure an effect
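Concretely, in Spark 2.0 the deprecated setter is still accepted but has no effect (a quick illustration, not from the thread):

```scala
import org.apache.spark.mllib.clustering.KMeans

// setRuns is deprecated and ignored in Spark 2.0: runs is pinned to 1,
// so both of these configure the same single-run job.
val a = new KMeans().setK(10).setRuns(5)
val b = new KMeans().setK(10)
```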

Re: Is Spark's KMeans unable to handle bigdata?

2016-09-02 Thread Georgios Samaras
I am not using the "runs" parameter anyway, but I see your point. If you could point out any modifications in the minimal example I posted, I would be more than interested to try them! On Fri, Sep 2, 2016 at 10:43 AM, Sean Owen wrote: > Eh... more specifically, since Spark 2.0 the "runs" paramet

Re: help getting started

2016-09-02 Thread dayne sorvisto
Hi, I'd like to request help from committers/contributors to work on some trivial bug fixes or documentation for the Spark project. I'm very interested in the machine learning side of things as I have a math background. I recently passed the Databricks cert and feel I have a decent understanding o

Re: critical bugs to be fixed in Spark 2.0.1?

2016-09-02 Thread tomerk11
We are regularly hitting the issue described in SPARK-17110 (https://issues.apache.org/jira/browse/SPARK-17110) and this is blocking us from upgrading from 1.6 to 2.0.0. It would be great if this could be fixed for 2.0.1

Re: help getting started

2016-09-02 Thread Jakob Odersky
Hi Dayne, you can look at this page for some starter issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened). Also check out this guide on how to contribute to Spark https://cwiki.apa

Committing Kafka offsets when using DirectKafkaInputDStream

2016-09-02 Thread vonnagy
I have upgraded to Spark 2.0 and am experimenting with Kafka 0.10.0. I have a stream from which I extract data, and I would like to update the Kafka offsets as each partition is handled. With Spark 1.6 or Spark 2.0 and Kafka 0.8.2 I was able to update the offsets, but now there seems to be no way to do
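One way the spark-streaming-kafka-0-10 integration exposes this is CanCommitOffsets.commitAsync. A sketch under the assumption that committing offsets back to Kafka after each batch is the goal (broker, topic, and group names below are hypothetical):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object CommitOffsetsSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("offsets"), Seconds(5))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092", // hypothetical broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "enable.auto.commit" -> (false: java.lang.Boolean) // commit manually below
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      // Capture the offset ranges of this batch before processing.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the partitions of rdd here ...
      // Commit back to Kafka once the batch's work is done.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
```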

Re: critical bugs to be fixed in Spark 2.0.1?

2016-09-02 Thread Miao Wang
I am trying to reproduce it on my cluster based on your instructions. From: tomerk11 To: dev@spark.apache.org Date: 09/02/2016 12:32 PM Subject: Re: critical bugs to be fixed in Spark 2.0.1? We are regularly hitting the issue described in SPARK-17110 (https://issues.apache.or

Re: help from other committers on getting started

2016-09-02 Thread Michael Allman
Hi Dayne, Have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark . I think you'll find answers to most of your questions there. Cheers, Michael > On Sep 2, 2016, at 8:53 AM, Dayne Sorvis

Re: help from other committers on getting started

2016-09-02 Thread Dayne Sorvisto
Thank you Michael! I didn't know Apache was a deep website on the clear net :P But I didn't expect anything less lol very coo On Friday, September 2, 2016 6:04 PM, Michael Allman wrote: Hi Dayne, Have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark