Re: help from other committers on getting started

2016-09-02 Thread Dayne Sorvisto
Thank you, Michael! I didn't know Apache was a deep website on the clear net :P But I didn't expect anything less lol, very cool. On Friday, September 2, 2016 6:04 PM, Michael Allman wrote: Hi Dayne, Have a look at

Re: help from other committers on getting started

2016-09-02 Thread Michael Allman
Hi Dayne, Have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark . I think you'll find answers to most of your questions there. Cheers, Michael > On Sep 2, 2016, at 8:53 AM, Dayne

Re: critical bugs to be fixed in Spark 2.0.1?

2016-09-02 Thread Miao Wang
I am trying to reproduce it on my cluster based on your instructions. From: tomerk11 To: dev@spark.apache.org Date: 09/02/2016 12:32 PM Subject: Re: critical bugs to be fixed in Spark 2.0.1? We are regularly hitting the issue described in SPARK-17110

Committing Kafka offsets when using DirectKafkaInputDStream

2016-09-02 Thread vonnagy
I have upgraded to Spark 2.0 and am experimenting with Kafka 0.10.0. I have a stream from which I extract the data, and I would like to update the Kafka offsets as each partition is handled. With Spark 1.6, or with Spark 2.0 and Kafka 0.8.2, I was able to update the offsets, but now there seems to be no way to
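For reference, the kafka-0-10 integration does expose offset commits through CanCommitOffsets; below is a minimal sketch of that per-batch commit pattern (broker address, topic, group id, and app name are placeholders). Note it commits offsets once per batch on the driver rather than per partition, which may be the gap being hit here.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val ssc = new StreamingContext(new SparkConf().setAppName("offset-commit-sketch"), Seconds(10))

    // Placeholder Kafka settings; auto-commit is disabled so offsets are committed explicitly.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("some-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      // Capture the offset ranges for this batch before any shuffle re-partitions the RDD.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch here ...
      // Commit the captured offsets back to Kafka once the batch's work is done.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }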

Re: help getting started

2016-09-02 Thread Jakob Odersky
Hi Dayne, you can look at this page for some starter issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened). Also check out this guide on how to contribute to Spark

Re: critical bugs to be fixed in Spark 2.0.1?

2016-09-02 Thread tomerk11
We are regularly hitting the issue described in SPARK-17110 (https://issues.apache.org/jira/browse/SPARK-17110) and this is blocking us from upgrading from 1.6 to 2.0.0. It would be great if this could be fixed for 2.0.1

Re: help getting started

2016-09-02 Thread dayne sorvisto
Hi, I'd like to request help from committers/contributors to work on some trivial bug fixes or documentation for the Spark project. I'm very interested in the machine learning side of things, as I have a math background. I recently passed the Databricks cert and feel I have a decent understanding

Re: Is Spark's KMeans unable to handle bigdata?

2016-09-02 Thread Sean Owen
Eh... more specifically, since Spark 2.0 the "runs" parameter in the KMeans mllib implementation has been ignored and is always 1. This means a lot of code that wraps this stuff up in arrays could be simplified quite a lot. I'll take a shot at optimizing this code and see if I can measure an
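For context, a minimal illustration (not from the thread) of why the parameter no longer matters: setting runs on the MLlib KMeans builder is a deprecated no-op in 2.0, so the call below trains exactly one run regardless of the value passed.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // `data` is a hypothetical RDD of feature vectors.
    def train(data: RDD[Vector]) =
      new KMeans()
        .setK(10)
        .setMaxIterations(20)
        .setRuns(5)   // deprecated; ignored since 2.0, effectively always 1
        .run(data)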

Re: Support for Hive 2.x

2016-09-02 Thread Dongjoon Hyun
Hi, Rostyslav. After your email, I also tried to search this morning, but I didn't find a proper one. The last related issue is SPARK-8064, `Upgrade Hive to 1.2`: https://issues.apache.org/jira/browse/SPARK-8064 If you want, you can file a JIRA issue including your pain points, then you can

Re: Is Spark's KMeans unable to handle bigdata?

2016-09-02 Thread Sean Owen
Yes, it works fine, though each iteration of the parallel init step is indeed slow -- about 5 minutes on my cluster. Given your question, I think you are actually 'hanging' because resources are being killed. I think this init may need some love and optimization. For example, I think treeAggregate

Re: Is Spark's KMeans unable to handle bigdata?

2016-09-02 Thread Georgios Samaras
So you were able to execute the minimal example I posted? I mean that the application doesn't progress; it hangs (I would be OK if it were just slower). It doesn't seem to me to be a configuration issue. On Fri, Sep 2, 2016 at 1:07 AM, Sean Owen wrote: > Hm, what do you mean?

Re: sparkR array type not supported

2016-09-02 Thread Shivaram Venkataraman
I think it needs a type for the elements in the array. For example f <- structField("x", "array<string>") Thanks Shivaram On Fri, Sep 2, 2016 at 8:26 AM, Paul R wrote: > Hi there, > > I've noticed the following command in sparkR > field = structField("x", "array") > > Throws
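For comparison, the underlying Spark SQL type requires an element type as well; a small Scala illustration (not part of the thread):

    import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

    // An array column always carries an element type, which is why a bare
    // "array" string is rejected on the SparkR side as well.
    val field  = StructField("x", ArrayType(StringType), nullable = true)
    val schema = StructType(Seq(field))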

help from other committers on getting started

2016-09-02 Thread Dayne Sorvisto
Hi, I'd like to request help from committers/contributors to work on some trivial bug fixes or documentation for the Spark project. I'm very interested in the machine learning side of things, as I have a math background. I recently passed the Databricks cert and feel I have a decent

sparkR array type not supported

2016-09-02 Thread Paul R
Hi there, I've noticed the following command in SparkR >>> field = structField("x", "array") Throws this error >>> Error in checkType(type) : Unsupported type for SparkDataframe: array. Was wondering if this is a bug, as the documentation says "array" should be implemented. Thanks

Re: shuffle files not deleted after executor restarted

2016-09-02 Thread Artur Sukhenko
I believe in your case it will help, as the executor's shuffle files will be managed by the external service. It is described in the Spark docs: graceful-decommission-of-executors Artur
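A minimal sketch of the configuration being suggested (the app name is a placeholder; on YARN the shuffle service also has to be set up on each NodeManager):

    import org.apache.spark.SparkConf

    // Shuffle files are then served by the external service rather than the executor,
    // which is what allows executors to be removed or restarted safely.
    val conf = new SparkConf()
      .setAppName("external-shuffle-example")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.enabled", "true") // commonly paired; requires the service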

Re: shuffle files not deleted after executor restarted

2016-09-02 Thread 汪洋
Yeah, using the external shuffle service is a reasonable choice, but I think we will still face the same problems. We use SSDs to store shuffle files for performance reasons. If the shuffle files are not going to be used anymore, we want them to be deleted instead of taking up valuable SSD

Re: shuffle files not deleted after executor restarted

2016-09-02 Thread Artur Sukhenko
Hi Yang, Isn't the external shuffle service better for long-running applications? "It runs as a standalone application and manages shuffle output files so they are available for executors at all time" It is described here:

Re: shuffle files not deleted after executor restarted

2016-09-02 Thread Sun Rui
Hi, Could you give more information about your Spark environment? Cluster manager, Spark version, whether dynamic allocation is used, etc. Generally, executors will delete temporary directories for shuffle files on exit because JVM shutdown hooks are registered, unless they are brutally killed.
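A toy illustration of the mechanism described here (not Spark's actual cleanup code): the hook runs on normal JVM exit, but never when the process is killed with SIGKILL, which is when files get left behind.

    import java.nio.file.{Files, Path}

    // Create a throwaway directory and register a hook that removes it on exit.
    val tmpDir: Path  = Files.createTempDirectory("shuffle-demo")
    val tmpFile: Path = Files.createTempFile(tmpDir, "block", ".data")

    sys.addShutdownHook {
      Files.deleteIfExists(tmpFile) // contents first, then the directory itself
      Files.deleteIfExists(tmpDir)
    }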

Re: Is Spark's KMeans unable to handle bigdata?

2016-09-02 Thread Sean Owen
Hm, what do you mean? k-means|| init is certainly slower because it's making passes over the data in order to pick better initial centroids. The idea is that you might then spend fewer iterations converging later, and converge to a better clustering. Your problem doesn't seem to be related to
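If the initialization passes themselves are the bottleneck, one workaround (an option MLlib exposes, not something suggested in the thread) is to switch to random initialization, trading init cost for possibly more iterations:

    import org.apache.spark.mllib.clustering.KMeans

    // "k-means||" is the default init mode; "random" skips the extra passes over the data.
    val kmeans = new KMeans()
      .setK(100)
      .setMaxIterations(20)
      .setInitializationMode("random")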

Support for Hive 2.x

2016-09-02 Thread Rostyslav Sotnychenko
Hello! I tried compiling Spark 2.0 with Hive 2.0, but as expected this failed. So I am wondering whether there are any talks going on about adding support for Hive 2.x to Spark. I was unable to find any JIRA about this. Thanks, Rostyslav