Re: is the Lineage of RDD stored as a byte code in memory or a file?

2016-08-24 Thread Daniel Darabos
You are saying the RDD lineage must be serialized, otherwise we could not recreate it after a node failure. This is false. The RDD lineage is not serialized. It is only relevant to the driver application and as such it is just kept in memory in the driver application. If the driver application
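
To make the point above concrete, here is a minimal toy model of lineage as a plain in-memory object graph. It is a sketch only, not Spark's implementation: `ToyRDD`, `lineage` and `compute` are hypothetical names invented for illustration, and the model ignores partitions, executors and caching.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD (not a Spark class)."""

    def __init__(self, parent=None, op="source", fn=None, data=None):
        self.parent = parent  # reference to the parent node: this is the lineage
        self.op = op          # name of the transformation that produced this node
        self.fn = fn          # function applied by the transformation
        self.data = data      # only set on the source node

    def map(self, fn):
        return ToyRDD(parent=self, op="map", fn=fn)

    def filter(self, pred):
        return ToyRDD(parent=self, op="filter", fn=pred)

    def lineage(self):
        """Walk the parent references and list the operations, source first."""
        ops, node = [], self
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))

    def compute(self):
        """Rebuild the rows from the source by replaying the lineage."""
        if self.parent is None:
            return list(self.data)
        rows = self.parent.compute()
        if self.op == "map":
            return [self.fn(x) for x in rows]
        return [x for x in rows if self.fn(x)]


source = ToyRDD(data=range(6))
evens_squared = source.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.lineage())  # ['source', 'filter', 'map']
print(evens_squared.compute())  # [0, 4, 16]
```

In this toy, nothing is written to disk: the chain of parent references lives only in the process that built it, and lost results are recovered by replaying that chain.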

Re: [ANNOUNCE] Apache Spark 2.0.0-preview release

2016-05-25 Thread Daniel Darabos
Awesome, thanks! It's very helpful for preparing for the migration. Do you plan to push 2.0.0-preview to Maven too? (I for one would appreciate the convenience.)
On Wed, May 25, 2016 at 8:44 AM, Reynold Xin wrote:
> In the past the Spark community have created preview

Re: Ever increasing physical memory for a Spark Application in YARN

2016-05-02 Thread Daniel Darabos
Hi Nitin, Sorry for waking up this ancient thread. That's a fantastic set of JVM flags! We just hit the same problem, but we haven't even discovered all those flags for limiting memory growth. I wanted to ask if you ever discovered anything further? I see you also set -XX:NewRatio=3. This is a

Re: Does SparkSql has official jdbc/odbc driver ?

2016-03-25 Thread Daniel Darabos
I haven't tried this, but I thought you could run the Thriftserver in Spark and then connect with the HiveServer2 JDBC driver:
http://spark.apache.org/docs/1.6.1/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
On Fri, Mar 25, 2016 at 7:57 AM, Reynold Xin wrote:

Re: Performance improvements for sorted RDDs

2016-03-21 Thread Daniel Darabos
There is related discussion in https://issues.apache.org/jira/browse/SPARK-8836. It's not too hard to implement this without modifying Spark and we measured ~10x improvement over plain RDD joins. I haven't benchmarked against DataFrames -- maybe they also realize this performance advantage. On
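
The preview above is cut off, so the exact technique it refers to is not visible here. As an assumption, the sketch below shows the general idea usually behind joins on sorted data: a merge join that walks two key-sorted inputs once instead of hashing or shuffling them. `merge_join` is a hypothetical helper written for illustration in plain Python. It is not the implementation measured in the message and not a Spark API.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs that are already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1  # key only on the left: skip it
        elif lk > rk:
            j += 1  # key only on the right: skip it
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit every pairing of the two runs, then jump past both.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, (lv, rv)))
            i, j = i_end, j_end
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z")]
print(merge_join(left, right))
# [(2, ('b', 'x')), (2, ('c', 'x')), (4, ('d', 'z'))]
```

Each input is scanned once, so the cost is linear in the input sizes plus the output size, which is where a sorted layout can pay off.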

Re: SPARK-9559

2016-02-18 Thread Daniel Darabos
YARN may be a workaround.
On Thu, Feb 18, 2016 at 4:13 PM, Ashish Soni wrote:
> Hi All,
>
> Just wanted to know if there is any workaround or resolution for the below
> issue in standalone mode
>
> https://issues.apache.org/jira/browse/SPARK-9559
>
> Ashish
>

Re: Spark 1.6.1

2016-02-03 Thread Daniel Darabos
On Tue, Feb 2, 2016 at 7:10 PM, Michael Armbrust wrote:
> What about the memory leak bug?
>> https://issues.apache.org/jira/browse/SPARK-11293
>> Even after the memory rewrite in 1.6.0, it still happens in some cases.
>> Will it be fixed for 1.6.1?
>>
>
> I think we have

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-18 Thread Daniel Darabos
+1 (non-binding)
It passes our tests after we registered 6 new classes with Kryo:
    kryo.register(classOf[org.apache.spark.sql.catalyst.expressions.UnsafeRow])
    kryo.register(classOf[Array[org.apache.spark.mllib.tree.model.Split]])

Re: Difference between a task and a job

2015-10-05 Thread Daniel Darabos
Actions trigger jobs. A job is made up of stages. A stage is made up of tasks. Executor threads execute tasks. Does that answer your question?
On Mon, Oct 5, 2015 at 12:52 PM, Guna Prasaad wrote:
> What is the difference between a task and a job in spark and
>
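
The hierarchy in the reply above (an action triggers a job, a job is made up of stages, a stage is made up of tasks, threads execute tasks) can be sketched as a small data structure. This is a toy model under stated assumptions, not Spark's scheduler: `Task`, `Stage`, `Job` and `run_job` are hypothetical names, a local thread pool stands in for executor threads, and stages simply run in list order.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    partition: int             # the slice of data this task works on
    run: Callable[[int], int]  # the work to do for that partition


@dataclass
class Stage:
    name: str
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Job:
    action: str                # the action that triggered the job
    stages: List[Stage] = field(default_factory=list)


def run_job(job: Job, threads: int = 2) -> Dict[str, List[int]]:
    """Run each stage in list order; tasks within a stage share a thread pool."""
    results: Dict[str, List[int]] = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for stage in job.stages:
            futures = [pool.submit(t.run, t.partition) for t in stage.tasks]
            results[stage.name] = [f.result() for f in futures]
    return results


job = Job(
    action="count",
    stages=[
        Stage("map", [Task(p, lambda part: part * 10) for p in range(3)]),
        Stage("reduce", [Task(0, lambda part: part + 1)]),
    ],
)
results = run_job(job)
print(results)  # {'map': [0, 10, 20], 'reduce': [1]}
```

One job here holds two stages and four tasks in total, which is the containment relationship the reply describes.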

Re: HyperLogLogUDT

2015-07-01 Thread Daniel Darabos
It's already possible to just copy the code from countApproxDistinct
https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1153
and access the HLL directly, or do anything you like.
On Wed, Jul 1, 2015 at 5:26 PM, Nick Pentreath nick.pentre...@gmail.com

Re: how can I write a language wrapper?

2015-06-29 Thread Daniel Darabos
Hi Vasili, It so happens that the entire SparkR code was merged into Apache Spark in a single pull request, so you can see all the required changes at once in https://github.com/apache/spark/pull/5096. It's 12,043 lines and, as I understand it, took more than 20 people about a year to write.
On Mon,

Re: RDD split into multiple RDDs

2015-04-29 Thread Daniel Darabos
Check out http://stackoverflow.com/a/26051042/3318517. It's a nice method for saving the RDD into separate files by key in a single pass. Then you can read the files into separate RDDs.
On Wed, Apr 29, 2015 at 2:10 PM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi
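
The idea in the reply above (write each record to a file chosen by its key in one pass, then read each file back separately) can be illustrated without Spark. The sketch below is plain Python on the local filesystem and is not the code from the linked answer. `write_by_key` and `read_key` are hypothetical helpers, and it assumes keys are safe to use as file names and that the number of distinct keys is small enough to keep one open file per key.

```python
import os
import tempfile


def write_by_key(records, out_dir):
    """Write (key, value) records to one text file per key in a single pass."""
    handles = {}
    try:
        for key, value in records:
            if key not in handles:
                # Open the file for a key the first time that key is seen.
                handles[key] = open(os.path.join(out_dir, f"{key}.txt"), "w")
            handles[key].write(f"{value}\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)


def read_key(out_dir, key):
    """Read back the values written for a single key."""
    with open(os.path.join(out_dir, f"{key}.txt")) as f:
        return [line.rstrip("\n") for line in f]


with tempfile.TemporaryDirectory() as out_dir:
    keys = write_by_key([("a", "1"), ("b", "2"), ("a", "3")], out_dir)
    print(keys)                    # ['a', 'b']
    print(read_key(out_dir, "a"))  # ['1', '3']
```

The input is iterated exactly once, and each per-key file can then be loaded independently, which mirrors reading the files into separate RDDs.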