Mark DataFrame/Dataset APIs stable

2016-10-12 Thread Reynold Xin
I took a look at all the public APIs we expose in o.a.spark.sql tonight, and realized we still have a large number of APIs that are marked experimental. Most of these haven't really changed, except in 2.0 we merged DataFrame and Dataset. I think it's long overdue to mark them stable. I'm tracking

Re: Spark Improvement Proposals

2016-10-12 Thread kant kodali
Some of you may have already seen this, but in case you haven't, you may want to check it out. http://www.slideshare.net/sbaltagi/flink-vs-spark On Tue, Oct 11, 2016 at 1:57 PM, Ryan Blue wrote: > I don't think we will have trouble with whatever rule that is

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-12 Thread msukmanowsky
As very heavy Spark users at Parse.ly, I just wanted to give a +1 to all of the issues raised by Holden and Ricardo. I'm also giving a talk at PyCon Canada on PySpark https://2016.pycon.ca/en/schedule/096-mike-sukmanowsky/. Being a Python shop, we were extremely pleased to learn about PySpark a

Memory leak warnings in Spark 2.0.1

2016-10-12 Thread vonnagy
I am getting excessive memory leak warnings when running multiple mappings and aggregations using Datasets. Is there anything I should be looking for to resolve this, or is this a known issue? WARN [Executor task launch worker-0] org.apache.spark.memory.TaskMemoryManager - leak 16.3 MB memory

Re: incorrect message that path appears to be local

2016-10-12 Thread Sean Owen
I'm not sure this is applied consistently across Spark, but I'm dealing with another change now where an unqualified path is assumed to be a local file. The method Utils.resolvePath implements this logic and is used in several places. Therefore I think this is probably intended behavior and you can
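
The rule described above, where a path without a scheme is treated as a local file, can be sketched as follows. This is a simplified illustration under stated assumptions and not Spark's actual Utils code: the names PathSketch and resolveUnqualified are hypothetical, and the sketch relies only on java.net.URI and java.io.File.

```scala
import java.io.File
import java.net.{URI, URISyntaxException}

// Hypothetical helper, not Spark's implementation: a path that already
// carries a scheme (hdfs://, s3a://, file:/) is kept as-is, while an
// unqualified path is assumed to refer to the local filesystem.
object PathSketch {
  def resolveUnqualified(path: String): URI = {
    // Strings that do not parse as a URI are treated like unqualified paths.
    val parsed: Option[URI] =
      try Some(new URI(path))
      catch { case _: URISyntaxException => None }

    parsed
      .filter(_.getScheme != null)
      .getOrElse(new File(path).getAbsoluteFile.toURI)
  }

  def main(args: Array[String]): Unit = {
    println(resolveUnqualified("hdfs://namenode:8020/checkpoints"))
    println(resolveUnqualified("/tmp"))
  }
}
```

Under this assumption a bare /tmp resolves to a file: URI, so a job on a cluster would likely need a fully qualified path (for example an hdfs:// URI) to avoid being treated as local.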

incorrect message that path appears to be local

2016-10-12 Thread Koert Kuipers
I see this warning when running jobs on a cluster: 2016-10-12 14:46:47 WARN spark.SparkContext: Spark is not running in local mode, therefore the checkpoint directory must not be on the local filesystem. Directory '/tmp' appears to be on the local filesystem. However, the checkpoint "directory"

Re: `Project` not preserving child partitioning ?

2016-10-12 Thread Tejas Patil
Sure :) Thanks, Tejas On Wed, Oct 12, 2016 at 11:26 AM, Reynold Xin wrote: > It actually does -- but does it in a really weird way. > > UnaryExecNode actually defines: > > trait UnaryExecNode extends SparkPlan { > def child: SparkPlan > > override final def

Re: `Project` not preserving child partitioning ?

2016-10-12 Thread Reynold Xin
It actually does -- but does it in a really weird way. UnaryExecNode actually defines: trait UnaryExecNode extends SparkPlan { def child: SparkPlan override final def children: Seq[SparkPlan] = child :: Nil override def outputPartitioning: Partitioning = child.outputPartitioning } I
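
The delegation described above can be illustrated with a self-contained sketch. The types below are simplified stand-ins and not Spark's real classes: SparkPlan, Partitioning, HashPartitioning, ScanExec, and ProjectExec are reduced to the minimum needed here, and the column names are made up. Only the body of UnaryExecNode follows the definition quoted in the message.

```scala
// Simplified stand-ins for illustration only; not Spark's actual classes.
sealed trait Partitioning
case object UnknownPartitioning extends Partitioning
final case class HashPartitioning(columns: Seq[String], numPartitions: Int)
  extends Partitioning

trait SparkPlan {
  def children: Seq[SparkPlan]
  // Default: a node makes no claim about how its output is partitioned.
  def outputPartitioning: Partitioning = UnknownPartitioning
}

// Mirrors the trait quoted in the message: a unary node reports its
// child's partitioning unless a subclass overrides it.
trait UnaryExecNode extends SparkPlan {
  def child: SparkPlan
  override final def children: Seq[SparkPlan] = child :: Nil
  override def outputPartitioning: Partitioning = child.outputPartitioning
}

// Hypothetical leaf node that declares a fixed partitioning.
final case class ScanExec(partitioning: Partitioning) extends SparkPlan {
  override def children: Seq[SparkPlan] = Nil
  override def outputPartitioning: Partitioning = partitioning
}

// A projection that does not override outputPartitioning, so it picks up
// the child's partitioning through UnaryExecNode.
final case class ProjectExec(projectList: Seq[String], child: SparkPlan)
  extends UnaryExecNode

object PartitioningSketch {
  def demo(): Partitioning = {
    val scan = ScanExec(HashPartitioning(Seq("id"), 8))
    ProjectExec(Seq("id", "name"), scan).outputPartitioning
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

In this sketch the projection never mentions partitioning itself, which is why reading the operator's own source can make it look as if the child's partitioning is dropped.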

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-12 Thread Fred Reiss
On Tue, Oct 11, 2016 at 10:57 AM, Reynold Xin wrote: > > On Tue, Oct 11, 2016 at 10:55 AM, Michael Armbrust > wrote: > >> *Complex event processing and state management:* Several groups I've >>> talked to want to run a large number (tens or hundreds

`Project` not preserving child partitioning ?

2016-10-12 Thread Tejas Patil
See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala#L80 The Project operator preserves its child's sort ordering, but it does not preserve the child's output partitioning. I don't see any way projection would alter the partitioning of the

RFC / PRD: new executor & node blacklist mechanism (SPARK-8425)

2016-10-12 Thread Imran Rashid
Some new features are about to land in Spark to improve its ability to handle bad executors and nodes. These are significant changes, and we'd like to gather more input from the community about them, especially from folks who use *large clusters*. We've spent a lot of time discussing the