Re: [build system] short downtime next thursday morning, 5-12-16 @ 8am PDT

2016-05-11 Thread shane knapp
reminder: this is happening tomorrow morning!
7am PDT: builds paused
8am PDT: master reboot, upgrade happens
9am PDT: builds restarted
On Mon, May 9, 2016 at 4:17 PM, shane knapp wrote: > reminder: this is happening thursday morning. > > On Wed, May 4, 2016 at 11:38

Re: Adding HDFS read-time metrics per task (RE: SPARK-1683)

2016-05-11 Thread Reynold Xin
Adding Kay On Wed, May 11, 2016 at 12:01 PM, Brian Cho wrote: > Hi, > > I'm interested in adding read-time (from HDFS) to Task Metrics. The > motivation is to help debug performance issues. After some digging, it's > briefly mentioned in SPARK-1683 that this feature didn't
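For context, the closest existing hook for this kind of debugging is the listener API: TaskMetrics already exposes input bytes and records read, though not read *time*, which is the gap this thread is about. A minimal sketch against the Spark 2.x listener API, assuming a running SparkContext named sc:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Logs per-task input metrics as tasks finish. Note that TaskMetrics
    // exposes bytes/records read from HDFS (or other sources) but not the
    // time spent reading -- the metric this thread proposes adding.
    class InputMetricsListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {   // metrics can be null for failed tasks
          val in = metrics.inputMetrics
          println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
            s"bytesRead=${in.bytesRead} recordsRead=${in.recordsRead}")
        }
      }
    }

    sc.addSparkListener(new InputMetricsListener)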

Shrinking the DataFrame lineage

2016-05-11 Thread Ulanov, Alexander
Dear Spark developers, Recently, I was trying to switch my code from RDDs to DataFrames in order to compare the performance. The code computes an RDD in a loop. I use RDD.persist followed by RDD.count to force Spark to compute the RDD and cache it, so that it does not need to re-compute it on each
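The RDD pattern being described looks roughly like the sketch below, where update and iterations stand in for the actual per-step computation; periodic checkpointing is the usual way to keep the lineage from growing without bound:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // Stand-ins for the real loop body (illustrative only).
    def update(r: RDD[Double]): RDD[Double] = r.map(_ * 1.01)
    val iterations = 50

    sc.setCheckpointDir("/tmp/spark-checkpoints")
    var rdd: RDD[Double] = sc.parallelize(1 to 1000000).map(_.toDouble)

    for (i <- 1 to iterations) {
      val next = update(rdd)
      next.persist(StorageLevel.MEMORY_ONLY)
      if (i % 10 == 0) next.checkpoint()  // truncate the lineage periodically
      next.count()                        // action: force computation and caching
      rdd.unpersist()
      rdd = next
    }

DataFrames in 1.6 have persist/count but no checkpoint(); one workaround for a growing query plan is to round-trip through the RDD API, e.g. sqlContext.createDataFrame(df.rdd, df.schema), which starts a fresh plan from the materialized rows.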

Re: dataframe udf function will be executed twice when filtering on new column created by withColumn

2016-05-11 Thread James Hammerton
This may be related to: https://issues.apache.org/jira/browse/SPARK-13773 Regards, James On 11 May 2016 at 15:49, Ted Yu wrote: > In the master branch, the behavior is the same. > > Suggest opening a JIRA if you haven't done so. > > On Wed, May 11, 2016 at 6:55 AM, Tony Jin

Re: dataframe udf function will be executed twice when filtering on new column created by withColumn

2016-05-11 Thread Ted Yu
In the master branch, the behavior is the same. Suggest opening a JIRA if you haven't done so. On Wed, May 11, 2016 at 6:55 AM, Tony Jin wrote: > Hi guys, > > I have a problem with Spark DataFrames. My Spark version is 1.6.1. > Basically, I used udf and df.withColumn to create a

dataframe udf function will be executed twice when filtering on new column created by withColumn

2016-05-11 Thread Tony Jin
Hi guys, I have a problem with Spark DataFrames. My Spark version is 1.6.1. Basically, I used udf and df.withColumn to create a "new" column, and then I filter the values on this new column and call show (an action). I see that the udf function (which is used by withColumn to create the new column) is
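A minimal sketch (Spark 1.6-style API) of the kind of reproduction being described; the println inside the udf makes the double evaluation visible in the logs. Column names and values here are illustrative:

    import org.apache.spark.sql.functions.udf
    import sqlContext.implicits._

    // Printing inside the udf makes each invocation visible. With the
    // filter below, the udf can fire twice per row -- once when the
    // predicate is evaluated and once for the projected output column,
    // because the optimizer duplicates the expression rather than
    // reusing the computed column value.
    val double = udf { x: Long =>
      println(s"udf evaluated for x=$x")
      x * 2
    }

    val df = sqlContext.range(5).withColumn("doubled", double($"id"))
    df.filter($"doubled" > 4).show()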

Re: Structured Streaming with Kafka source/sink

2016-05-11 Thread Ted Yu
Please see this thread: http://search-hadoop.com/m/q3RTt9XAz651PiG/Adhoc+queries+spark+streaming=Re+Adhoc+queries+on+Spark+2+0+with+Structured+Streaming > On May 11, 2016, at 1:47 AM, Ofir Manor wrote: > > Hi, > I'm trying out Structured Streaming from current 2.0

Structured Streaming with Kafka source/sink

2016-05-11 Thread Ofir Manor
Hi, I'm trying out Structured Streaming from the current 2.0 branch. Does the branch currently support Kafka as either a source or a sink? I couldn't find a specific JIRA or design doc for that in SPARK-8360 or in the examples... Is it still targeted for 2.0? Also, I naively assume it will look similar
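For what it's worth, the Kafka source did not ship with the initial 2.0 release; it arrived in later releases via the spark-sql-kafka-0-10 package (the sink later still), with an API roughly like this sketch. Broker addresses and topic names are placeholders, and spark is an active SparkSession:

    // Kafka source: read a stream of (key, value) records.
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "events")
      .load()

    // Kafka sink: write transformed records back out. Kafka expects
    // string/binary key and value columns, hence the casts.
    val query = input
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("topic", "events-out")
      .option("checkpointLocation", "/tmp/checkpoints/events")
      .start()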