Re: DataFrame Min By Column

2016-07-09 Thread Michael Armbrust
I would guess that using the built-in min/max/struct functions will be much faster than a UDAF. They should have native internal implementations that utilize code generation.

Spark application Runtime Measurement

2016-07-09 Thread Fei Hu
Dear all, I have a question about how to measure the runtime of a Spark application. Here is an example: on the Spark UI, the total duration is 2.0 minutes = 120 seconds, as shown here: [image: Screen Shot 2016-07-09 at 11.45.44 PM.png]. However, when I check the jobs launched
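
One possible source of the gap, though not stated in the snippet: the UI's total application duration also covers driver startup and executor allocation, while per-job times cover only the jobs themselves. A minimal sketch (hypothetical TimedJob, not from the thread) that times just the work:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TimedJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("timed-job"))
    val start = System.nanoTime()
    // any action that triggers the work being measured
    val count = sc.parallelize(1 to 1000000).map(_ * 2).count()
    val elapsedSeconds = (System.nanoTime() - start) / 1e9
    // measures only this job, not JVM/executor startup,
    // which the UI's total application duration includes
    println(s"count=$count took $elapsedSeconds s")
    sc.stop()
  }
}
```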

Re: DataFrame Min By Column

2016-07-09 Thread Pedro Rodriguez
Thanks Michael, That seems like the analog to sorting tuples. I am curious, is there a significant performance penalty to the UDAF versus that? It's certainly nicer and more compact code at least. — Pedro Rodriguez

Re: DataFrame Min By Column

2016-07-09 Thread Michael Armbrust
You can do what's called an *argmax/argmin*, where you take the min/max of a couple of columns that have been grouped together as a struct. We sort in column order, so you can put the timestamp first. Here is an example
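
The example itself is cut off in the archive; below is a sketch of the pattern described, assuming a DataFrame df with key, timestamp, and value columns (all names hypothetical):

```scala
import org.apache.spark.sql.functions.{col, min, struct}

// structs compare field by field in declaration order, so putting the
// timestamp first makes min(struct(...)) pick the row with the earliest timestamp
val earliest = df.groupBy(col("key"))
  .agg(min(struct(col("timestamp"), col("value"))).as("m"))
  .select(col("key"), col("m.timestamp"), col("m.value"))
```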

Re: Broadcast hash join implementation in Spark

2016-07-09 Thread Lalitha MV
Hi Jagat, This property only defines the size threshold below which a table is considered small enough to broadcast for a broadcast hash join. Lalitha
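
For reference, a sketch of tuning that threshold from a Spark 1.6 shell (where sqlContext is predefined); the 10 MB default and the -1 sentinel are from the Spark docs:

```scala
// raise the broadcast threshold to 100 MB (default is 10485760 bytes = 10 MB);
// tables whose size is below this may be broadcast for a hash join
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)

// setting it to -1 disables broadcast joins entirely
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
```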

Re: problem making Zeppelin 0.6 work with Spark 1.6.1, throwing jackson.databind.JsonMappingException exception

2016-07-09 Thread Pedro Rodriguez
It would be helpful if you included relevant configuration files from each, or noted if you are using the defaults, particularly any changes to class paths. I upgraded Zeppelin to 0.6.0 at work and at home without any issue, so it is hard to say more without more details. — Pedro Rodriguez

Re: problem making Zeppelin 0.6 work with Spark 1.6.1, throwing jackson.databind.JsonMappingException exception

2016-07-09 Thread Chanh Le
Hi, This is weird, because I have been using Zeppelin since version 0.5.6 and upgraded to 0.6.0 a couple of days ago; both work fine with Spark 1.6.1. For 0.6.0 I am using zeppelin-0.6.0-bin-netinst.

problem making Zeppelin 0.6 work with Spark 1.6.1, throwing jackson.databind.JsonMappingException exception

2016-07-09 Thread Mich Talebzadeh
Hi, I just installed the latest Zeppelin 0.6 as follows: Source: zeppelin-0.6.0-bin-all With Spark 1.6.1 Now I am getting this issue with Jackson. Some searching suggested this is caused by the classpath providing a different version of Jackson than the one Spark is
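
Not part of the original message, but one generic JVM diagnostic for this kind of conflict is to ask which jar actually supplied jackson-databind at runtime, e.g. from a Zeppelin paragraph or spark-shell:

```scala
// prints the jar the running JVM loaded ObjectMapper from,
// revealing which copy of Jackson won on the classpath
val src = classOf[com.fasterxml.jackson.databind.ObjectMapper]
  .getProtectionDomain.getCodeSource
println(if (src == null) "bootstrap classpath" else src.getLocation)
```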

Re: DataFrame Min By Column

2016-07-09 Thread Pedro Rodriguez
I implemented a more generic version, which I posted here: https://gist.github.com/EntilZha/3951769a011389fef25e930258c20a2a I think I could generalize this by pattern matching on DataType to use different getLong/getDouble/etc. functions (not trying to use getAs[] because getting T from
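
The gist is not reproduced in the archive; a minimal sketch of the kind of DataType dispatch being described (hypothetical extractField helper, assuming the Row carries a schema):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// dispatch on the column's declared DataType rather than getAs[T],
// since the type parameter T is not available at runtime
def extractField(row: Row, i: Int): Any = row.schema(i).dataType match {
  case LongType    => row.getLong(i)
  case IntegerType => row.getInt(i)
  case DoubleType  => row.getDouble(i)
  case StringType  => row.getString(i)
  case _           => row.get(i) // fall back to the boxed value
}
```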

Re: Spark 2.0 Release Date

2016-07-09 Thread Taotao.Li
Docs I found that are not yet officially released: - 2.0.0-preview: http://spark.apache.org/docs/2.0.0-preview/ - master docs: http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/index.html - 2.0.0 docs:

Re: 回复: Bug about reading parquet files

2016-07-09 Thread Cheng Lian
According to our offline discussion, the target table consists of 1M+ small Parquet files (~12 MB on average). The OOM occurred on the driver side while listing input files. My theory is that the total size of all listed FileStatus objects is too large for the driver and caused the OOM.
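
The thread's resolution is not shown; besides raising spark.driver.memory, one common mitigation for the small-files problem (a general technique, not advice from this thread) is a one-off compaction pass:

```scala
// rewrite ~1M tiny Parquet files into far fewer, larger ones, so the driver
// holds far fewer FileStatus objects while listing input (paths are placeholders)
val df = sqlContext.read.parquet("/path/to/small-files")
df.coalesce(512) // hypothetical target file count; tune to total data size
  .write.parquet("/path/to/compacted")
```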

Re: DataFrame Min By Column

2016-07-09 Thread Pedro Rodriguez
Hi Xinh, A co-worker also found that solution, but I thought it was possibly overkill/brittle, so I looked into UDAFs (user-defined aggregate functions). I don't have code, but Databricks has a post that has an example
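
The Databricks post is not quoted here; as a sketch only, a "min by column" UDAF in the Spark 1.5+ API might look like this (hypothetical MinByLong, assuming the ordering key and the value are both longs):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// returns the `value` of the row whose `key` is smallest within each group
class MinByLong extends UserDefinedAggregateFunction {
  def inputSchema: StructType =
    StructType(StructField("key", LongType) :: StructField("value", LongType) :: Nil)
  def bufferSchema: StructType =
    StructType(StructField("minKey", LongType) :: StructField("minValue", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Long.MaxValue // sentinel: any real key beats this
    buffer(1) = 0L
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0) && input.getLong(0) < buffer.getLong(0)) {
      buffer(0) = input.getLong(0)
      buffer(1) = input.getLong(1)
    }
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    if (buffer2.getLong(0) < buffer1.getLong(0)) {
      buffer1(0) = buffer2.getLong(0)
      buffer1(1) = buffer2.getLong(1)
    }
  }
  def evaluate(buffer: Row): Any = buffer.getLong(1)
}

// usage: df.groupBy(col("id")).agg(new MinByLong()(col("timestamp"), col("value")))
```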

Re: Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-09 Thread Yu Wei
I tried flushing the information to an external system in cluster mode. It works well. I suspect that in yarn-cluster mode, stdout is closed.
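
A note beyond the thread: in yarn-cluster mode stdout is not closed but redirected to the YARN container log files (viewable with yarn logs -applicationId <appId>) rather than the submitting terminal. A common alternative is to log through log4j, sketched here:

```scala
import org.apache.log4j.Logger

object StreamingApp {
  // log4j output lands in the YARN container logs in any deploy mode,
  // avoiding confusion about where println went in yarn-cluster
  @transient lazy val log = Logger.getLogger(getClass)

  def process(record: String): Unit = {
    log.info(s"flushed to external system: $record")
  }
}
```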

Re: Broadcast hash join implementation in Spark

2016-07-09 Thread Jagat Singh
Hi, Please see the property spark.sql.autoBroadcastJoinThreshold here: http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options Thanks, Jagat Singh

Re: Spark performance testing

2016-07-09 Thread Mich Talebzadeh
Hi Andrew, I suggest that you narrow the scope of your performance testing by using the same setup and making incremental changes, keeping other variables the same. Spark itself can run in local, standalone, yarn-client, and yarn-cluster modes, so you really need to target a particular setup of run
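
To make that concrete, a sketch of pinning one deployment mode per benchmark run (master URLs are illustrative placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// fix one mode per benchmark run so timings stay comparable
val conf = new SparkConf().setAppName("perf-test")
  .setMaster("local[4]")               // local mode, 4 threads
  // .setMaster("spark://master:7077") // standalone cluster
  // YARN client/cluster modes are usually chosen at submit time:
  //   spark-submit --master yarn --deploy-mode client|cluster
val sc = new SparkContext(conf)
```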