Re: [discuss] dropping Python 2.6 support

2016-01-09 Thread Sasha Kacanski
+1 Companies that use stock Python 2.6 on Red Hat will need to upgrade or install a fresh version, which takes a total of 3.5 minutes, so no issues ... On Tue, Jan 5, 2016 at 2:17 AM, Reynold Xin wrote: > Does anybody here care about us dropping support for Python 2.6 in Spark > 2.0?

Re: pyspark: conditionals inside functions

2016-01-09 Thread Maciej Szymkiewicz
On 01/09/2016 04:45 AM, Franc Carter wrote: > > Hi, > > I'm trying to write a short function that returns the last sunday of > the week of a given date, code below > > def getSunday(day): > > day = day.cast("date") > > sun = next_day(day, "Sunday") > >

Re: [discuss] dropping Python 2.6 support

2016-01-09 Thread Sean Owen
Chiming in late, but my take on this line of argument is: these companies are welcome to keep using Spark 1.x. If anything the argument here is about how long to maintain 1.x, and indeed, it's going to go dormant quite soon. But using RHEL 6 (or any old-er version of any platform) and not wanting

broadcast params to workers at the very beginning

2016-01-09 Thread octavian.ganea
Hi, In my app, I have a Params scala object that keeps all the specific (hyper)parameters of my program. This object is read in each worker. I would like to be able to pass specific values of the Params' fields in the command line. One way would be to simply update all the fields of the Params

Re: How to merge two large table and remove duplicates?

2016-01-09 Thread Ted Yu
See the first half of this wiki: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO > On Jan 9, 2016, at 1:02 AM, Gavin Yue wrote: > > So I tried to set the parquet compression codec to lzo, but hadoop does not > have the lzo natives, while lz4 does

Re: [discuss] dropping Python 2.6 support

2016-01-09 Thread Jacek Laskowski
On Sat, Jan 9, 2016 at 1:48 PM, Sean Owen wrote: > (For similar reasons I personally don't favor supporting Java 7 or > Scala 2.10 in Spark 2.x.) That reflects my sentiments as well. Thanks Sean for bringing that up! Jacek

Re: pyspark: conditionals inside functions

2016-01-09 Thread Franc Carter
My Python is not particularly good, so I'm afraid I don't understand what that means. Cheers On 9 January 2016 at 14:45, Franc Carter wrote: > > Hi, > > I'm trying to write a short function that returns the last sunday of the > week of a given date, code below > > def

pyspark: calculating row deltas

2016-01-09 Thread Franc Carter
Hi, I have a DataFrame with the columns ID,Year,Value I'd like to create a new Column that is Value2-Value1 where the corresponding Year2=Year-1 At the moment I am creating a new DataFrame with renamed columns and doing DF.join(DF2, . . . .) This looks cumbersome to me, is there
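The self-join described in this message pairs each (ID, Year) row with the row for the same ID in the previous year. As a rough illustration of that pairing only, here is a plain-Python sketch (not PySpark code); the function name `year_over_year_deltas` and the sample rows are hypothetical, and rows without a previous year are dropped, as an inner join would drop them.

```python
def year_over_year_deltas(rows):
    """Return (id, year, value - previous_year_value) for each row
    that has a matching row with the same id and year - 1.

    `rows` is an iterable of (id, year, value) tuples. This mimics an
    inner self-join on ID and Year = Year - 1; it is an illustration
    of the logic, not a Spark implementation.
    """
    rows = list(rows)
    # Index every value by (id, year) so the "joined" row is a dict lookup.
    by_key = {(row_id, year): value for row_id, year, value in rows}
    deltas = []
    for row_id, year, value in rows:
        previous_key = (row_id, year - 1)
        if previous_key in by_key:
            deltas.append((row_id, year, value - by_key[previous_key]))
    return deltas
```

For example, `year_over_year_deltas([("a", 2014, 10), ("a", 2015, 13), ("b", 2015, 5)])` keeps only the row for ID "a" in 2015, because it is the only one with a previous-year partner.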

Best IDE Configuration

2016-01-09 Thread Jorge Machado
Hello everyone, I'm just wondering how do you guys develop for spark. For example I cannot find any decent documentation for connecting Spark to Eclipse using maven or sbt. Is there any link around? Jorge thanks

spark access old version of Hadoop 2.1.0 and Hive version 0.11

2016-01-09 Thread Jade Liu
Hi, All: I'm trying to read and write from the hdfs cluster using SparkSQL hive context. My current build of spark is 1.5.2. The problem is that currently our company has a very old version of hdfs (hadoop 2.1.0) and hive metastore (0.11) using Hortonworks bundle. One of the possible solution

Re: Best IDE Configuration

2016-01-09 Thread Ted Yu
Please take a look at: https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup On Sat, Jan 9, 2016 at 11:16 AM, Jorge Machado wrote: > Hello everyone, > > > I'm just wondering how do you guys develop for spark. > > For example I

Re: org.apache.spark.storage.BlockNotFoundException in Spark1.5.2+Tachyon0.7.1

2016-01-09 Thread Gene Pang
Yes, the tiered storage feature in Tachyon can address this issue. Here is a link to more information: http://tachyon-project.org/documentation/Tiered-Storage-on-Tachyon.html Thanks, Gene On Wed, Jan 6, 2016 at 8:44 PM, Ted Yu wrote: > Have you seen this thread ? > >

java.lang.NoClassDefFoundError even when use sc.addJar

2016-01-09 Thread rayqiu
Code: val sc = new SparkContext(sparkConf) sc.addJar("/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-streaming-kafka-assembly_2.10-1.6.0.jar") >spark-submit --class "GeoIP" target/scala-2.10/geoip-assembly-1.0.jar Show jar added: 16/01/09 16:05:20 INFO SparkContext: Added JAR

StandardScaler in spark.ml.feature requires vector input?

2016-01-09 Thread Kristina Rogale Plazonic
Hi, The code below gives me an unexpected result. I expected that StandardScaler (in ml, not mllib) will take a specified column of an input dataframe and subtract the mean of the column and divide the difference by the standard deviation of the dataframe column. However, Spark gives me the

Re: pyspark: conditionals inside functions

2016-01-09 Thread Franc Carter
Got it, I needed to use the when/otherwise construct - code below
    def getSunday(day):
        day = day.cast("date")
        sun = next_day(day, "Sunday")
        n = datediff(sun, day)
        x = when(n == 7, day).otherwise(sun)
        return x
On 10 January 2016 at 08:41, Franc Carter
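To make the per-row behaviour of the getSunday function in the message above easier to follow, here is a plain-Python sketch of the same logic using only `datetime`. It is not Spark code: the helper `next_day` below is a hypothetical stand-in that imitates Spark's `next_day` (the first matching weekday strictly after the given date), and `get_sunday` mirrors the when/otherwise branch.

```python
from datetime import date, timedelta

SUNDAY = 6  # datetime.date.weekday(): Monday is 0, Sunday is 6


def next_day(day, weekday):
    """First date strictly after `day` that falls on `weekday`."""
    # The result is always 1..7 days ahead, never `day` itself.
    days_ahead = (weekday - day.weekday() - 1) % 7 + 1
    return day + timedelta(days=days_ahead)


def get_sunday(day):
    """Return `day` if it is a Sunday, otherwise the following Sunday."""
    sun = next_day(day, SUNDAY)
    n = (sun - day).days
    # n == 7 only when `day` is already a Sunday, which is the case
    # the when(n == 7, day).otherwise(sun) expression handles.
    return day if n == 7 else sun
```

The conditional is needed because the next-Sunday lookup never returns the input date: for a Sunday input it jumps a full week ahead, so the difference of 7 days is the signal to keep the original date.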

Re: How to merge two large table and remove duplicates?

2016-01-09 Thread Gavin Yue
I saw in the document that the value is LZO. Is it LZO or LZ4? https://github.com/Cyan4973/lz4 Based on this benchmark, they differ quite a lot. On Fri, Jan 8, 2016 at 9:55 PM, Ted Yu wrote: > gzip is relatively slow. It consumes much CPU. > > snappy is faster. > > LZ4

Re: How to merge two large table and remove duplicates?

2016-01-09 Thread Gavin Yue
So I tried to set the parquet compression codec to lzo, but hadoop does not have the lzo natives, while lz4 is included. But I could not set the codec to lz4; it only accepts lzo. Any solution here? Thanks, Gavin On Sat, Jan 9, 2016 at 12:09 AM, Gavin Yue wrote: > I saw

Re: How to merge two large table and remove duplicates?

2016-01-09 Thread Sayan Sanyal
Unsubscribe Sent from Outlook Mobile _ From: Gavin Yue Sent: Saturday, January 9, 2016 14:33 Subject: Re: How to merge two large table and remove duplicates? To: Ted Yu Cc: Benyi Wang , user