Re: how to use lit() in spark-java

2018-03-23 Thread Anil Langote
You have to import functions: dataset.withColumn(columnName, functions.lit("constant")). Thank you, Anil Langote. Sent from my iPhone. From: 崔苗 <cuim...@danale.com> Sent: Friday, March 23, 2018 8:33 AM Subject: how to use lit() in spark-java To: <user@spa
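A minimal Scala sketch of the same org.apache.spark.sql.functions.lit call the reply shows for the Java API; the Dataset and column names are hypothetical and it assumes a spark-shell session where `spark` is the SparkSession:

    import org.apache.spark.sql.functions.lit
    import spark.implicits._

    // Hypothetical two-column Dataset; lit() wraps a constant value so it can be used as a Column.
    val ds = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    val withConst = ds.withColumn("source", lit("constant"))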

Re: Spark Inner Join on pivoted datasets results in an empty dataset

2017-10-19 Thread Anil Langote
Is there any limit on the number of columns used in an inner join? Thank you, Anil Langote. Sent from my iPhone. From: Anil Langote <anillangote0...@gmail.com> Sent: Thursday, October 19, 2017 5:01 PM Subject: Spark Inner Joi

Spark Inner Join on pivoted datasets results in an empty dataset

2017-10-19 Thread Anil Langote
0 records. Is there anything I am missing here? Is there any better way to pivot the multiple columns? I cannot combine them, because my aggregation columns are arrays of doubles. The pivot1 and pivot2 datasets are derived from the same parent dataset and the group-by columns are the same; all I am doing is an inner join of these two datasets on the same group-by columns, so why doesn't it work? Thank you, Anil Langote
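A minimal sketch of the pattern being described, with hypothetical data and column names (assuming a spark-shell session). Joining on a Seq of key-column names keeps a single copy of the keys and avoids ambiguous-column problems; it is also worth checking that the key columns contain no nulls, since rows with null join keys never match in an inner join:

    import org.apache.spark.sql.functions.sum
    import spark.implicits._

    // Hypothetical parent dataset with two group-by keys and two pivot columns.
    val parent = Seq(
      ("g1", "x", "p", "u", 1.0),
      ("g1", "x", "q", "v", 2.0),
      ("g2", "y", "p", "u", 3.0)
    ).toDF("key1", "key2", "pivotA", "pivotB", "value")

    val pivot1 = parent.groupBy("key1", "key2").pivot("pivotA").agg(sum("value"))
    val pivot2 = parent.groupBy("key1", "key2").pivot("pivotB").agg(sum("value"))

    // Inner join on the shared group-by columns.
    val joined = pivot1.join(pivot2, Seq("key1", "key2"), "inner")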

Issue with caching

2017-01-27 Thread Anil Langote
with the same configuration it takes 40 mins; why is this happening? Best Regards, Anil Langote +1-425-633-9747
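Since the thread is about caching, a minimal sketch of explicit caching plus an action to materialize it; the dataset, names, and storage level are assumptions for illustration:

    import org.apache.spark.storage.StorageLevel

    val ds = spark.range(0, 1000000).toDF("id")

    val cached = ds.persist(StorageLevel.MEMORY_AND_DISK)
    cached.count()   // an action is needed once so the cache is actually populated;
                     // later actions on `cached` then reuse it instead of recomputing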

Re: Efficient look up in Key Pair RDD

2017-01-08 Thread Anil Langote
you Anil Langote +1-425-633-9747 From: ayan guha <guha.a...@gmail.com> Date: Sunday, January 8, 2017 at 10:26 PM To: Anil Langote <anillangote0...@gmail.com> Cc: Holden Karau <hol...@pigscanfly.ca>, user <user@spark.apache.org> Subject: Re: Efficient look up in K

Re: Efficient look up in Key Pair RDD

2017-01-08 Thread Anil Langote
use case. Best Regards, Anil Langote +1-425-633-9747 > On Jan 8, 2017, at 8:17 PM, Holden Karau <hol...@pigscanfly.ca> wrote: > > To start with, caching and having a known partitioner will help a bit; then > there is also the IndexedRDD project, but in general Spark might not be t
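A minimal sketch of the caching-plus-known-partitioner advice from the quoted reply; the pair RDD, key type, and partition count are assumptions, and it assumes `sc` is the SparkContext from a spark-shell session:

    import org.apache.spark.HashPartitioner

    // Hypothetical pair RDD keyed by String.
    val pairs = sc.parallelize(Seq("a" -> Array(1.0, 2.0), "b" -> Array(3.0, 4.0)))

    // With a known partitioner, lookup() only scans the single partition that can
    // hold the key; persisting avoids recomputing the RDD on every repeated lookup.
    val indexed = pairs.partitionBy(new HashPartitioner(8)).persist()
    indexed.count()                 // materialize the cache once

    val hits: Seq[Array[Double]] = indexed.lookup("a")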

Efficient look up in Key Pair RDD

2017-01-08 Thread Anil Langote
by a given key? Thank you Anil Langote

Spark Aggregator for array of doubles

2017-01-04 Thread Anil Langote
can be done in Scala only. How can I define an Aggregator which takes an array of doubles as input? Note that I have a Parquet file as my input. Any pointers are highly appreciated; I read that a Spark UDAF is slow and Aggregators are the way to go. Best Regards, Anil Langote +1-425-633-9747
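A minimal Scala sketch of an Aggregator that sums an array-of-doubles column element-wise; the Kryo encoders are just a short way to make the sketch compile and are an assumption, not a recommendation:

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    object ArraySum extends Aggregator[Array[Double], Array[Double], Array[Double]] {
      // Empty buffer; the first reduce call adopts the length of the incoming row.
      def zero: Array[Double] = Array.empty[Double]

      def reduce(buf: Array[Double], row: Array[Double]): Array[Double] = merge(buf, row)

      def merge(b1: Array[Double], b2: Array[Double]): Array[Double] =
        if (b1.isEmpty) b2
        else if (b2.isEmpty) b1
        else b1.zip(b2).map { case (x, y) => x + y }

      def finish(reduction: Array[Double]): Array[Double] = reduction

      def bufferEncoder: Encoder[Array[Double]] = Encoders.kryo[Array[Double]]
      def outputEncoder: Encoder[Array[Double]] = Encoders.kryo[Array[Double]]
    }

On a typed Dataset such as a hypothetical Dataset[(String, Array[Double])] read from the Parquet file, it could be applied with groupByKey(_._1).mapValues(_._2).agg(ArraySum.toColumn).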

Re: Parquet with group by queries

2016-12-21 Thread Anil Langote
count(*), col1, col2, col3, aggregationFunction(doublecol) from table group by col1, col2, col3 having count(*) > 1. The above query's group-by columns will change; similarly, I have to run 100 queries on the same data set. Best Regards, Anil Langote +1-425-633-9747 > On Dec 21, 2016, at 11:41 AM
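A minimal sketch of running one such query against a Parquet source; the path, view name, and the choice of sum() as the aggregate are assumptions. When the same file backs all 100 query variants, caching the DataFrame once before running them avoids re-reading the Parquet data each time:

    val df = spark.read.parquet("/path/to/input.parquet")   // hypothetical path
    df.cache()                                               // reused by all query variants
    df.createOrReplaceTempView("mytable")

    val result = spark.sql(
      """SELECT count(*), col1, col2, col3, sum(doublecol)
        |FROM mytable
        |GROUP BY col1, col2, col3
        |HAVING count(*) > 1""".stripMargin)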

Parquet with group by queries

2016-12-21 Thread Anil Langote
in this regard is appreciated. Best Regards, Anil Langote +1-425-633-9747

Re: DataSet is not able to handle 50,000 columns to sum

2016-11-11 Thread Anil Langote
are suggesting. Best Regards, Anil Langote +1-425-633-9747 > On Nov 11, 2016, at 7:10 PM, ayan guha <guha.a...@gmail.com> wrote: > > You can explore grouping sets in SQL and write an aggregate function to add > an array-wise sum. > > It will boil down to something like
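A minimal sketch of the GROUPING SETS idea from the quoted reply; the table, column names, and plain sum() aggregate are assumptions (in the thread's real case an array-wise aggregate would replace sum()):

    import spark.implicits._

    // Hypothetical table; GROUPING SETS computes several group-bys in a single pass.
    val df = Seq(("a", "x", 1.0), ("a", "y", 2.0), ("b", "x", 3.0)).toDF("col1", "col2", "doublecol")
    df.createOrReplaceTempView("t")

    val rolled = spark.sql(
      """SELECT col1, col2, sum(doublecol) AS total
        |FROM t
        |GROUP BY col1, col2
        |GROUPING SETS ((col1, col2), (col1), ())""".stripMargin)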

DataSet is not able to handle 50,000 columns to sum

2016-11-11 Thread Anil Langote
sults against the keys. The same process will be repeated for the next combinations. Thank you, Anil Langote +1-425-633-9747

Running YARN with Spark not working with Java 8

2016-08-25 Thread Anil Langote
Hi All, I have a cluster with 1 master and 6 slaves which uses the pre-built version of Hadoop 2.6.0 and Spark 1.6.2. I was running Hadoop MR and Spark jobs without any problem with OpenJDK 7 installed on all the nodes. However, when I upgraded OpenJDK 7 to OpenJDK 8 on all nodes, spark submit and
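The message is cut off here, but a frequently reported symptom when moving a YARN cluster to Java 8 is containers being killed for exceeding virtual-memory limits, because Java 8 reserves more virtual memory than Java 7. Whether that is the cause in this thread is only an assumption; the usual workaround is a yarn-site.xml change on every NodeManager (followed by a restart), for example:

    <!-- Assumption: relax or disable YARN's virtual-memory check. -->
    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
    </property>
    <property>
      <name>yarn.nodemanager.vmem-pmem-ratio</name>
      <value>4</value>
    </property>

Raising spark.yarn.executor.memoryOverhead on spark-submit is another commonly used knob for the same symptom.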

Append is not working with data frame

2016-04-20 Thread Anil Langote
days of data. Thank you, Anil Langote > On Apr 20, 2016, at 1:12 PM, Wei Chen <wei.chen.ri...@gmail.com> wrote: > > Found it. In case someone else is looking for this: > cvModel.bestModel.asInstanceOf[org.apache.spark.ml.classification.LogisticRegressionModel].weights >
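A minimal sketch of an append-mode DataFrame write of the kind the subject line refers to; the data, partition column, and output path are hypothetical, and it assumes a spark-shell session:

    import org.apache.spark.sql.SaveMode
    import spark.implicits._

    // Hypothetical one-day slice to add to an existing Parquet dataset.
    val newDay = Seq(("2016-04-20", 1.0), ("2016-04-20", 2.5)).toDF("date", "value")

    newDay.write
      .mode(SaveMode.Append)          // add to what is already there instead of overwriting
      .partitionBy("date")
      .parquet("/path/to/daily_data") // hypothetical output path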