Writing UDF with variable number of arguments

2015-10-05 Thread tridib
Hi Friends, I want to write a UDF which takes a variable number of arguments with varying types, e.g. myudf(String key1, String value1, String key2, int value2, ...). What is the best way to do it in Spark? Thanks Tridib
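A hedged sketch of one common workaround, since Spark SQL UDFs are registered with a fixed arity: pack the varying arguments into a single array column (casting non-strings) so the UDF receives them as one Seq. The body of myudf and the call-site columns are illustrative, not from the original post.

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`

  // The UDF sees a single Seq[String] holding key1, value1, key2, value2, ...
  sqlContext.udf.register("myudf", (kv: Seq[String]) =>
    kv.grouped(2).map { case Seq(k, v) => s"$k=$v" }.mkString(","))

  // The call site packs the arguments into an array, casting non-strings:
  //   SELECT myudf(array(key1, value1, key2, CAST(value2 AS STRING))) FROM t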

RE: nested collection object query

2015-09-29 Thread Tridib Samanta
Well, I figured out a way to use explode. But it returns two rows if there are two matches in the nested array objects: select id from department LATERAL VIEW explode(employee) dummy_table as emp where emp.name = 'employee0'. I was looking for an operator that loops through the array and returns true if any element matches.
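A hedged sketch of such an operator as a boolean UDF (the name hasEmployee is mine): it scans the nested array once and returns true on the first match, so a matching department comes back as a single row instead of one row per matching element.

  import org.apache.spark.sql.Row

  // Elements of an array-of-struct column arrive in a Scala UDF as Rows.
  sqlContext.udf.register("hasEmployee", (emps: Seq[Row], target: String) =>
    emps != null && emps.exists(_.getAs[String]("name") == target))

  // SELECT id FROM department WHERE hasEmployee(employee, 'employee0')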

RE: nested collection object query

2015-09-28 Thread Tridib Samanta
Thanks for your response, Yong! The array syntax works fine, but I am not sure how to use explode. Should I use it as follows? select id from department where explode(employee).name = 'employee0'. This query gives me java.lang.UnsupportedOperationException. I am using HiveContext.

nested collection object query

2015-09-28 Thread tridib
Hi Friends, what is the right syntax to query a collection of nested objects? I have the following schema and SQL, but it does not return anything. Is the syntax correct?
root
 |-- id: string (nullable = false)
 |-- employee: array (nullable = false)
 |    |-- element: struct (containsNull = true)

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-23 Thread tridib
Setting spark.sql.shuffle.partitions = 2000 solved my issue. I am able to join two 1-billion-row tables in 3 minutes.

How to control spark.sql.shuffle.partitions per query

2015-09-23 Thread tridib
Tridib
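The archive preview keeps only the signature, but the question in the thread title admits a short hedged sketch: the setting is session-scoped in Spark 1.x, so "per query" control amounts to resetting it between sql() calls.

  // Raise shuffle parallelism for a heavy query, then restore the default.
  sqlContext.setConf("spark.sql.shuffle.partitions", "2000")
  val joined = sqlContext.sql(heavyJoinSql)   // heavyJoinSql: your join statement
  sqlContext.setConf("spark.sql.shuffle.partitions", "200")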

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread tridib
By skewed, did you mean it's not distributed uniformly across partitions? All of my columns are strings of almost the same size, i.e. id1,field11,field12 and id2,field21,field22.

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-21 Thread tridib
Did you get any solution to this? I am getting the same issue.

RE: Official Docker container for Spark

2015-05-29 Thread Tridib Samanta
Thanks all for your replies. I was evaluating which one fits best for me. I picked epahomov/docker-spark from the Docker registry and it suffices my need. Thanks Tridib

Official Docker container for Spark

2015-05-21 Thread tridib
Tridib

RE: HBase HTable constructor hangs

2015-04-30 Thread Tridib Samanta
Does the hbase release you're using have the following fix? HBASE-8 non-environment-variable solution for IllegalAccessError. Cheers. On Tue, Apr 28, 2015 at 10:47 PM, Tridib Samanta tridib.sama...@live.com wrote: I turned on TRACE and I see a lot of the following exception

RE: HBase HTable constructor hangs

2015-04-29 Thread Tridib Samanta
(HConnectionManager.java:1054)
 at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
 at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:326)
 at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:192)
Thanks Tridib

Re: HBase HTable constructor hangs

2015-04-28 Thread tridib
I am having exactly the same issue. I am running hbase and spark in docker containers.

RE: HBase HTable constructor hangs

2015-04-28 Thread Tridib Samanta
On Tue, Apr 28, 2015 at 7:12 PM, tridib tridib.sama...@live.com wrote: I am having exactly the same issue. I am running hbase and spark in docker containers.

RE: HBase HTable constructor hangs

2015-04-28 Thread Tridib Samanta
If I run the spark-job jar standalone and execute the HBase client from a main method, it works fine. The same client is unable to connect (hangs) when the jar is distributed in Spark. Thanks Tridib

spark sql median and standard deviation

2015-03-04 Thread tridib
Hello, is there a built-in function for getting the median and standard deviation in Spark SQL? Currently I am converting the SchemaRDD to a DoubleRDD and calling doubleRDD.stats(), but it still does not have the median. What is the most efficient way to get the median? Thanks Regards Tridib
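A hedged sketch of one exact (if expensive) approach, building on the DoubleRDD the post already has: stats() covers count, mean, and standard deviation, and a full sort plus indexed lookup yields the median. Spark 1.2-era imports assumed.

  import org.apache.spark.SparkContext._   // RDD implicits on pre-1.3 versions
  import org.apache.spark.rdd.RDD

  def medianAndStats(values: RDD[Double]) = {
    val stats = values.stats()             // count, mean, stdev, min, max
    val indexed = values.sortBy(identity).zipWithIndex().map(_.swap)
    val n = indexed.count()
    val median =
      if (n % 2 == 1) indexed.lookup(n / 2).head
      else (indexed.lookup(n / 2 - 1).head + indexed.lookup(n / 2).head) / 2.0
    (median, stats)
  }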

Re: Running spark function on parquet without sql

2015-02-27 Thread tridib
Somehow my posts are not getting accepted, and replies are not visible here, but I got the following reply from Zhan. From Zhan Zhang's reply: yes, I still get parquet's advantage. My next question is: if I operate on the SchemaRDD, will I get the advantage of Spark SQL's in-memory columnar store?
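For the question in the thread title, a hedged sketch of running plain RDD operations over parquet-backed data without SQL (path and column position are illustrative): a SchemaRDD is also an RDD[Row], so ordinary transformations apply directly.

  val rows = sqlContext.parquetFile("/path/data.parquet")  // schema comes from parquet
  val ids = rows.map(row => row.getString(0))              // column 0, illustrative
  ids.take(5).foreach(println)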

RE: group by order by fails

2015-02-27 Thread Tridib Samanta
Give an alias to the count in the select clause and use that alias in the order by clause. On Wed, Feb 25, 2015 at 11:17 PM, Tridib Samanta tridib.sama...@live.com wrote: Actually I just realized, I am using 1.2.0. Thanks Tridib

Running spark function on parquet without sql

2015-02-26 Thread tridib
Regards Tridib

RE: spark sql: join sql fails after sqlCtx.cacheTable()

2015-02-25 Thread tridib
Using HiveContext solved it.

RE: group by order by fails

2015-02-25 Thread Tridib Samanta
Actually I just realized, I am using 1.2.0. Thanks Tridib. From: ak...@sigmoidanalytics.com: Which version of Spark are you running? It seems there was a similar Jira

group by order by fails

2015-02-25 Thread tridib
Hi, I need to find the top 10 best-selling samples, so the query looks like: select s.name, count(s.name) from sample s group by s.name order by count(s.name). This query fails with the following error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree: Sort [COUNT(name#0) ASC], true
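A hedged sketch of the aliasing workaround suggested later in the thread: name the aggregate once in the select list and sort on the alias (the DESC and LIMIT 10 are added here to match the stated top-10 goal).

  val top10 = sqlContext.sql("""
    SELECT s.name, COUNT(s.name) AS cnt
    FROM sample s
    GROUP BY s.name
    ORDER BY cnt DESC
    LIMIT 10
  """)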

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread tridib
I am experimenting with two files and trying to generate 1 parquet file.
public class CompactParquetGenerator implements Serializable {
  public void generateParquet(JavaSparkContext sc, String jsonFilePath, String parquetPath) {
    //int MB_128 = 128*1024*1024;

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread tridib
public void generateParquet(JavaSparkContext sc, String jsonFilePath, String parquetPath) {
  //int MB_128 = 128*1024*1024;
  //sc.hadoopConfiguration().setInt("dfs.blocksize", MB_128);
  //sc.hadoopConfiguration().setInt("parquet.block.size", MB_128);
  JavaSQLContext

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread tridib
Ohh... how could I miss that. :( Thanks!

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread tridib
Thanks Michael, it worked like a charm! I have a few more questions: 1. Is there a way to control the size of the parquet files? 2. Which method do you recommend: coalesce(n, true), coalesce(n, false), or repartition(n)? Thanks Regards Tridib

Control number of parquet generated from JavaSchemaRDD

2014-11-24 Thread tridib
sc.hadoopConfiguration().setInt("dfs.blocksize", MB_128);
sc.hadoopConfiguration().setInt("parquet.block.size", MB_128);
No luck. Is there a way to control the size/number of the parquet files generated? Thanks Tridib
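A hedged sketch of the coalesce approach discussed upthread (Spark 1.1-era API, reusing the method parameters from the snippet above): the number of output files follows the number of partitions, so coalesce the SchemaRDD before saving rather than tuning block sizes.

  val schemaRdd = sqlContext.jsonFile(jsonFilePath)
  // shuffle = true redistributes rows evenly; shuffle = false only merges partitions.
  schemaRdd.coalesce(1, shuffle = true).saveAsParquetFile(parquetPath)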

allocating different memory to different executor for same application

2014-11-21 Thread tridib
Hello Experts, I have 5 worker machines with different amounts of RAM. Is there a way to configure each with a different executor memory? Currently I see that every worker spins up 1 executor with the same amount of memory. Thanks Regards Tridib

spark-sql broken

2014-11-21 Thread tridib
-Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests Is there anything I am missing? Thanks Tridib

sum/avg group by specified ranges

2014-11-18 Thread tridib
of Spark SQL on top of a parquet file. Any suggestions? Thanks Regards Tridib
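The preview is truncated, but the title's ask (sum/avg grouped by specified ranges) admits a hedged sketch: a CASE expression labels each row's bucket and the aggregate groups on that label. Table, column, and range boundaries are illustrative; assumes a parser with CASE WHEN support (HiveContext in this era).

  val byRange = sqlContext.sql("""
    SELECT CASE WHEN amount < 100 THEN '0-99'
                WHEN amount < 1000 THEN '100-999'
                ELSE '1000+' END AS bucket,
           SUM(amount) AS total,
           AVG(amount) AS average
    FROM measurements
    GROUP BY CASE WHEN amount < 100 THEN '0-99'
                  WHEN amount < 1000 THEN '100-999'
                  ELSE '1000+' END
  """)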

spark sql - save to Parquet file - Unsupported datatype TimestampType

2014-11-11 Thread tridib
$class.saveAsParquetFile(SchemaRDDLike.scala:76)
 at org.apache.spark.sql.api.java.JavaSchemaRDD.saveAsParquetFile(JavaSchemaRDD.scala:42)
Thanks Regards Tridib
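A hedged workaround sketch for the error in the title: the 1.1-era parquet writer did not handle TimestampType, so cast the column to a supported type before saving (table and column names illustrative).

  val noTimestamps = sqlContext.sql("SELECT id, CAST(ts AS STRING) AS ts FROM events")
  noTimestamps.saveAsParquetFile("/path/events.parquet")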

sql - group by on UDF not working

2014-11-07 Thread Tridib Samanta
$.launch(SparkSubmit.scala:353)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Thanks Regards Tridib

Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread tridib
Help please!

RE: spark sql: join sql fails after sqlCtx.cacheTable()

2014-11-06 Thread Tridib Samanta
That entry seems to have slain the compiler. Shall I replay your session? I can re-run each line except the last one. Thanks Tridib

RE: Unable to use HiveContext in spark-shell

2014-11-06 Thread Tridib Samanta
Shall I replay your session? I can re-run each line except the last one. [y/n] Thanks Tridib. From: terry@smartfocus.com: What version of Spark are you using?

Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread tridib
Yes, I have the org.apache.hadoop.hive package in the spark assembly.

Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread tridib
I built spark-1.1.0 on a fresh new machine and this issue is gone! Thank you all for your help. Thanks Regards Tridib

Unable to use HiveContext in spark-shell

2014-11-05 Thread tridib
compiling HiveContext.class. That entry seems to have slain the compiler. Shall I replay your session? I can re-run each line except the last one. [y/n] Thanks Tridib

spark sql create nested schema

2014-11-04 Thread tridib
= true)
 |    |-- State: string (nullable = true)
 |    |-- Hobby: string (nullable = true)
 |    |-- Zip: string (nullable = true)
How do I create a StructField of StructType? I think that's what the root is. Thanks Regards Tridib

StructField of StructType

2014-11-04 Thread tridib
How do I create a StructField of StructType? I need to create a nested schema.
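A hedged sketch answering both entries above: nesting is expressed by a StructField whose dataType is itself a StructType, the same shape printSchema shows as "root". The inner field names follow the schema printed in the previous entry; the outer names and the modern import path (org.apache.spark.sql.types; aliased under org.apache.spark.sql in 1.1) are assumptions.

  import org.apache.spark.sql.types._

  val address = StructType(Seq(
    StructField("State", StringType, nullable = true),
    StructField("Hobby", StringType, nullable = true),
    StructField("Zip", StringType, nullable = true)))

  val schema = StructType(Seq(
    StructField("Name", StringType, nullable = true),
    StructField("Address", address, nullable = true)))  // a StructField of StructType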

Re: submit query to spark cluster using spark-sql

2014-10-24 Thread tridib
Figured it out: spark-sql --master spark://sparkmaster:7077

hive timestamp column always returns null

2014-10-22 Thread tridib
:00, 2013-11-11T00:00:00, 2012-11-11T00:00:00Z. When I query using select * from date_test, it returns: NULL NULL NULL. Could you please help me resolve this issue? Thanks Tridib
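A hedged sketch of the likely cause and one workaround: Hive's TIMESTAMP text parser expects "yyyy-MM-dd HH:mm:ss[.fffffffff]", so ISO-8601 values with a 'T' separator (or trailing 'Z') like the ones above read back as NULL. One option is to declare the column as STRING and convert in the query; ts_raw is an illustrative column name, and regexp_replace/trim require a HiveContext.

  val ts = sqlContext.sql("""
    SELECT CAST(trim(regexp_replace(ts_raw, '[TZ]', ' ')) AS TIMESTAMP) AS ts
    FROM date_test
  """)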

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread tridib
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val personPath = "/hdd/spark/person.json"
val person = sqlContext.jsonFile(personPath)
person.printSchema()
person.registerTempTable("person")
val addressPath = "/hdd/spark/address.json"
val address = sqlContext.jsonFile(addressPath)

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread tridib
Hmm... I thought HiveContext would only work if Hive is present. I am curious to know when to use HiveContext and when to use SQLContext. Thanks Regards Tridib

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread tridib
Thanks for pointing that out.

spark sql: sqlContext.jsonFile date type detection and performance

2014-10-21 Thread tridib
Any help or comments?

Re: spark sql: sqlContext.jsonFile date type detection and performance

2014-10-21 Thread tridib
Yes, I am unable to get jsonFile() to detect the date type automatically from the JSON data.

spark sql: timestamp in json - fails

2014-10-20 Thread tridib
= sqlCtx.jsonFile(path, createStructType());
sqlCtx.registerRDDAsTable(test, "test");
execSql(sqlCtx, "select * from test", 1);
}
The input file has a single record: {"timestamp":"2014-10-10T01:01:01"}
Thanks Tridib

Spark SQL: sqlContext.jsonFile date type detection and performance

2014-10-20 Thread tridib
Function and creating a SchemaRDD from the parsed JavaRDD. Is there any performance impact from not using the inbuilt jsonFile()? Thanks Tridib

Re: spark sql: timestamp in json - fails

2014-10-20 Thread tridib
Stack trace for my second case:
2014-10-20 23:00:36,903 ERROR [Executor task launch worker-0] executor.Executor (Logging.scala:logError(96)) - Exception in task 0.0 in stage 0.0 (TID 0)
scala.MatchError: TimestampType (of class org.apache.spark.sql.catalyst.types.TimestampType$)
 at
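A hedged sketch of the usual fallback for this MatchError: declare the field as StringType in the explicit schema instead of TimestampType, and convert after loading. This is a Scala equivalent of the Java snippet upthread; `path` is as in that snippet, and the modern types import is an assumption (see the StructType note earlier in this archive).

  import org.apache.spark.sql.types._

  val schema = StructType(Seq(StructField("timestamp", StringType, nullable = true)))
  val test = sqlContext.jsonFile(path, schema)  // avoids the TimestampType match in 1.1's JSON handling
  test.registerTempTable("test")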

RE: spark sql: timestamp in json - fails

2014-10-20 Thread tridib
Spark 1.1.0

spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-20 Thread tridib
Hello Experts, I have two tables built using jsonFile(). I can successfully run a join query on these tables, but once I cacheTable(), every join query fails. Here is the stack trace: java.lang.NullPointerException at
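A hedged sketch of the resolution reported in the RE: entries above ("Using HiveContext solved it"): HiveContext needs a Hive installation only for actual Hive tables, so it can stand in for SQLContext here. The paths come from the repro earlier in this thread.

  import org.apache.spark.sql.hive.HiveContext

  val sqlContext = new HiveContext(sc)
  val person = sqlContext.jsonFile("/hdd/spark/person.json")
  person.registerTempTable("person")
  sqlContext.cacheTable("person")
  // join queries against other cached temp tables now resolve correctly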