ISDATE Function
Hi everyone, MS SQL Server supports the ISDATE() function, which is used to find out whether a column value is a valid date or not. Is there any way to check whether column values are dates in Spark SQL? Regards, Ravi
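Spark SQL has no built-in ISDATE(), but one possible workaround (a sketch only, using a hypothetical table `events` with a string column `event_date`) relies on CAST returning NULL for strings that cannot be parsed as dates:

```sql
-- Hypothetical table and column names. In Spark SQL, casting an
-- unparseable string to DATE yields NULL, so a NULL check
-- approximates MS SQL Server's ISDATE().
SELECT event_date,
       CASE WHEN CAST(event_date AS DATE) IS NOT NULL THEN 1 ELSE 0 END AS is_date
FROM events;
```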
Pagination in Spark SQL
Hi everyone, Can you please share an optimized query for pagination in Spark SQL? MS SQL Server supports the OFFSET method for selecting a specific range of rows. Please find the following query: Select BusinessEntityID, [FirstName], [LastName], [JobTitle] from HumanResources.vEmployee Order By BusinessEntityID OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY. Is the OFFSET method supported in Spark SQL? Kindly share any useful details. Regards, Ravi
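Spark SQL 1.x does not support OFFSET ... FETCH, but a ROW_NUMBER()-based rewrite can emulate it. A minimal sketch, assuming the vEmployee data has been registered as a table in Spark (note that window functions in Spark 1.4 require a HiveContext):

```sql
-- Emulates OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY: number the rows,
-- then keep rows 11-20. Table and column names follow the query above.
SELECT BusinessEntityID, FirstName, LastName, JobTitle
FROM (
  SELECT BusinessEntityID, FirstName, LastName, JobTitle,
         ROW_NUMBER() OVER (ORDER BY BusinessEntityID) AS rn
  FROM vEmployee
) numbered
WHERE rn > 10 AND rn <= 20;
```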
Having Clause with VARIANCE and STDDEV
Hi, An exception is thrown when using a HAVING clause with VARIANCE or STDDEV. It works perfectly with other aggregate functions (like SUM, COUNT, MIN, MAX): SELECT SUM(1) AS `sum_number_of_records_ok` FROM `some_db`.`some_table` `some_table` GROUP BY 1 HAVING (STDDEV(1) > 0) SELECT SUM(1) AS `sum_number_of_records_ok` FROM `some_db`.`some_table` `some_table` GROUP BY 1 HAVING (VARIANCE(1) > 0) Could you please share any other way of writing this kind of query? Regards, Ravi
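One way to sidestep the HAVING limitation is to compute the aggregate in a subquery and filter on its alias in an outer WHERE. A sketch only, mirroring the shape of the queries above:

```sql
-- Move STDDEV out of HAVING: aggregate in the inner query, filter on
-- the alias outside. Table name and GROUP BY 1 are kept from the
-- original query.
SELECT sum_number_of_records_ok
FROM (
  SELECT SUM(1) AS sum_number_of_records_ok,
         STDDEV(1) AS sd
  FROM `some_db`.`some_table`
  GROUP BY 1
) t
WHERE t.sd > 0;
```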
Exception in spark
Hi all, I got an exception like “org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object” when using some queries with WHERE conditions. I am using Spark version 1.4.0, but the same query works perfectly in Hive. Please refer to the following query; I have applied date/time casting in the WHERE clause. Could you please help to resolve this exception?

SELECT from_unixtime(unix_timestamp(`leavedetails`.`created`,'MMdd'),'/MM/dd') AS `leavedetails_created`
FROM `default`.`leavedetails` AS leavedetails
WHERE ((from_unixtime(unix_timestamp(`leavedetails`.`created`,'MMdd'),'/MM/dd') IN ('2015/01/01','2015/01/02')) AND (NOT(`leavedetails`.`created` IS NULL)))
GROUP BY from_unixtime(unix_timestamp(`leavedetails`.`created`,'MMdd'),'/MM/dd')

Regards, Ravi
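One pattern that sometimes avoids analysis errors of this kind is to compute the formatted value once in a subquery and refer to its alias in the outer query. This is a sketch only; the 'yyyy-MM-dd' input pattern is an assumption for illustration, since the pattern in the original query is truncated:

```sql
-- Hypothetical rewrite: the from_unixtime expression is evaluated in
-- the inner query, so the outer WHERE and GROUP BY reference a plain
-- column instead of repeating the expression.
SELECT created_fmt AS leavedetails_created
FROM (
  SELECT from_unixtime(unix_timestamp(`created`, 'yyyy-MM-dd'), 'yyyy/MM/dd') AS created_fmt
  FROM `default`.`leavedetails`
  WHERE `created` IS NOT NULL
) t
WHERE created_fmt IN ('2015/01/01', '2015/01/02')
GROUP BY created_fmt;
```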
Re: Exception in spark
Hi Josh, Please ignore the stack trace in the last mail. Kindly refer to these exception details instead: {org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'Sheet1.Teams} Regards, Ravi

On Wed, Aug 12, 2015 at 1:34 AM, Ravisankar Mani rrav...@gmail.com wrote:
Hi Rosen, Thanks for your response. Kindly refer to the following query and stack trace. I have checked the same query in Hive, and it works perfectly. If I remove the IN condition from the WHERE clause, it works in Spark as well.

SELECT If(ISNOTNULL(SUM(`Sheet1`.`Runs`)), SUM(`Sheet1`.`Runs`), 0) AS `Sheet1_Runs`, `Sheet1`.`Teams` AS `Sheet1_Teams`
FROM default.Dashboard_Sheet1_20150812010131201 AS `Sheet1`
INNER JOIN (
  SELECT `Sheet1`.`Teams` AS `Teams_0`, If(ISNOTNULL(COUNT(`Sheet1`.`Teams`)), COUNT(`Sheet1`.`Teams`), 0) AS `Measure_0`
  FROM default.Dashboard_Sheet1_20150812010131201 AS `Sheet1`
  WHERE ((`Sheet1`.`Teams` IN ('Chennai Super Kings')) AND (`Sheet1`.`Teams` <> '')) AND (NOT(`Sheet1`.`Teams` IS NULL))
  GROUP BY `Sheet1`.`Teams`
  ORDER BY `Measure_0` DESC
  LIMIT 5
) `T_0` ON (`Sheet1`.`Teams` = `T_0`.`Teams_0`)
WHERE ((`Sheet1`.`Teams` IN ('Chennai Super Kings')) AND (`Sheet1`.`Teams` <> '')) AND (NOT(`Sheet1`.`Teams` IS NULL))
GROUP BY `Sheet1`.`Teams`

at Syncfusion.ThriftHive.Base.HqlCommand.ExecuteReader()
at Syncfusion.Dashboard.Base.Data.HiveQueryBuilder.ExecuteReaderQuery(String query, String connectionString) in f:\Back_To_Svn\source\base\dashboard.base\src\Data\HiveServer.cs:line 409
at Syncfusion.Dashboard.Base.Data.ServerDataProvider.GetTable(String tableName, List`1 schemaInfoCollection, List`1 originalSchemaInfoCollection, List`1 initialFilterList, List`1 viewerFilterList, Boolean isSelectQueryForServerModeFilterPopup, Boolean isNestedFilter) in f:\Back_To_Svn\source\base\dashboard.base\src\Data\Data.cs:line 536
at Syncfusion.Dashboard.Base.Engine.RelationalEngine.GetDataFromMainSource(List`1 schemaInfos, List`1 viewerFilterList) in f:\Back_To_Svn\source\base\dashboard.base\src\Engine\RelationDataEngine.cs:line 902
at DashboardService.DashboardService.ApplyFilterServerMode(RelationalEngine engine, String ReportName, Boolean Drilled, Boolean useDefaultProperties, Boolean isParamsColumn, SchemaInfo paramsSchema) in f:\Back_To_Svn\source\js\dashboardsamples\wcf\Common\DataProcessor.cs:line 1030
at DashboardService.DashboardService.ApplyFilter(RelationalEngine engine, String ReportName, Boolean Drilled, Boolean useDefaultProperties, Boolean isParamsColumn, SchemaInfo paramsSchema) in f:\Back_To_Svn\source\js\dashboardsamples\wcf\Common\DataProcessor.cs:line 1047
at DashboardService.DashboardService.GetProcessedControlData(List`1 elements, RelationalEngine dataEngine, DashboardItem control, RelationalReport currentReport) in f:\Back_To_Svn\source\js\dashboardsamples\wcf\Common\DataProcessor.cs:line 205
at DashboardService.DashboardService.GetGridControlData(DashboardItem controlObj, List`1 columnList, List`1 gridSchemaInfos, Dictionary`2 barData, Dictionary`2 formattedColumns, List`1 colourSaturationColumns) in f:\Back_To_Svn\source\js\dashboardsamples\wcf\Controls\DashboardGrid.cs:line 115
at DashboardService.DashboardService.GetGridData(DashboardItem controlObj, Boolean isLoadGrid) in f:\Back_To_Svn\source\js\dashboardsamples\wcf\Controls\DashboardGrid.cs:line 47
at DashboardService.DashboardService.ControlConfigHelper(DashboardItem control, Boolean IsLoad, DataFetcherGrid fn) in f:\Back_To_Svn\source\js\dashboardsamples\wcf\DashboardService.svc.cs:line 1269

Regards, Ravi

On Tue, Aug 11, 2015 at 11:53 PM, Josh Rosen rosenvi...@gmail.com wrote:
Can you share a query or stack trace? More information would make this question easier to answer.

On Tue, Aug 11, 2015 at 8:50 PM, Ravisankar Mani rrav...@gmail.com wrote:
Hi all, We got an exception like “org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object” when using some queries with WHERE conditions. I am using Spark version 1.4.0. Is this exception resolved in the latest Spark? Regards, Ravi
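Building on the observation above that the query succeeds once IN is removed, a possible workaround (a sketch only, not verified against Spark 1.4.0) is to apply the IN filter in an inner query so that the outer WHERE no longer combines IN with the other predicates:

```sql
-- Pre-filter with IN inside a subquery; the remaining predicates and
-- the aggregation run in the outer query. Table and column names are
-- taken from the query above.
SELECT `Teams` AS `Sheet1_Teams`, SUM(`Runs`) AS `Sheet1_Runs`
FROM (
  SELECT `Teams`, `Runs`
  FROM default.Dashboard_Sheet1_20150812010131201
  WHERE `Teams` IN ('Chennai Super Kings')
) `filtered`
WHERE `Teams` <> '' AND `Teams` IS NOT NULL
GROUP BY `Teams`;
```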
Hive Query (Top N)
Hi everyone, I have tried to achieve hierarchical (index-mode) top-N selection using a Spark query. It takes more time when I execute the following query:

Select SUM(`adventurepersoncontacts`.`contactid`) AS `adventurepersoncontacts_contactid`, `adventurepersoncontacts`.`fullname` AS `adventurepersoncontacts_fullname`
FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
JOIN (
  SELECT `F_0`.`fullname_0_0` AS `fullname_0_0`, ROW_NUMBER() OVER (ORDER BY `F_0`.`Measure_0` DESC) AS `R_N_0`
  FROM (
    SELECT `adventurepersoncontacts`.`fullname` AS `fullname_0_0`, SUM(adventurepersoncontacts.contactid) AS `Measure_0`
    FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
    GROUP BY `adventurepersoncontacts`.`fullname`
  ) `F_0`
) `T_0` ON ((`adventurepersoncontacts`.`fullname` = `T_0`.`fullname_0_0`) AND (`T_0`.`R_N_0` <= 5))
GROUP BY `adventurepersoncontacts`.`fullname`

In the query above, I set a row index for every group according to the aggregation type. A ROW_NUMBER() calculation directly over an aggregation, like ROW_NUMBER() OVER (ORDER BY SUM(column_name) DESC), is not supported in Spark, so I used a subquery:

SELECT `F_0`.`fullname_0_0` AS `fullname_0_0`, ROW_NUMBER() OVER (ORDER BY `F_0`.`Measure_0` DESC) AS `R_N_0`
FROM (
  SELECT `adventurepersoncontacts`.`fullname` AS `fullname_0_0`, SUM(adventurepersoncontacts.contactid) AS `Measure_0`
  FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
  GROUP BY `adventurepersoncontacts`.`fullname`
) `F_0`

Execution slows down when written this way. If I remove the main GROUP BY (i.e. remove the aggregation), execution is very fast. Refer to the following query.
I mention this query only to show that the slow execution appears when I include the main GROUP BY:

Select `adventurepersoncontacts`.`contactid` AS `adventurepersoncontacts_contactid`, `adventurepersoncontacts`.`fullname` AS `adventurepersoncontacts_fullname`
FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
JOIN (
  SELECT `F_0`.`fullname_0_0` AS `fullname_0_0`, ROW_NUMBER() OVER (ORDER BY `F_0`.`Measure_0` DESC) AS `R_N_0`
  FROM (
    SELECT `adventurepersoncontacts`.`fullname` AS `fullname_0_0`, SUM(adventurepersoncontacts.contactid) AS `Measure_0`
    FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
    GROUP BY `adventurepersoncontacts`.`fullname`
  ) `F_0`
) `T_0` ON ((`adventurepersoncontacts`.`fullname` = `T_0`.`fullname_0_0`) AND (`T_0`.`R_N_0` <= 5))

I have found another way to build this hierarchical query: I create a temp table that holds the inner subquery from the previous query.

CREATE external table IF NOT EXISTS temp_table AS
SELECT SUM(adventurepersoncontacts.contactid) as contactid, adventurepersoncontacts.fullname as fullname
FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
GROUP BY `adventurepersoncontacts`.`fullname`;

SELECT SUM(`adventurepersoncontacts`.`contactid`) AS `adventurepersoncontacts_contactid`, `adventurepersoncontacts`.`fullname` AS `adventurepersoncontacts_fullname`
FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
JOIN (
  SELECT `fullname` AS `fullname_0_0`, ROW_NUMBER() over (order by `contactid` desc) AS `R_N_0`
  FROM default.temp_table
) `T_0` ON ((`adventurepersoncontacts`.`fullname` = `T_0`.`fullname_0_0`) AND (`T_0`.`R_N_0` <= 2))
GROUP BY `adventurepersoncontacts`.`fullname`;

This works, but execution is still delayed by the CREATE TABLE ... AS SELECT statement. Please suggest any other optimized way to achieve this requirement. Regards, Ravi
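When only the top N aggregated groups themselves are needed, the ROW_NUMBER() join can often be dropped entirely in favor of ORDER BY plus LIMIT on the aggregate. A sketch against the same table; whether this matches the required "index mode" semantics should be verified:

```sql
-- Top 5 fullnames by summed contactid, without a self-join or window
-- function; LIMIT is applied after sorting the aggregated rows.
SELECT fullname AS adventurepersoncontacts_fullname,
       SUM(contactid) AS adventurepersoncontacts_contactid
FROM `default`.`adventurepersoncontacts`
GROUP BY fullname
ORDER BY adventurepersoncontacts_contactid DESC
LIMIT 5;
```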
UDFs in Spark
Hi everyone, As mentioned in the Spark SQL programming guide, Spark SQL supports Hive UDFs. I have built the UDFs in the Hive metastore, and they work perfectly over a Hive connection. But they do not work in Spark (java.lang.RuntimeException: Couldn't find function DATE_FORMAT). Could you please help with how to use these Hive UDFs in Spark? Regards, Ravisankar M R
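If the function is a custom Hive UDF, one approach that may help in Spark 1.x is to register it explicitly for the session through the HiveContext. A sketch only; the jar path, class name, and table/column names below are placeholders, not the actual UDF:

```sql
-- Ship the UDF jar to the session and declare the function; both the
-- path and the class name are hypothetical.
ADD JAR /path/to/custom-udfs.jar;
CREATE TEMPORARY FUNCTION DATE_FORMAT AS 'com.example.hive.udf.DateFormatUDF';
-- Afterwards the function should resolve like any built-in:
SELECT DATE_FORMAT(some_column, 'yyyy-MM-dd') FROM some_table;
```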
Spark performance
Hi everyone, I have planned to move from MS SQL Server to Spark. I am using around 50,000 to 1 lakh (100,000) records. Spark's performance is slow compared to MS SQL Server. Which is the better database (Spark or SQL Server) to store and retrieve around 50,000 to 100,000 records? Regards, Ravi
Performance slow
Hi everyone, More time is taken when I execute a query using (group by + order by) or (group by + cast + order by) together. Kindly refer to the following query. Could you please provide any solution regarding this performance issue?

SELECT If(ISNOTNULL(SUM(CAST(adventurepersoncontacts.contactid AS decimal(38,6)))), SUM(CAST(adventurepersoncontacts.contactid AS decimal(38,6))), 0) AS adventurepersoncontacts_contactid, adventurepersoncontacts.fullname as adventurepersoncontacts_fullname
FROM default.adventurepersoncontacts AS adventurepersoncontacts
GROUP BY adventurepersoncontacts.fullname
ORDER BY adventurepersoncontacts.fullname asc

Regards, Ravisankar M R
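One diagnostic step worth trying is to aggregate in a subquery and sort the (usually much smaller) aggregated result in an outer query, then compare the two plans with EXPLAIN. Whether this helps depends on the optimizer, so treat it as an experiment rather than a fix:

```sql
-- Sort only the aggregated rows; running EXPLAIN on both variants
-- shows whether the physical plans actually differ.
SELECT adventurepersoncontacts_contactid, adventurepersoncontacts_fullname
FROM (
  SELECT If(ISNOTNULL(SUM(CAST(contactid AS decimal(38,6)))),
            SUM(CAST(contactid AS decimal(38,6))), 0) AS adventurepersoncontacts_contactid,
         fullname AS adventurepersoncontacts_fullname
  FROM default.adventurepersoncontacts
  GROUP BY fullname
) t
ORDER BY adventurepersoncontacts_fullname ASC;
```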
Day of year
Hi everyone, I can't get 'day of year' when using a Spark SQL query. Can you suggest any way to achieve day of year? Regards, Ravi
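Spark added a built-in dayofyear function in 1.5; on earlier versions the Hive date functions can derive it. A sketch, assuming a hypothetical table events with a string column event_date in yyyy-MM-dd form:

```sql
-- 'D' is the SimpleDateFormat pattern for "day in year".
SELECT from_unixtime(unix_timestamp(event_date, 'yyyy-MM-dd'), 'D') AS day_of_year
FROM events;
-- Equivalent alternative: days since January 1st of the same year, plus one.
SELECT datediff(event_date, concat(year(event_date), '-01-01')) + 1 AS day_of_year
FROM events;
```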
Spark SQL - Setting YARN Classpath for primordial class loader
Hi, The response in the thread below, which makes yarn-client mode work by adding the JDBC driver JAR to spark.{driver,executor}.extraClassPath, works fine. http://mail-archives.us.apache.org/mod_mbox/spark-user/201504.mbox/%3CCAAOnQ7vHeBwDU2_EYeMuQLyVZ77+N_jDGuinxOB=sff2lkc...@mail.gmail.com%3E But for yarn-cluster mode I am still getting the error below: ERROR yarn.ApplicationMaster: User class threw exception: oracle.jdbc.OracleDriver Is there anything more that needs to be done?
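One thing that may explain the difference (a sketch with placeholder paths and application name): in yarn-cluster mode the driver runs on a cluster node, so spark.driver.extraClassPath must name a path that exists there. Shipping the driver JAR with --jars may avoid pre-installing it on every node:

```shell
# Placeholder paths; adjust for the actual environment. --jars uploads
# ojdbc6.jar to the YARN containers' working directories, and the
# extraClassPath entries then reference the shipped copy by file name.
spark-submit \
  --master yarn-cluster \
  --jars /local/path/ojdbc6.jar \
  --conf spark.driver.extraClassPath=ojdbc6.jar \
  --conf spark.executor.extraClassPath=ojdbc6.jar \
  your-app.jar
```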
Help with Spark SQL Hash Distribution
I am trying to distribute a table using a particular column, which is the key that I’ll be using to perform join operations on the table. Is it possible to do this with Spark SQL? I checked the partitionBy() method for RDDs, but I am not sure how to specify which column is the key. Can anyone give an example? Thanks, Mani Graduate Student, Department of Computer Science, Virginia Tech
Question on Spark SQL performance of Range Queries on Large Datasets
Hi, I am a graduate student from Virginia Tech (USA) pursuing my Masters in Computer Science. I’ve been researching parallel and distributed databases and their performance when running range queries involving simple joins and GROUP BY on large datasets. As part of my research, I tried evaluating the query performance of Spark SQL on my data set. It would be really great if you could confirm the numbers that I get from Spark SQL. The following is the type of query I am running:

Table 1: 22,000,483 records
Table 2: 10,173,311 records

Query: SELECT b.x, count(b.y) FROM Table1 a, Table2 b WHERE a.y = b.y AND a.z = '' GROUP BY b.x ORDER BY b.x

Total running time: 4 worker nodes: 177.68s; 8 worker nodes: 186.72s

I am using Apache Spark 1.3.0 with the default configuration. Is the query running time reasonable? Is it because the non-availability of indexes increases the query run time? Can you please clarify? Thanks, Mani Graduate Student, Department of Computer Science, Virginia Tech