Re: [SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-12 Thread Jacek Laskowski
visualization from the Spark UI when running a SparkSQL. (e.g., under > the link > node:18088/history/application_1663600377480_62091/stages/stage/?id=1=0). > > However, I have trouble extracting the WholeStageCodegen ids from the DAG > visualization via the RESTAPIs. Is there any othe

Re: [SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-11 Thread Chitral Verma
Try explain codegen on your DF and then parse the string. On Fri, 7 Apr, 2023, 3:53 pm Chenghao Lyu, wrote: > Hi, > > The detailed stage page shows the involved WholeStageCodegen Ids in its > DAG visualization from the Spark UI when running a SparkSQL. (e.g., under > the li
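
For reference, a minimal Scala sketch of that suggestion (the DataFrame below is a made-up placeholder; explain(mode) requires Spark 3.x):

    // Print the generated code; the header lists the WholeStageCodegen subtrees and their ids.
    val df = spark.range(10).selectExpr("id", "id * 2 AS doubled")   // placeholder DataFrame
    df.explain("codegen")      // same output as SQL's EXPLAIN CODEGEN
    df.explain("formatted")    // the "*(1)", "*(2)", ... markers are the codegen stage ids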

[SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-07 Thread Chenghao Lyu
Hi, The detailed stage page shows the involved WholeStageCodegen Ids in its DAG visualization from the Spark UI when running a SparkSQL. (e.g., under the link node:18088/history/application_1663600377480_62091/stages/stage/?id=1=0). However, I have trouble extracting the WholeStageCodegen ids

Re: Adding OpenSearch as a secondary index provider to SparkSQL

2023-03-24 Thread Mich Talebzadeh
h loss, damage or destruction. On Fri, 24 Mar 2023 at 07:03, Anirudha Jadhav wrote: > Hello community, wanted your opinion on this implementation demo. > > / support for Materialized views, skipping indices and covered indices > with bloom filter optimizations with opensearch via

Adding OpenSearch as a secondary index provider to SparkSQL

2023-03-24 Thread Anirudha Jadhav
Hello community, wanted your opinion on this implementation demo. / support for Materialized views, skipping indices and covered indices with bloom filter optimizations with opensearch via SparkSQL https://github.com/opensearch-project/sql/discussions/1465 ( see video with voice over ) Ani

Re: [New Project] sparksql-ml : Distributed Machine Learning using SparkSQL.

2023-02-27 Thread Russell Jurney
; datasyndrome.com Book a time on Calendly <https://calendly.com/rjurney_personal/30min> On Mon, Feb 27, 2023 at 10:16 AM Chitral Verma wrote: > Hi All, > I worked on this idea a few years back as a pet project to bridge > *SparkSQL* and *SparkML* and empower anyone to implemen

Fwd: [New Project] sparksql-ml : Distributed Machine Learning using SparkSQL.

2023-02-27 Thread Chitral Verma
Hi All, I worked on this idea a few years back as a pet project to bridge *SparkSQL* and *SparkML* and empower anyone to implement production grade, distributed machine learning over Apache Spark as long as they have SQL skills. In principle the idea works exactly like Google's BigQueryML

Re: [pyspark/sparksql]: How to overcome redundant/repetitive code? Is a for loop over an sql statement with a variable a bad idea?

2023-01-06 Thread Sean Owen
Right, nothing wrong with a for loop here. Seems like just the right thing. On Fri, Jan 6, 2023, 3:20 PM Joris Billen wrote: > Hello Community, > I am working in pyspark with sparksql and have a very similar very complex > list of dataframes that Ill have to execute several time
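
A minimal Scala sketch of the loop being endorsed here (table, model and column names are illustrative; the same pattern applies in pyspark):

    // Run one parameterized query per model, varying only the source table and a filter value.
    val models = Seq(("model_a", "sales_a", 10), ("model_b", "sales_b", 20))
    val results = models.map { case (model, table, threshold) =>
      spark.sql(s"SELECT '$model' AS model, COUNT(*) AS cnt FROM $table WHERE amount > $threshold")
    }
    results.reduce(_ union _).show()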

[pyspark/sparksql]: How to overcome redundant/repetitive code? Is a for loop over an sql statement with a variable a bad idea?

2023-01-06 Thread Joris Billen
Hello Community, I am working in pyspark with sparksql and have a very similar, very complex list of dataframes that I'll have to execute several times for all the “models” I have. Suppose the code is exactly the same for all models, only the table it reads from and some values in the where

Re: Can we upload a csv dataset into Hive using SparkSQL?

2022-12-13 Thread Artemis User
Your DDL statement doesn't look right.  You may want to check the Spark SQL Reference online for how to create table in Hive format (https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-hiveformat.html). You should be able to populate the table directly using CREATE by
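
A hedged sketch of the Hive-format DDL that page describes, followed by a LOAD DATA of the CSV (table name, columns and path are made up; a Hive-enabled SparkSession is assumed):

    // Create a Hive-format table whose row format matches the CSV, then load the file into it.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS people (name STRING, age INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
    """)
    spark.sql("LOAD DATA LOCAL INPATH '/tmp/people.csv' INTO TABLE people")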

Can we upload a csv dataset into Hive using SparkSQL?

2022-12-10 Thread sam smith
Hello, I want to create a table in Hive and then load a CSV file's content into it, all by means of Spark SQL. I saw in the docs the example with the .txt file, BUT can we do instead something like the following to accomplish what I want?: String warehouseLocation = new

Re: external table with parquet files: problem querying in sparksql since data is stored as integer while hive schema expects a timestamp

2022-07-24 Thread Gourav Sengupta
Hi, please try to query the table directly by loading the hive metastore (we can do that quite easily in AWS EMR, but we can do things quite easily with everything in AWS), rather than querying the s3 location directly. Regards, Gourav On Wed, Jul 20, 2022 at 9:51 PM Joris Billen wrote: >

external table with parquet files: problem querying in sparksql since data is stored as integer while hive schema expects a timestamp

2022-07-20 Thread Joris Billen
Hi, below sounds like something that someone will have experienced... I have external tables of parquet files with a hive table defined on top of the data. I don't manage/know the details of how the data lands. For some tables there are no issues when querying through spark. But for others there is an

Re: Using Avro file format with SparkSQL

2022-02-17 Thread Artemis User
Please try these two corrections: 1. The --packages isn't the right command line argument for spark-submit.  Please use --conf spark.jars.packages=your-package to specify Maven packages or define your configuration parameters in the spark-defaults.conf file 2. Please check the version

RE: Re: Using Avro file format with SparkSQL

2022-02-14 Thread Morven Huang
Hi Steve, You’re correct about the '--packages' option, seems my memory does not serve me well :) On 2022/02/15 07:04:27 Stephen Coy wrote: > Hi Morven, > > We use —packages for all of our spark jobs. Spark downloads the specified jar > and all of its dependencies from a Maven repository. >

Re: Using Avro file format with SparkSQL

2022-02-14 Thread Stephen Coy
Hi Morven, We use --packages for all of our spark jobs. Spark downloads the specified jar and all of its dependencies from a Maven repository. This means we never have to build fat or uber jars. It does mean that the Apache Ivy configuration has to be set up correctly though. Cheers, Steve C

RE: Using Avro file format with SparkSQL

2022-02-14 Thread Morven Huang
I wrote a toy spark job and ran it within my IDE; I get the same error if I don’t add spark-avro to my pom.xml. After putting the spark-avro dependency into my pom.xml, everything works fine. Another thing is, if my memory serves me right, the spark-submit option for extra jars is ‘--jars’, not

Re: Using Avro file format with SparkSQL

2022-02-11 Thread Gourav Sengupta
Hi Anna, Avro libraries should be inbuilt in SPARK in case I am not wrong. Any particular reason why you are using a deprecated or soon to be deprecated version of SPARK? SPARK 3.2.1 is fantastic. Please do let us know about your set up if possible. Regards, Gourav Sengupta On Thu, Feb 10,

Re: Using Avro file format with SparkSQL

2022-02-09 Thread frakass
Have you added the dependency in the build.sbt? Can you 'sbt package' the source successfully? regards frakass On 2022/2/10 11:25, Karanika, Anna wrote: For context, I am invoking spark-submit and adding arguments --packages org.apache.spark:spark-avro_2.12:3.2.0.
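
For reference, a minimal Scala sketch pulling in spark-avro either via the spark.jars.packages setting or via --packages on spark-submit (the artifact version must match the Spark/Scala build; 3.2.0 / 2.12 and the paths are only examples):

    // Resolve spark-avro from Maven at startup, then read/write with the "avro" format.
    // Equivalent CLI form: spark-submit --packages org.apache.spark:spark-avro_2.12:3.2.0 ...
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("avro-example")
      .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.2.0")
      .getOrCreate()

    val df = spark.read.format("avro").load("/tmp/input.avro")
    df.write.format("avro").save("/tmp/output")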

Using Avro file format with SparkSQL

2022-02-09 Thread Karanika, Anna
Hello, I have been trying to use spark SQL’s operations that are related to the Avro file format, e.g., stored as, save, load, in a Java class but they keep failing with the following stack trace: Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source:

Re: SparkSQL vs Dataframe vs Dataset

2021-12-06 Thread yonghua
@spark" Envoyé: lundi 6 Décembre 2021 21:49 Objet : SparkSQL vs Dataframe vs Dataset   Hi Users, Is there any use case when we need to use SQL vs Dataframe vs Dataset? Is there any recommended approach or any advantage/performance gain over others? Thanks Rajat  

SparkSQL vs Dataframe vs Dataset

2021-12-06 Thread rajat kumar
Hi Users, Is there any use case when we need to use SQL vs Dataframe vs Dataset? Is there any recommended approach or any advantage/performance gain over others? Thanks Rajat

Re: [SparkSQL] Full Join Return Null Value For Function-Based Column

2021-01-18 Thread 刘 欢
Sorry, I know the reason. Closed. From: 刘 欢 Date: Monday, January 18, 2021, 1:39 PM To: "user@spark.apache.org" Subject: [SparkSQL] Full Join Return Null Value For Function-Based Column Hi All: Here I got two tables: Table A name num tom 2 jerry 3 jerry 4 null null Table B name sc

[SparkSQL] Full Join Return Null Value For Function-Based Column

2021-01-17 Thread 刘 欢
Hi All: Here I got two tables: Table A name num tom 2 jerry 3 jerry 4 null null Table B name score tom 12 jerry 10 jerry 8 null null When I use spark.sql() to get a result from A and B with SQL: select a.name as aName, a.date, b.name as bName from (

sparksql 2.4.0 java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log

2020-07-09 Thread Ivan Petrov
Hi there! I'm seeing this exception in Spark Driver log. Executor log stays empty. No exceptions, nothing. 8 tasks out of 402 failed with this exception. What is the right way to debug it? Thank you. I see that spark/jars -> minlog-1.3.0.jar is in driver classpath at least...

Re: sparksql 2.4.0 java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log

2020-07-09 Thread Ivan Petrov
spark/jars -> minlog-1.3.0.jar I see that jar is there. What do I do wrong? Thu, 9 Jul 2020 at 20:43, Ivan Petrov: > Hi there! > I'm seeing this exception in Spark Driver log. > Executor log stays empty. No exceptions, nothing. > 8 tasks out of 402 failed with this exception. > What is the

sparksql 2.4.0 java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log

2020-07-09 Thread Ivan Petrov
Hi there! I'm seeing this exception in Spark Driver log. Executor log stays empty. No exceptions, nothing. 8 tasks out of 402 failed with this exception. What is the right way to debug it? Thank you. java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log at

Re: pyspark(sparksql-v 2.4) cannot read hive table which is created

2020-03-16 Thread dominic kim
I solved the problem with the option below spark.sql ("SET spark.hadoop.metastore.catalog.default = hive") spark.sql ("SET spark.sql.hive.convertMetastoreOrc = false") -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

pyspark(sparksql-v 2.4) cannot read hive table which is created

2020-03-16 Thread dominic kim
I use the related Spark config values but it does not work, as below (it succeeded in spark 2.1.1): spark.hive.mapred.supports.subdirectories=true spark.hive.supports.subdirectories=true spark.mapred.input.dir.recursive=true spark.hive.mapred.supports.subdirectories=true And when I query, I also use related hive

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Mich Talebzadeh
> On 07.08.2019 at 08:50, Rishikesh Gawade < >>> rishikeshg1...@gmail.com> wrote: >>> >>> Hi. >>> I am using Spark 2.3.2 and Hive 3.1.0. >>> Even if I use parquet files the result would be the same, because after all >>> sparkSQL isn

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Patrick McCarthy
ions >> there? Can you share some code? >> >> Am 07.08.2019 um 08:50 schrieb Rishikesh Gawade > >: >> >> Hi. >> I am using Spark 2.3.2 and Hive 3.1.0. >> Even if i use parquet files the result would be same, because after all >> sparkSQL

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Rishikesh Gawade
Do you configure the same options > there? Can you share some code? > > Am 07.08.2019 um 08:50 schrieb Rishikesh Gawade >: > > Hi. > I am using Spark 2.3.2 and Hive 3.1.0. > Even if i use parquet files the result would be same, because after all > sparkSQL isn't able to d

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Jörn Franke
Do you use the HiveContext in Spark? Do you configure the same options there? Can you share some code? > Am 07.08.2019 um 08:50 schrieb Rishikesh Gawade : > > Hi. > I am using Spark 2.3.2 and Hive 3.1.0. > Even if i use parquet files the result would be same, because after

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Rishikesh Gawade
Hi. I am using Spark 2.3.2 and Hive 3.1.0. Even if I use parquet files the result would be the same, because after all sparkSQL isn't able to descend into the subdirectories over which the table is created. Could there be any other way? Thanks, Rishikesh On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-06 Thread Mich Talebzadeh
d.supports.subdirectories=TRUE* and > *mapred.input.dir.recursive**=TRUE*. > As a result of this, when i fire the simplest query of *select count(*) > from ExtTable* via the Hive CLI, it successfully gives me the expected > count of records in the table. > However, when i fire the same que

Hive external table not working in sparkSQL when subdirectories are present

2019-08-06 Thread Rishikesh Gawade
, it successfully gives me the expected count of records in the table. However, when i fire the same query via sparkSQL, i get count = 0. I think the sparkSQL isn't able to descend into the subdirectories for getting the data while hive is able to do so. Are there any configurations needed to be set
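
A commonly suggested workaround, not confirmed as the resolution in this thread: mirror the Hive settings on the Spark side and ask the input format to recurse into subdirectories (Scala sketch; ExtTable is the table named above):

    // Make the underlying input format descend into subdirectories before querying.
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
    spark.sql("SET mapred.input.dir.recursive=true")
    spark.sql("SET hive.mapred.supports.subdirectories=true")
    spark.sql("SELECT COUNT(*) FROM ExtTable").show()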

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-12 Thread Reynold Xin
; alemmontree@126.com ( >> alemmont...@126.com ) > wrote: >> >>> I have a question about the limit (biggest) of SQL's length that is >>> supported in SparkSQL. I can't find the answer in the documents of Spark. >>> >>> >>> Maybe Integer.MAX_VALUE or not? >>> >> >> > >

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-12 Thread Gourav Sengupta
he limit (biggest) of SQL's length that is >> supported in SparkSQL. I can't find the answer in the documents of Spark. >> >> Maybe Integer.MAX_VALUE or not? >> >> >

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-11 Thread Reynold Xin
< alemmont...@126.com > wrote: > > I have a question about the limit (biggest) of SQL's length that is > supported in SparkSQL. I can't find the answer in the documents of Spark. > > > Maybe Integer.MAX_VALUE or not? > > > >

Re: sparksql in sparkR?

2019-06-07 Thread Felix Cheung
This seems to be more a question about the spark-sql shell? I suggest you change the email title to get more attention. From: ya Sent: Wednesday, June 5, 2019 11:48:17 PM To: user@spark.apache.org Subject: sparksql in sparkR? Dear list, I am trying to use sparksql

sparksql in sparkR?

2019-06-06 Thread ya
Dear list, I am trying to use sparksql within my R, I am having the following questions, could you give me some advice please? Thank you very much. 1. I connect my R and spark using the library sparkR, probably some of the members here also are R users? Do I understand correctly that SparkSQL

[SparkSQL, user-defined Hadoop, K8s] Hadoop free spark on kubernetes => NoClassDefFound

2019-03-07 Thread Sommer Tobias
Hi all, we are having problems with using a custom hadoop lib in a spark image when running it on a kubernetes cluster while following the steps of the documentation. Details in the description below. Does anyone else had similar problems? Is there something missing in the setup below? Or

SparkSql query on a port and peocess queries

2019-01-15 Thread Soheil Pourbafrani
Hi, In my problem data is stored on both Database and HDFS. I create an application that according to the query, Spark load data, process the query and return the answer. I'm looking for a service that gets SQL queries and returns the answers (like Databases command line). Is there a way that my

Re: Need help with SparkSQL Query

2018-12-17 Thread Ramandeep Singh Nanda
You can use analytical functions in spark sql. Something like select * from (select id, row_number() over (partition by id order by timestamp ) as rn from root) where rn=1 On Mon, Dec 17, 2018 at 4:03 PM Nikhil Goyal wrote: > Hi guys, > > I have a dataframe of type Record (id: Long, timestamp:
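
The same idea expressed with the Scala DataFrame API (a sketch; `records` stands for the DataFrame described in the question below, and ordering ascending plus filtering on isValid picks the earliest valid record per id):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Keep the earliest valid record per id.
    val w = Window.partitionBy("id").orderBy(col("timestamp").asc)
    val earliestValid = records
      .filter(col("isValid"))
      .withColumn("rn", row_number().over(w))
      .filter(col("rn") === 1)
      .drop("rn")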

Re: Need help with SparkSQL Query

2018-12-17 Thread Patrick McCarthy
Untested, but something like the below should work: from pyspark.sql import functions as F from pyspark.sql import window as W (record .withColumn('ts_rank', F.dense_rank().over(W.Window.orderBy('timestamp').partitionBy("id"))) .filter(F.col('ts_rank')==1) .drop('ts_rank') ) On Mon, Dec 17,

Need help with SparkSQL Query

2018-12-17 Thread Nikhil Goyal
Hi guys, I have a dataframe of type Record (id: Long, timestamp: Long, isValid: Boolean, other metrics) Schema looks like this: root |-- id: long (nullable = true) |-- timestamp: long (nullable = true) |-- isValid: boolean (nullable = true) . I need to find the earliest valid record

Re: SparkSQL read Hive transactional table

2018-10-17 Thread Gourav Sengupta
- *From:* "Gourav Sengupta"; *Sent:* Tuesday, October 16, 2018, 6:35 PM *To:* "daily"; *Cc:* "user"; "dev"; *Subject:* Re: SparkSQL read Hive transactional table Hi, can I please ask which version of Hive and Spark are you using? Regards, Gourav Sengupta On Tue, Oct

Re: SparkSQL read Hive transactional table

2018-10-16 Thread daily
Hi, Spark version: 2.3.0 Hive version: 2.1.0 Best regards. -- -- From: "Gourav Sengupta"; Sent: Tuesday, October 16, 2018, 6:35 PM To: "daily"; Cc: "user"; "dev"; Subject: Re: SparkSQL read Hive

Re: SparkSQL read Hive transactional table

2018-10-16 Thread Gourav Sengupta
Hi, can I please ask which version of Hive and Spark are you using? Regards, Gourav Sengupta On Tue, Oct 16, 2018 at 2:42 AM daily wrote: > Hi, > > I use HCatalog Streaming Mutation API to write data to hive transactional > table, and then, I use SparkSQL to read data f

SparkSQL read Hive transactional table

2018-10-15 Thread daily
Hi, I use HCatalog Streaming Mutation API to write data to hive transactional table, and then, I use SparkSQL to read data from the hive transactional table. I get the right result. However, SparkSQL uses more time to read hive orc bucket transactional table, because SparkSQL

SparkSQL read Hive transactional table

2018-10-13 Thread wys372b
Hi, I use HCatalog Streaming Mutation API to write data to hive transactional table, and then, I use SparkSQL to read data from the hive transactional table. I get the right result. However, SparkSQL uses more time to read hive orc bucket transactional table, because SparkSQL reads all columns

SparkSQL read Hive transactional table

2018-10-12 Thread daily
Hi, I use HCatalog Streaming Mutation API to write data to hive transactional table, and then, I use SparkSQL to read data from the hive transactional table. I get the right result. However, SparkSQL uses more time to read hive orc bucket transactional table, because SparkSQL reads all columns

sparksql exception when using regexp_replace

2018-10-10 Thread 付涛
Hi, sparks: I am using sparksql to insert some values into a directory, the sql seems like this: insert overwrite directory '/temp/test_spark' ROW FORMAT DELIMITED FIELDS TERMINATED BY '~' select regexp_replace('a~b~c', '~', ''), 123456 however, some exceptions were thrown

Re: [SparkSQL] Count Distinct issue

2018-09-17 Thread kathleen li
Hi, I can't reproduce your issue: scala> spark.sql("select distinct * from dfv").show() ++++++++++++++++---+ | a| b| c| d| e| f| g| h| i| j| k| l| m| n| o| p|

Re: Is there any open source framework that converts Cypher to SparkSQL?

2018-09-16 Thread Matei Zaharia
GraphFrames (https://graphframes.github.io) offers a Cypher-like syntax that then executes on Spark SQL. > On Sep 14, 2018, at 2:42 AM, kant kodali wrote: > > Hi All, > > Is there any open source framework that converts Cypher to SparkS
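
For reference, GraphFrames exposes its Cypher-like queries through motif finding; a small Scala sketch (the vertex/edge data are made up, and the graphframes package must be on the classpath):

    import org.graphframes.GraphFrame

    // Vertices need an "id" column; edges need "src" and "dst".
    val vertices = spark.createDataFrame(Seq(("a", "Alice"), ("b", "Bob"))).toDF("id", "name")
    val edges = spark.createDataFrame(Seq(("a", "b", "follows"))).toDF("src", "dst", "relationship")
    val g = GraphFrame(vertices, edges)

    // Cypher-flavoured pattern: pairs of vertices connected by an edge.
    g.find("(a)-[e]->(b)").show()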

[SparkSQL] Count Distinct issue

2018-09-14 Thread Daniele Foroni
Hi all, I am having some troubles in doing a count distinct over multiple columns. This is an example of my data: ++++---+ |a |b |c |d | ++++---+ |null|null|null|1 | |null|null|null|2 | |null|null|null|3 | |null|null|null|4 | |null|null|null|5 |

Is there any open source framework that converts Cypher to SparkSQL?

2018-09-14 Thread kant kodali
Hi All, Is there any open source framework that converts Cypher to SparkSQL? Thanks!

Re: Where can I read the Kafka offsets in SparkSQL application

2018-07-24 Thread Gourav Sengupta
Hi, can you see whether using the option for checkPointLocation would work in case you are using structured streaming? Regards, Gourav Sengupta On Tue, Jul 24, 2018 at 12:30 PM, John, Vishal (Agoda) < vishal.j...@agoda.com.invalid> wrote: > > Hello all, > > > I have to read data from Kafka
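
A sketch of the Structured Streaming variant hinted at here: Kafka offsets are tracked under checkpointLocation, so each run resumes where the previous one stopped (broker, topic and paths are placeholders):

    // Offsets are stored in the checkpoint directory and picked up on restart.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "my-topic")
      .option("startingOffsets", "earliest")   // only consulted when no checkpoint exists yet
      .load()

    val query = stream.selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("parquet")
      .option("path", "/tmp/output")
      .option("checkpointLocation", "/tmp/checkpoints/kafka-job")
      .start()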

Where can I read the Kafka offsets in SparkSQL application

2018-07-24 Thread John, Vishal (Agoda)
Hello all, I have to read data from Kafka topic at regular intervals. I create the dataframe as shown below. I don’t want to start reading from the beginning on each run. At the same time, I don’t want to miss the messages between run intervals. val queryDf = sqlContext .read

Error on fetchin mass data from cassandra using SparkSQL

2018-05-28 Thread Soheil Pourbafrani
I tried to fetch some data from Cassandra using SparkSql. For small tables, all things go well but trying to fetch data from big tables I got the following error: java.lang.NoSuchMethodError: com.datastax.driver.core.ResultSet.fetchMoreResults()Lshade/com/datastax/spark/connector/google/common

Re: How to use disk instead of just InMemoryRelation when use JDBC datasource in SPARKSQL?

2018-04-12 Thread Takeshi Yamamuro
You want to use `Dataset.persist(StorageLevel.MEMORY_AND_DISK)`? On Thu, Apr 12, 2018 at 1:12 PM, Louis Hust <louis.h...@gmail.com> wrote: > We want to extract data from mysql, and calculate in sparksql. > The sql explain like below. > > > REGIONKEY#177,N_COMME
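
A minimal Scala sketch of that suggestion applied to a JDBC-backed DataFrame (connection details and table name are placeholders):

    import org.apache.spark.storage.StorageLevel

    val nation = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/tpch")
      .option("dbtable", "nation")
      .option("user", "app")
      .option("password", "secret")
      .load()

    // Spill cached partitions to local disk instead of keeping everything in memory.
    nation.persist(StorageLevel.MEMORY_AND_DISK)
    nation.count()   // materializes the cache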

How to use disk instead of just InMemoryRelation when use JDBC datasource in SPARKSQL?

2018-04-11 Thread Louis Hust
We want to extract data from mysql, and calculate in sparksql. The sql explain like below. REGIONKEY#177,N_COMMENT#178] PushedFilters: [], ReadSchema: struct<N_NATIONKEY:int,N_NAME:string,N_REGIONKEY:int,N_COMMENT:string> +- *(20) Sort [r_regionkey#203 ASC NULLS FIRST],

How to use disk instead of just InMemoryRelation when use JDBC datasource in SPARKSQL?

2018-04-10 Thread Louis Hust
We want to extract data from mysql, and calculate in sparksql. The sql explain like below. == Parsed Logical Plan == > 'Sort ['revenue DESC NULLS LAST], true > +- 'Aggregate ['n_name], ['n_name, 'SUM(('l_extendedprice * (1 - > 'l_discount))) AS revenue#329] >+- 'Filter (((

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-31 Thread Tin Vu
elect * from table_name"). 3. Hadoop 2.9.0 I am using the JDBC connector to Drill from Hive Metastore. SparkSQL is also connecting to the ORC database via Hive. Thanks so much! Tin On Sat, Mar 31, 2018 at 11:41 AM, Gourav Sengupta <gourav.sengu...@gmail.com > wrote: > Hi Tin, > > This so

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-31 Thread Gourav Sengupta
and different use cases. Have you tried using the JDBC connector to Drill from within SPARKSQL? Regards, Gourav Sengupta On Thu, Mar 29, 2018 at 1:03 AM, Tin Vu <tvu...@ucr.edu> wrote: > Hi, > > I am executing a benchmark to compare performance of SparkSQL, Apache > Drill and Presto. My

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Lalwani, Jayesh
ser@spark.apache.org" <user@spark.apache.org> Subject: Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto You are right. There are too much tasks was created. How can we reduce the number of tasks? On Thu, Mar 29, 2018, 7:44 AM Lalwani,

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Tin Vu
*Wednesday, March 28, 2018 at 8:04 PM > *To: *"user@spark.apache.org" <user@spark.apache.org> > *Subject: *[SparkSQL] SparkSQL performance on small TPCDS tables is very > low when compared to Drill or Presto > > > > Hi, > > > > I am execut

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Lalwani, Jayesh
UI. From: Tin Vu <tvu...@ucr.edu> Date: Wednesday, March 28, 2018 at 8:04 PM To: "user@spark.apache.org" <user@spark.apache.org> Subject: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto Hi, I am executing a benchmark

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Tin Vu
see that you don’t do anything in the > query and immediately return (similarly count might immediately return by > using some statistics). > > On 29. Mar 2018, at 02:03, Tin Vu <tvu...@ucr.edu> wrote: > > Hi, > > I am executing a benchmark to compare performance of SparkS

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Jörn Franke
<tvu...@ucr.edu> wrote: > > Hi, > > I am executing a benchmark to compare performance of SparkSQL, Apache Drill > and Presto. My experimental setup: > TPCDS dataset with scale factor 100 (size 100GB). > Spark, Drill, Presto have a same number of workers: 12. > Each worked ha

[SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Tin Vu
Hi, I am executing a benchmark to compare performance of SparkSQL, Apache Drill and Presto. My experimental setup: - TPCDS dataset with scale factor 100 (size 100GB). - Spark, Drill, Presto have the same number of workers: 12. - Each worker has the same allocated amount of memory: 4GB

Why SparkSQL changes the table owner when performing alter table opertations?

2018-03-12 Thread 张万新
Hi, When using spark.sql() to perform alter table operations I found that spark changes the table owner property to the execution user. Then I digged into the source code and found that in HiveClientImpl, the alterTable function will set the owner of table to the current execution user. Besides,

AM restart in another node makes SparkSQL job go into a state of feign death

2017-12-20 Thread Bang Xiao
I run "spark-sql --master yarn --deploy-mode client -f 'SQLs' " in shell, The application is stuck when the AM is down and restart in other nodes. It seems the driver wait for the next sql. Is this a bug?In my opinion,Either the application execute the failed sql or exit with a failure when

AM restart in another node makes SparkSQL jobs go into a state of feign death

2017-12-20 Thread Bang Xiao
I run "spark-sql --master yarn --deploy-mode client -f 'SQLs' " in shell, The application is stuck when the AM is down and restart in other nodes. It seems the driver wait for the next sql. Is this a bug?In my opinion,Either the application execute the failed sql or exit with a failure when

Re: SparkSQL not support CharType

2017-11-23 Thread Jörn Franke
Or bytetype depending on the use case > On 23. Nov 2017, at 10:18, Herman van Hövell tot Westerflier > wrote: > > You need to use a StringType. The CharType and VarCharType are there to > ensure compatibility with Hive and ORC; they should not be used anywhere

Re: SparkSQL not support CharType

2017-11-23 Thread Herman van Hövell tot Westerflier
You need to use a StringType. The CharType and VarCharType are there to ensure compatibility with Hive and ORC; they should not be used anywhere else. On Thu, Nov 23, 2017 at 4:09 AM, 163 wrote: > Hi, > when I use Dataframe with table schema, It goes wrong: > > val
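
Applied to the schema in the question below, the fix is simply to declare the column as a StringType (sketch):

    import org.apache.spark.sql.types._

    val test_schema = StructType(Array(
      StructField("id", IntegerType, false),
      StructField("flag", StringType, false),   // was CharType(1)
      StructField("time", DateType, false)))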

SparkSQL not support CharType

2017-11-22 Thread 163
Hi, when I use Dataframe with table schema, It goes wrong: val test_schema = StructType(Array( StructField("id", IntegerType, false), StructField("flag", CharType(1), false), StructField("time", DateType, false))); val df = spark.read.format("com.databricks.spark.csv")

Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-21 Thread Aakash Basu
017 6:58 PM, "Aakash Basu" <aakash.spark@gmail.com> wrote: >> >>> Hi all, >>> >>> I have a table which will have 4 columns - >>> >>> | Expression|filter_condition| from_clause| >>> group_by_columns|

Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-21 Thread Fernando Pereira
ondition| from_clause| >> group_by_columns| >> >> >> This file may have variable number of rows depending on the no. of KPIs I >> need to calculate. >> >> I need to write a SparkSQL program which will have to read this file and >> run eac

Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-20 Thread Aakash Basu
> > > This file may have variable number of rows depending on the no. of KPIs I > need to calculate. > > I need to write a SparkSQL program which will have to read this file and > run each line of queries dynamically by fetching each column value for a > particular row and cr

Dynamic data ingestion into SparkSQL - Interesting question

2017-11-20 Thread Aakash Basu
Hi all, I have a table which will have 4 columns - | Expression|filter_condition| from_clause| group_by_columns| This file may have variable number of rows depending on the no. of KPIs I need to calculate. I need to write a SparkSQL program which will have to read
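
One way to phrase the idea as a Scala sketch: read the four-column driver table, then build and run one query per row (the column names follow the description above; the file path is illustrative):

    // Each row of the driver file describes one KPI query.
    val kpis = spark.read.option("header", "true").csv("/tmp/kpi_definitions.csv")

    kpis.collect().foreach { row =>
      val sql = s"""SELECT ${row.getAs[String]("Expression")}
                    FROM ${row.getAs[String]("from_clause")}
                    WHERE ${row.getAs[String]("filter_condition")}
                    GROUP BY ${row.getAs[String]("group_by_columns")}"""
      spark.sql(sql).show()
    }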

Re: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-11-03 Thread ayan guha
p in the below please? > > Thanks, > Aakash. > > > -- Forwarded message -- > From: Aakash Basu <aakash.spark@gmail.com> > Date: Tue, Oct 31, 2017 at 9:17 PM > Subject: Regarding column partitioning IDs and names as per hierarchical > level

Re: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-11-03 Thread Jean Georges Perrin
> From: Aakash Basu <aakash.spark@gmail.com > <mailto:aakash.spark@gmail.com>> > Date: Tue, Oct 31, 2017 at 9:17 PM > Subject: Regarding column partitioning IDs and names as per hierarchical > level SparkSQL > To: user <user@spark.apache.org <mailto:us

Fwd: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-10-31 Thread Aakash Basu
Hey all, Any help in the below please? Thanks, Aakash. -- Forwarded message -- From: Aakash Basu <aakash.spark@gmail.com> Date: Tue, Oct 31, 2017 at 9:17 PM Subject: Regarding column partitioning IDs and names as per hierarchical level SparkSQL To: user

Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-10-31 Thread Aakash Basu
Hi all, I have to generate a table with Spark-SQL with the following columns - Level One Id: VARCHAR(20) NULL Level One Name: VARCHAR(50) NOT NULL Level Two Id: VARCHAR(20) NULL Level Two Name: VARCHAR(50) NULL Level Three Id: VARCHAR(20) NULL Level Three Name: VARCHAR(50) NULL Level Four

Re: Spark Streaming - Multiple Spark Contexts (SparkSQL) Performance

2017-10-01 Thread Gerard Maas
ue, I recursively process them as follows (below > code section will repeat in Question statement) > > stream.foreachRDD(rdd -> { > //process here - below two scenarions code is inserted here > > }); > > > *Question starts here:* > > Since I need to apply SparkSQL to r

Fwd: Spark Streaming - Multiple Spark Contexts (SparkSQL) Performance

2017-10-01 Thread Hammad
ConsumerStrategies.<String, String>Subscribe(*topics*, kafkaParams) ); when messages arrive in queue, I recursively process them as follows (below code section will repeat in Question statement) stream.foreachRDD(rdd -> { //process here - below two scenarions code is inserted her

hive2 query using SparkSQL seems wrong

2017-09-25 Thread Cinyoung Hur
Hi, I'm using hive 2.3.0, spark 2.1.1, and zeppelin 0.7.2. When I submit a query in the hive interpreter, it works fine. I could see exactly the same query in the zeppelin notebook and hiveserver2 web UI. However, when I submitted the query using sparksql, the query seemed wrong. For example, every columns

transactional data in sparksql

2017-07-31 Thread luohui20001
. There are several questions here: 1. To deal with this kind of transaction, what is the most sensible way? Does UDAF help? Or does sparksql provide transactional support? I remember that hive has some kind of support for transactions, like https://cwiki.apache.org/confluence/display/Hive/Hive

Re: SparkSQL to read XML Blob data to create multiple rows

2017-07-08 Thread Amol Talap
--+ > |Description| Title| > +---++ > |Description_1.1|Title1.1| > |Description_1.2|Title1.2| > |Description_1.3|Title1.3| > +---++ > > > > > From: Talap, Amol <amol.ta...@capgemini.co

RE: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Talap, Amol
Thanks so much Zhang. This definitely helps. From: Yong Zhang [mailto:java8...@hotmail.com] Sent: Thursday, June 29, 2017 4:59 PM To: Talap, Amol; Judit Planas; user@spark.apache.org Subject: Re: SparkSQL to read XML Blob data to create multiple rows scala>spark.version res6: String = 2.

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Yong Zhang
+---++ |Description_1.1|Title1.1| |Description_1.2|Title1.2| |Description_1.3|Title1.3| +---++ From: Talap, Amol <amol.ta...@capgemini.com> Sent: Thursday, June 29, 2017 9:38 AM To: Judit Planas; user@spark.apache.org Subje

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Judit Planas
rets, Eva Regards, Amol *From:*Judit Planas [mailto:judit.pla...@epfl.ch] *Sent:* Thursday, June 29, 2017 3:46 AM *To:* user@spark.apache.org *Subject:* Re: SparkSQL to read XML Blob data to create multiple rows Hi Amol, Not sure I understand completely your question, but the SQL function "ex

RE: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Talap, Amol
: Judit Planas [mailto:judit.pla...@epfl.ch] Sent: Thursday, June 29, 2017 3:46 AM To: user@spark.apache.org Subject: Re: SparkSQL to read XML Blob data to create multiple rows Hi Amol, Not sure I understand completely your question, but the SQL function "explode" may help you: http://s

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Judit Planas
Hi Amol, Not sure I understand completely your question, but the SQL function "explode" may help you: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode Here you can find a nice example: https://stackoverflow.com/questions/38210507/explode-in-pyspark
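
A tiny Scala sketch of explode, which turns one row holding a collection into one row per element (the data and column names are made up to mirror the XML example in this thread):

    import org.apache.spark.sql.functions._
    import spark.implicits._

    val parsed = Seq(
      (1, Seq("Title1.1", "Title1.2", "Title1.3"))
    ).toDF("SequenceID", "Titles")

    // One output row per element of the Titles array.
    parsed.select(col("SequenceID"), explode(col("Titles")).as("Title")).show()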

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread ayan guha
Hi Not sure if I follow your issue. Can you please post output of books_inexp.show()? On Thu, Jun 29, 2017 at 2:30 PM, Talap, Amol wrote: > Hi: > > > > We are trying to parse XML data to get below output from given input > sample. > > Can someone suggest a way to pass

SparkSQL to read XML Blob data to create multiple rows

2017-06-28 Thread Talap, Amol
Hi: We are trying to parse XML data to get below output from given input sample. Can someone suggest a way to pass one DFrames output into load() function or any other alternative to get this output. Input Data from Oracle Table XMLBlob: SequenceID Name City XMLComment 1 Amol Kolhapur

Re: Bizzare diff in behavior between scala REPL and sparkSQL UDF

2017-06-20 Thread jeff saremi
a REPL and sparkSQL UDF I have this function which does regex matching in scala. When I test it in the REPL I get the expected results. When I use it as a UDF in sparkSQL I get completely incorrect results. Function: class UrlFilter (filters: Seq[String]) extends Serializable { val regexFilters = filte
