Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-02 Thread Sean Owen
That isn't the issue - the table does not exist anyway, but the storage
path does.
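
For anyone reproducing this, a minimal sketch of clearing the leftover storage
path before re-running the CREATE TABLE statement, assuming the local path used
later in this thread (adjust for HDFS or S3 as needed):

import os
import shutil

OUTPUT_DELTA_PATH = './output/delta/'  # same local path as in the code further down

# The table was never created, but an earlier run left files at this location,
# so remove them to make the location empty again before re-running CREATE TABLE.
if os.path.exists(OUTPUT_DELTA_PATH):
    shutil.rmtree(OUTPUT_DELTA_PATH)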

On Tue, Aug 2, 2022 at 6:48 AM Stelios Philippou  wrote:

> Hi Kumba.
>
> SQL Structure is a bit different for
> CREATE OR REPLACE TABLE
>
>
> You can only do the following
> CREATE TABLE IF NOT EXISTS
>
>
>
>
> https://spark.apache.org/docs/3.3.0/sql-ref-syntax-ddl-create-table-datasource.html
>
> On Tue, 2 Aug 2022 at 14:38, Sean Owen  wrote:
>
>> I don't think "CREATE OR REPLACE TABLE" exists (in SQL?); this isn't a
>> VIEW.
>> Delete the path first; that's simplest.
>>
>> On Tue, Aug 2, 2022 at 12:55 AM Kumba Janga  wrote:
>>
>>> Thanks Sean! That was a simple fix. I changed it to "Create or Replace
>>> Table" but now I am getting the following error. I am still researching
>>> solutions but so far no luck.
>>>
>>> ParseException:
>>> mismatched input '<EOF>' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 
>>> 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 
>>> 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 
>>> 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 
>>> 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 
>>> 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 
>>> 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 
>>> 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', 
>>> DATABASES, 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 
>>> 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 
>>> 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 
>>> 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 
>>> 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 
>>> 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 
>>> 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 
>>> 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 
>>> 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 
>>> 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 
>>> 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 
>>> 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', 
>>> NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 
>>> 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 
>>> 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 
>>> 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 
>>> 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 
>>> 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 
>>> 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 
>>> 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 
>>> 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 
>>> 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 
>>> 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 
>>> 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TO', 
>>> 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 
>>> 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 
>>> 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 
>>> 'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', IDENTIFIER, 
>>> BACKQUOTED_IDENTIFIER}(line 1, pos 23)
>>>
>>> == SQL ==
>>> CREATE OR REPLACE TABLE
>>>
>>>
>>> On Mon, Aug 1, 2022 at 8:32 PM Sean Owen  wrote:
>>>
 Pretty much what it says? you are creating a table over a path that
 already has data in it. You can't do that without mode=overwrite at least,
 if that's what you intend.

 On Mon, Aug 1, 2022 at 7:29 PM Kumba Janga  wrote:

>
>
>- Component: Spark Delta, Spark SQL
>- Level: Beginner
>- Scenario: Debug, How-to
>
> *Python in Jupyter:*
>
> import pyspark
> import pyspark.sql.functions
>
> from pyspark.sql import SparkSession
> spark = (
> SparkSession
> .builder
> .appName("programming")
> .master("local")
> .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
> .config("spark.sql.extensions", 
> "io.delta.sql.DeltaSparkSessionExtension")
> .config("spark.sql.catalog.spark_catalog", 
> "org.apache.spark.sql.delta.catalog.DeltaCatalog")
> .config('spark.ui.port', '4050')
> .getOrCreate()
>
> )
> from delta import *
>
> string_20210609 = '''worked_date,worker_id,delete_flag,hours_worked
> 2021-06-09,1001,Y,7
> 2021-06-09,1002,Y,3.75
> 2021-06-09,1003,Y,7.5
> 2021-06-09,1004,Y,6.25'''
>

Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-02 Thread Stelios Philippou
Hi Kumba.

SQL Structure is a bit different for
CREATE OR REPLACE TABLE


You can only do the following
CREATE TABLE IF NOT EXISTS



https://spark.apache.org/docs/3.3.0/sql-ref-syntax-ddl-create-table-datasource.html
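
For reference, a minimal sketch of the supported form from that page, run
through spark.sql (this is essentially the statement used later in the thread;
the database and table names come from the thread and the path is a
placeholder):

spark.sql('CREATE DATABASE IF NOT EXISTS exercise')

spark.sql("""
CREATE TABLE IF NOT EXISTS exercise.worked_hours (
    worked_date  date,
    worker_id    int,
    delete_flag  string,
    hours_worked double
)
USING DELTA
PARTITIONED BY (worked_date)
LOCATION '/tmp/output/delta/'
""")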

On Tue, 2 Aug 2022 at 14:38, Sean Owen  wrote:

> I don't think "CREATE OR REPLACE TABLE" exists (in SQL?); this isn't a
> VIEW.
> Delete the path first; that's simplest.
>
> On Tue, Aug 2, 2022 at 12:55 AM Kumba Janga  wrote:
>
>> Thanks Sean! That was a simple fix. I changed it to "Create or Replace
>> Table" but now I am getting the following error. I am still researching
>> solutions but so far no luck.
>>
>> ParseException:
>> mismatched input '<EOF>' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 
>> 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 
>> 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 
>> 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 
>> 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 
>> 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 
>> 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 
>> 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', 
>> DATABASES, 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 
>> 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 
>> 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 
>> 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 
>> 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 
>> 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 
>> 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 
>> 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 
>> 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 
>> 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 
>> 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 
>> 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', NOT, 
>> 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 
>> 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE', 
>> 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'PLACING', 
>> 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PURGE', 
>> 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 
>> 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 
>> 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 
>> 'ROWS', 'SCHEMA', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 
>> 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 
>> 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', 
>> 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 
>> 'TERMINATED', 'THEN', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 
>> 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 'TRUE', 'TRUNCATE', 'TYPE', 
>> 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNKNOWN', 'UNLOCK', 
>> 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'VIEWS', 
>> 'WHEN', 'WHERE', 'WINDOW', 'WITH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 
>> 1, pos 23)
>>
>> == SQL ==
>> CREATE OR REPLACE TABLE
>>
>>
>> On Mon, Aug 1, 2022 at 8:32 PM Sean Owen  wrote:
>>
>>> Pretty much what it says? you are creating a table over a path that
>>> already has data in it. You can't do that without mode=overwrite at least,
>>> if that's what you intend.
>>>
>>> On Mon, Aug 1, 2022 at 7:29 PM Kumba Janga  wrote:
>>>


- Component: Spark Delta, Spark SQL
- Level: Beginner
- Scenario: Debug, How-to

 *Python in Jupyter:*

 import pyspark
 import pyspark.sql.functions

 from pyspark.sql import SparkSession
 spark = (
 SparkSession
 .builder
 .appName("programming")
 .master("local")
 .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
 .config("spark.sql.extensions", 
 "io.delta.sql.DeltaSparkSessionExtension")
 .config("spark.sql.catalog.spark_catalog", 
 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
 .config('spark.ui.port', '4050')
 .getOrCreate()

 )
 from delta import *

 string_20210609 = '''worked_date,worker_id,delete_flag,hours_worked
 2021-06-09,1001,Y,7
 2021-06-09,1002,Y,3.75
 2021-06-09,1003,Y,7.5
 2021-06-09,1004,Y,6.25'''

 rdd_20210609 = spark.sparkContext.parallelize(string_20210609.split('\n'))

 # FILES WILL SHOW UP ON THE LEFT UNDER THE FOLDER ICON IF YOU WANT TO BROWSE THEM
 OUTPUT_DELTA_PATH = './output/delta/'

 spark.sql('CREATE DATABASE IF 

Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-02 Thread Sean Owen
I don't think "CREATE OR REPLACE TABLE" exists (in SQL?); this isn't a VIEW.
Delete the path first; that's simplest.

On Tue, Aug 2, 2022 at 12:55 AM Kumba Janga  wrote:

> Thanks Sean! That was a simple fix. I changed it to "Create or Replace
> Table" but now I am getting the following error. I am still researching
> solutions but so far no luck.
>
> ParseException:
> mismatched input '<EOF>' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 
> 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 
> 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 
> 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 
> 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 
> 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 
> 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 
> 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', 
> DATABASES, 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 
> 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 
> 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 
> 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 
> 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 
> 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 
> 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 
> 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 
> 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 
> 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 
> 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 
> 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', NOT, 
> 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 
> 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE', 
> 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'PLACING', 
> 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PURGE', 
> 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 
> 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 
> 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 
> 'ROWS', 'SCHEMA', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 
> 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 
> 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', 
> 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 
> 'TERMINATED', 'THEN', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 
> 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 
> 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 
> 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 
> 'WINDOW', 'WITH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 23)
>
> == SQL ==
> CREATE OR REPLACE TABLE
>
>
> On Mon, Aug 1, 2022 at 8:32 PM Sean Owen  wrote:
>
>> Pretty much what it says? you are creating a table over a path that
>> already has data in it. You can't do that without mode=overwrite at least,
>> if that's what you intend.
>>
>> On Mon, Aug 1, 2022 at 7:29 PM Kumba Janga  wrote:
>>
>>>
>>>
>>>- Component: Spark Delta, Spark SQL
>>>- Level: Beginner
>>>- Scenario: Debug, How-to
>>>
>>> *Python in Jupyter:*
>>>
>>> import pyspark
>>> import pyspark.sql.functions
>>>
>>> from pyspark.sql import SparkSession
>>> spark = (
>>> SparkSession
>>> .builder
>>> .appName("programming")
>>> .master("local")
>>> .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
>>> .config("spark.sql.extensions", 
>>> "io.delta.sql.DeltaSparkSessionExtension")
>>> .config("spark.sql.catalog.spark_catalog", 
>>> "org.apache.spark.sql.delta.catalog.DeltaCatalog")
>>> .config('spark.ui.port', '4050')
>>> .getOrCreate()
>>>
>>> )
>>> from delta import *
>>>
>>> string_20210609 = '''worked_date,worker_id,delete_flag,hours_worked
>>> 2021-06-09,1001,Y,7
>>> 2021-06-09,1002,Y,3.75
>>> 2021-06-09,1003,Y,7.5
>>> 2021-06-09,1004,Y,6.25'''
>>>
>>> rdd_20210609 = spark.sparkContext.parallelize(string_20210609.split('\n'))
>>>
>>> # FILES WILL SHOW UP ON THE LEFT UNDER THE FOLDER ICON IF YOU WANT TO BROWSE THEM
>>> OUTPUT_DELTA_PATH = './output/delta/'
>>>
>>> spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')
>>>
>>> spark.sql('''
>>> CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
>>> worked_date date
>>> , worker_id int
>>> , delete_flag string
>>> , hours_worked double
>>> ) USING DELTA
>>>
>>>
>>> PARTITIONED BY (worked_date)
>>> LOCATION "{0}"
>>> '''.format(OUTPUT_DELTA_PATH)
>>> )
>>>
>>> *Error 

Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-02 Thread ayan guha
Hi

I strongly suggest printing the prepared SQL statements and trying them in raw
form. The error you posted points to a syntax error.
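
For example, a small sketch of that debugging step against the statement from
this thread (spark and OUTPUT_DELTA_PATH are the session and variable defined
in the quoted code below):

sql_text = '''
CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
    worked_date date
    , worker_id int
    , delete_flag string
    , hours_worked double
) USING DELTA
PARTITIONED BY (worked_date)
LOCATION "{0}"
'''.format(OUTPUT_DELTA_PATH)

# Print the fully substituted statement and eyeball it (or paste it into
# spark-sql) before handing it to spark.sql(); an empty or truncated string
# shows up immediately.
print(sql_text)
spark.sql(sql_text)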

On Tue, 2 Aug 2022 at 3:56 pm, Kumba Janga  wrote:

> Thanks Sean! That was a simple fix. I changed it to "Create or Replace
> Table" but now I am getting the following error. I am still researching
> solutions but so far no luck.
>
> ParseException:
> mismatched input '<EOF>' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 
> 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 
> 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 
> 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 
> 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 
> 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 
> 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 
> 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', 
> DATABASES, 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 
> 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 
> 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 
> 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 
> 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 
> 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 
> 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 
> 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 
> 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 
> 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 
> 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 
> 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', NOT, 
> 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 
> 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE', 
> 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'PLACING', 
> 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PURGE', 
> 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 
> 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 
> 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 
> 'ROWS', 'SCHEMA', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 
> 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 
> 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', 
> 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 
> 'TERMINATED', 'THEN', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 
> 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 
> 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 
> 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 
> 'WINDOW', 'WITH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 23)
>
> == SQL ==
> CREATE OR REPLACE TABLE
>
>
> On Mon, Aug 1, 2022 at 8:32 PM Sean Owen  wrote:
>
>> Pretty much what it says? you are creating a table over a path that
>> already has data in it. You can't do that without mode=overwrite at least,
>> if that's what you intend.
>>
>> On Mon, Aug 1, 2022 at 7:29 PM Kumba Janga  wrote:
>>
>>>
>>>
>>>- Component: Spark Delta, Spark SQL
>>>- Level: Beginner
>>>- Scenario: Debug, How-to
>>>
>>> *Python in Jupyter:*
>>>
>>> import pyspark
>>> import pyspark.sql.functions
>>>
>>> from pyspark.sql import SparkSession
>>> spark = (
>>> SparkSession
>>> .builder
>>> .appName("programming")
>>> .master("local")
>>> .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
>>> .config("spark.sql.extensions", 
>>> "io.delta.sql.DeltaSparkSessionExtension")
>>> .config("spark.sql.catalog.spark_catalog", 
>>> "org.apache.spark.sql.delta.catalog.DeltaCatalog")
>>> .config('spark.ui.port', '4050')
>>> .getOrCreate()
>>>
>>> )
>>> from delta import *
>>>
>>> string_20210609 = '''worked_date,worker_id,delete_flag,hours_worked
>>> 2021-06-09,1001,Y,7
>>> 2021-06-09,1002,Y,3.75
>>> 2021-06-09,1003,Y,7.5
>>> 2021-06-09,1004,Y,6.25'''
>>>
>>> rdd_20210609 = spark.sparkContext.parallelize(string_20210609.split('\n'))
>>>
>>> # FILES WILL SHOW UP ON THE LEFT UNDER THE FOLDER ICON IF YOU WANT TO BROWSE THEM
>>> OUTPUT_DELTA_PATH = './output/delta/'
>>>
>>> spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')
>>>
>>> spark.sql('''
>>> CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
>>> worked_date date
>>> , worker_id int
>>> , delete_flag string
>>> , hours_worked double
>>> ) USING DELTA
>>>
>>>
>>> PARTITIONED BY (worked_date)
>>> LOCATION "{0}"
>>> '''.format(OUTPUT_DELTA_PATH)
>>> )
>>>
>>> 

Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-01 Thread Kumba Janga
Thanks Sean! That was a simple fix. I changed it to "Create or Replace
Table" but now I am getting the following error. I am still researching
solutions but so far no luck.

ParseException:
mismatched input '<EOF>' expecting {'ADD', 'AFTER', 'ALL', 'ALTER',
'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC',
'AT', 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY',
'CACHE', 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR',
'CLUSTER', 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN',
'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE',
'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE',
'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP',
'CURRENT_USER', 'DATA', 'DATABASE', DATABASES, 'DBPROPERTIES',
'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS',
'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP',
'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS',
'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE',
'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING',
'FOR', 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION',
'FUNCTIONS', 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'IF',
'IGNORE', 'IMPORT', 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH',
'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 'INTO', 'IS',
'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 'LEFT',
'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK',
'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 'MSCK',
'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS',
'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT',
'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE',
'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT',
'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS',
'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER',
'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH',
'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT',
RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA',
'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES',
'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME',
'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY',
'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE',
'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TO', 'TOUCH',
'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM',
'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE',
'UNION', 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE',
'USER', 'USING', 'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW',
'WITH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 23)

== SQL ==
CREATE OR REPLACE TABLE


On Mon, Aug 1, 2022 at 8:32 PM Sean Owen  wrote:

> Pretty much what it says? you are creating a table over a path that
> already has data in it. You can't do that without mode=overwrite at least,
> if that's what you intend.
>
> On Mon, Aug 1, 2022 at 7:29 PM Kumba Janga  wrote:
>
>>
>>
>>- Component: Spark Delta, Spark SQL
>>- Level: Beginner
>>- Scenario: Debug, How-to
>>
>> *Python in Jupyter:*
>>
>> import pyspark
>> import pyspark.sql.functions
>>
>> from pyspark.sql import SparkSession
>> spark = (
>> SparkSession
>> .builder
>> .appName("programming")
>> .master("local")
>> .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
>> .config("spark.sql.extensions", 
>> "io.delta.sql.DeltaSparkSessionExtension")
>> .config("spark.sql.catalog.spark_catalog", 
>> "org.apache.spark.sql.delta.catalog.DeltaCatalog")
>> .config('spark.ui.port', '4050')
>> .getOrCreate()
>>
>> )
>> from delta import *
>>
>> string_20210609 = '''worked_date,worker_id,delete_flag,hours_worked
>> 2021-06-09,1001,Y,7
>> 2021-06-09,1002,Y,3.75
>> 2021-06-09,1003,Y,7.5
>> 2021-06-09,1004,Y,6.25'''
>>
>> rdd_20210609 = spark.sparkContext.parallelize(string_20210609.split('\n'))
>>
>> # FILES WILL SHOW UP ON THE LEFT UNDER THE FOLDER ICON IF YOU WANT TO BROWSE THEM
>> OUTPUT_DELTA_PATH = './output/delta/'
>>
>> spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')
>>
>> spark.sql('''
>> CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
>> worked_date date
>> , worker_id int
>> , delete_flag string
>> , hours_worked double
>> ) USING DELTA
>>
>>
>> PARTITIONED BY (worked_date)
>> LOCATION "{0}"
>> '''.format(OUTPUT_DELTA_PATH)
>> )
>>
>> *Error Message:*
>>
>> AnalysisException                        Traceback (most recent call last)
>> <ipython-input> in <module>
>>       4 spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')
>>       5
>> ----> 6 spark.sql('''
>>       7 CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
>>       8     worked_date date
>> 

Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-01 Thread Sean Owen
Pretty much what it says? you are creating a table over a path that already
has data in it. You can't do that without mode=overwrite at least, if
that's what you intend.
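
As a rough sketch of the mode=overwrite route in PySpark: df below is a
hypothetical DataFrame standing in for the parsed CSV rows in the code quoted
further down, and the path is the one used there.

# Overwrite whatever already sits at the target location instead of failing on it.
(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("worked_date")
   .save("./output/delta/"))

# If SQL access is needed, point a table at the freshly written location.
spark.sql("CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS USING DELTA LOCATION './output/delta/'")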

On Mon, Aug 1, 2022 at 7:29 PM Kumba Janga  wrote:

>
>
>- Component: Spark Delta, Spark SQL
>- Level: Beginner
>- Scenario: Debug, How-to
>
> *Python in Jupyter:*
>
> import pyspark
> import pyspark.sql.functions
>
> from pyspark.sql import SparkSession
> spark = (
> SparkSession
> .builder
> .appName("programming")
> .master("local")
> .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
> .config("spark.sql.extensions", 
> "io.delta.sql.DeltaSparkSessionExtension")
> .config("spark.sql.catalog.spark_catalog", 
> "org.apache.spark.sql.delta.catalog.DeltaCatalog")
> .config('spark.ui.port', '4050')
> .getOrCreate()
>
> )
> from delta import *
>
> string_20210609 = '''worked_date,worker_id,delete_flag,hours_worked
> 2021-06-09,1001,Y,7
> 2021-06-09,1002,Y,3.75
> 2021-06-09,1003,Y,7.5
> 2021-06-09,1004,Y,6.25'''
>
> rdd_20210609 = spark.sparkContext.parallelize(string_20210609.split('\n'))
>
> # FILES WILL SHOW UP ON THE LEFT UNDER THE FOLDER ICON IF YOU WANT TO BROWSE THEM
> OUTPUT_DELTA_PATH = './output/delta/'
>
> spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')
>
> spark.sql('''
> CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
> worked_date date
> , worker_id int
> , delete_flag string
> , hours_worked double
> ) USING DELTA
>
>
> PARTITIONED BY (worked_date)
> LOCATION "{0}"
> '''.format(OUTPUT_DELTA_PATH)
> )
>
> *Error Message:*
>
> AnalysisException                        Traceback (most recent call last)
> <ipython-input> in <module>
>       4 spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')
>       5
> ----> 6 spark.sql('''
>       7 CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
>       8     worked_date date
>
> /Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\session.py in sql(self, sqlQuery)
>     647         [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
>     648         """
> --> 649         return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>     650
>     651     @since(2.0)
>
> \Users\kyjan\spark-3.0.3-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py in __call__(self, *args)
>    1302
>    1303         answer = self.gateway_client.send_command(command)
> -> 1304         return_value = get_return_value(
>    1305             answer, self.gateway_client, self.target_id, self.name)
>    1306
>
> /Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
>     132                 # Hide where the exception came from that shows a non-Pythonic
>     133                 # JVM exception message.
> --> 134                 raise_from(converted)
>     135             else:
>     136                 raise
>
> /Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\utils.py in raise_from(e)
>
> AnalysisException: Cannot create table ('`EXERCISE`.`WORKED_HOURS`'). The associated location ('output/delta') is not empty.;
>
>
> --
> Best Wishes,
> Kumba Janga
>
> "The only way of finding the limits of the possible is by going beyond
> them into the impossible"
> -Arthur C. Clarke
>


Re: [Spark SQL]: Does Spark SQL support WAITFOR?

2022-05-19 Thread Someshwar Kale
Hi Ram,

Have you seen this Stack Overflow question and its answers?
https://stackoverflow.com/questions/39685744/apache-spark-how-to-cancel-job-in-code-and-kill-running-tasks
If not, please have a look; it seems to deal with a similar problem.
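
For reference, a rough sketch of the pattern discussed there, using PySpark job
groups to cancel a query that overruns a timeout (the group name, timeout, and
table name are made up for illustration):

import threading

def run_query():
    spark.sparkContext.setJobGroup("timeout-test", "long running query")
    spark.sql("SELECT COUNT(*) FROM big_table").show()  # stand-in for the slow query

t = threading.Thread(target=run_query)
t.start()
t.join(timeout=60)  # wait up to 60 seconds
if t.is_alive():
    # Cancel every job started under the group if the query is still running.
    spark.sparkContext.cancelJobGroup("timeout-test")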

*Regards,*
*Someshwar Kale*


On Fri, May 20, 2022 at 7:34 AM Artemis User  wrote:

> WAITFOR is part of the Transact-SQL and it's Microsoft SQL server
> specific, not supported by Spark SQL.  If you want to impose a delay in a
> Spark program, you may want to use the thread sleep function in Java or
> Scala.  Hope this helps...
>
> On 5/19/22 1:45 PM, K. N. Ramachandran wrote:
>
> Hi Sean,
>
> I'm trying to test a timeout feature in a tool that uses Spark SQL.
> Basically, if a long-running query exceeds a configured threshold, then the
> query should be canceled.
> I couldn't see a simple way to make a "sleep" SQL statement to test the
> timeout. Instead, I just ran a "select count(*) from table" on a large
> table to act as a query with a long duration.
>
> Is there any way to trigger a "sleep" like behavior in Spark SQL?
>
> Regards,
> Ram
>
> On Tue, May 17, 2022 at 4:23 PM Sean Owen  wrote:
>
>> I don't think that is standard SQL? what are you trying to do, and why
>> not do it outside SQL?
>>
>> On Tue, May 17, 2022 at 6:03 PM K. N. Ramachandran 
>> wrote:
>>
>>> Gentle ping. Any info here would be great.
>>>
>>> Regards,
>>> Ram
>>>
>>> On Sun, May 15, 2022 at 5:16 PM K. N. Ramachandran 
>>> wrote:
>>>
 Hello Spark Users Group,

 I've just recently started working on tools that use Apache Spark.
 When I try WAITFOR in the spark-sql command line, I just get:

 Error: Error running query:
 org.apache.spark.sql.catalyst.parser.ParseException:

 mismatched input 'WAITFOR' expecting (.. list of allowed commands..)


 1) Why is WAITFOR not allowed? Is there another way to get a process to
 sleep for a desired period of time? I'm trying to test a timeout issue and
 need to simulate a sleep behavior.


 2) Is there documentation that outlines why WAITFOR is not supported? I
 did not find any good matches searching online.

 Thanks,
 Ram

>>>
>>>
>>> --
>>> K.N.Ramachandran
>>> Ph: 814-441-4279
>>>
>>
>
> --
> K.N.Ramachandran
> Ph: 814-441-4279
>
>
>


Re: [Spark SQL]: Does Spark SQL support WAITFOR?

2022-05-19 Thread Artemis User
WAITFOR is part of Transact-SQL and is Microsoft SQL Server specific; it is
not supported by Spark SQL.  If you want to impose a delay in a Spark
program, you may want to use the thread sleep function in Java or Scala.
Hope this helps...
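
A minimal PySpark sketch of that idea (the table name is a placeholder): either
sleep on the driver around the statement under test, or put the sleep inside a
UDF so the delay happens in the executors.

import time
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# Option 1: driver-side delay around the statement under test.
time.sleep(30)
spark.sql("SELECT COUNT(*) FROM some_table").show()  # placeholder table

# Option 2: a UDF that sleeps per row, so the job itself runs long.
@udf(returnType=LongType())
def slow(x):
    time.sleep(1)  # one second per input row
    return x

spark.range(120).select(slow("id")).collect()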


On 5/19/22 1:45 PM, K. N. Ramachandran wrote:

Hi Sean,

I'm trying to test a timeout feature in a tool that uses Spark SQL. 
Basically, if a long-running query exceeds a configured threshold, 
then the query should be canceled.
I couldn't see a simple way to make a "sleep" SQL statement to test 
the timeout. Instead, I just ran a "select count(*) from table" on a 
large table to act as a query with a long duration.


Is there any way to trigger a "sleep" like behavior in Spark SQL?

Regards,
Ram

On Tue, May 17, 2022 at 4:23 PM Sean Owen  wrote:

I don't think that is standard SQL? what are you trying to do, and
why not do it outside SQL?

On Tue, May 17, 2022 at 6:03 PM K. N. Ramachandran
 wrote:

Gentle ping. Any info here would be great.

Regards,
Ram

On Sun, May 15, 2022 at 5:16 PM K. N. Ramachandran
 wrote:

Hello Spark Users Group,

I've just recently started working on tools that use
Apache Spark.
When I try WAITFOR in the spark-sql command line, I just get:

Error: Error running query:
org.apache.spark.sql.catalyst.parser.ParseException:

mismatched input 'WAITFOR' expecting (.. list of allowed
commands..)


1) Why is WAITFOR not allowed? Is there another way to get
a process to sleep for a desired period of time? I'm
trying to test a timeout issue and need to simulate a
sleep behavior.


2) Is there documentation that outlines why WAITFOR is not
supported? I did not find any good matches searching online.


Thanks,
Ram



-- 
K.N.Ramachandran

Ph: 814-441-4279



--
K.N.Ramachandran
Ph: 814-441-4279


Re: [Spark SQL]: Does Spark SQL support WAITFOR?

2022-05-19 Thread K. N. Ramachandran
Hi Sean,

I'm trying to test a timeout feature in a tool that uses Spark SQL.
Basically, if a long-running query exceeds a configured threshold, then the
query should be canceled.
I couldn't see a simple way to make a "sleep" SQL statement to test the
timeout. Instead, I just ran a "select count(*) from table" on a large
table to act as a query with a long duration.

Is there any way to trigger a "sleep" like behavior in Spark SQL?

Regards,
Ram

On Tue, May 17, 2022 at 4:23 PM Sean Owen  wrote:

> I don't think that is standard SQL? what are you trying to do, and why not
> do it outside SQL?
>
> On Tue, May 17, 2022 at 6:03 PM K. N. Ramachandran 
> wrote:
>
>> Gentle ping. Any info here would be great.
>>
>> Regards,
>> Ram
>>
>> On Sun, May 15, 2022 at 5:16 PM K. N. Ramachandran 
>> wrote:
>>
>>> Hello Spark Users Group,
>>>
>>> I've just recently started working on tools that use Apache Spark.
>>> When I try WAITFOR in the spark-sql command line, I just get:
>>>
>>> Error: Error running query:
>>> org.apache.spark.sql.catalyst.parser.ParseException:
>>>
>>> mismatched input 'WAITFOR' expecting (.. list of allowed commands..)
>>>
>>>
>>> 1) Why is WAITFOR not allowed? Is there another way to get a process to
>>> sleep for a desired period of time? I'm trying to test a timeout issue and
>>> need to simulate a sleep behavior.
>>>
>>>
>>> 2) Is there documentation that outlines why WAITFOR is not supported? I
>>> did not find any good matches searching online.
>>>
>>> Thanks,
>>> Ram
>>>
>>
>>
>> --
>> K.N.Ramachandran
>> Ph: 814-441-4279
>>
>

-- 
K.N.Ramachandran
Ph: 814-441-4279


Re: [Spark SQL]: Does Spark SQL support WAITFOR?

2022-05-17 Thread Sean Owen
I don't think that is standard SQL? what are you trying to do, and why not
do it outside SQL?

On Tue, May 17, 2022 at 6:03 PM K. N. Ramachandran 
wrote:

> Gentle ping. Any info here would be great.
>
> Regards,
> Ram
>
> On Sun, May 15, 2022 at 5:16 PM K. N. Ramachandran 
> wrote:
>
>> Hello Spark Users Group,
>>
>> I've just recently started working on tools that use Apache Spark.
>> When I try WAITFOR in the spark-sql command line, I just get:
>>
>> Error: Error running query:
>> org.apache.spark.sql.catalyst.parser.ParseException:
>>
>> mismatched input 'WAITFOR' expecting (.. list of allowed commands..)
>>
>>
>> 1) Why is WAITFOR not allowed? Is there another way to get a process to
>> sleep for a desired period of time? I'm trying to test a timeout issue and
>> need to simulate a sleep behavior.
>>
>>
>> 2) Is there documentation that outlines why WAITFOR is not supported? I
>> did not find any good matches searching online.
>>
>> Thanks,
>> Ram
>>
>
>
> --
> K.N.Ramachandran
> Ph: 814-441-4279
>


Re: [Spark SQL]: Does Spark SQL support WAITFOR?

2022-05-17 Thread K. N. Ramachandran
Gentle ping. Any info here would be great.

Regards,
Ram

On Sun, May 15, 2022 at 5:16 PM K. N. Ramachandran 
wrote:

> Hello Spark Users Group,
>
> I've just recently started working on tools that use Apache Spark.
> When I try WAITFOR in the spark-sql command line, I just get:
>
> Error: Error running query:
> org.apache.spark.sql.catalyst.parser.ParseException:
>
> mismatched input 'WAITFOR' expecting (.. list of allowed commands..)
>
>
> 1) Why is WAITFOR not allowed? Is there another way to get a process to
> sleep for a desired period of time? I'm trying to test a timeout issue and
> need to simulate a sleep behavior.
>
>
> 2) Is there documentation that outlines why WAITFOR is not supported? I
> did not find any good matches searching online.
>
> Thanks,
> Ram
>


-- 
K.N.Ramachandran
Ph: 814-441-4279


RE: DataSource API v2 & Spark-SQL

2021-07-02 Thread Lavelle, Shawn
Thanks for following up, I will give this a go!

 ~  Shawn

-Original Message-
From: roizaig 
Sent: Thursday, April 29, 2021 7:42 AM
To: user@spark.apache.org
Subject: Re: DataSource API v2 & Spark-SQL

You can create a custom data source following this blog:
http://roizaig.blogspot.com/2021/04/create-custom-data-source-with-spark-3x.html
It shows how to read a Java log file using the Spark v3 API as an example.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org




[OSI Logo]
Shawn Lavelle

Software Development

OSI Digital Grid Solutions
4101 Arrowhead Drive
Medina, Minnesota 55340-9457
Phone: 763 551 0559
Email: shawn.lave...@osii.com
Website: www.osii.com<https://www.osii.com>
[Emerson Logo]
We are proud to
now be a part of
Emerson.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: DataSource API v2 & Spark-SQL

2021-04-29 Thread roizaig
You can create a custom data source following this blog:
http://roizaig.blogspot.com/2021/04/create-custom-data-source-with-spark-3x.html
It shows how to read a Java log file using the Spark v3 API as an example.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark SQL]: Does Spark SQL can have better performance?

2021-04-29 Thread Mich Talebzadeh
Hi,

your query

parquetFile = spark.read.parquet("path/to/hdfs")
parquetFile.createOrReplaceTempView("parquetFile")
spark.sql("SELECT * FROM parquetFile WHERE field1 = 'value' ORDER BY
timestamp LIMIT 1")

will be lazily evaluated and won't do anything until the sql statement is
actioned with .show etc

In local mode, there is only one executor. Assuming you are actioning the
sql statement, it will have to do a full table scan to find field1 =
'value'

scala> spark.sql("""select * from tmp where 'Account Type' = 'abc' limit
1000""").explain()
== Physical Plan ==
LocalTableScan , [Date#16,  Type#17,  Description#18,  Value#19,
Balance#20,  Account Name#21,  Account Number#22]

Try actioning it and see what happens
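
For example, a minimal way to force evaluation and look at the plan for the
same query (the path and column names are the ones from the question quoted
below):

parquetFile = spark.read.parquet("path/to/hdfs")
parquetFile.createOrReplaceTempView("parquetFile")

df = spark.sql("""SELECT * FROM parquetFile
                  WHERE field1 = 'value'
                  ORDER BY timestamp LIMIT 1""")

df.explain(True)  # inspect the physical plan (scan type, pushed filters) without running it
df.show()         # this action is what actually triggers the scan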


HTH


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 29 Apr 2021 at 07:30, Amin Borjian 
wrote:

> Hi.
>
> We use spark 3.0.1 in HDFS cluster and we store our files as parquet with
> snappy compression and enabled dictionary. We try to perform a simple query:
>
> parquetFile = spark.read.parquet("path/to/hadf")
> parquetFile.createOrReplaceTempView("parquetFile")
> spark.sql("SELECT * FROM parquetFile WHERE field1 = 'value' ORDER BY
> timestamp LIMIT 1")
>
> Condition in 'where' clause is selected so that no record is selected
> (matched) for this query. (on purpose) However, this query takes about
> 5-6 minutes to complete on our cluster (with NODE_LOCAL) and a simple spark
> configuration. (Input data is about 8TB in the following tests but can be
> much more)
>
> We decided to test the consumption of disk, network, memory and CPU
> resources in order to detect bottlenecks in this query. However, we came
> to much more strange results, which we will discuss in the following.
>
> We provided dashboards of each network, disk, memory, and CPU usage by
> monitoring tools so that we could check the conditions when running the
> query.
>
> 1) First, we increased the amount of CPU allocated to Spark from the
> initial value to 2 and then about 2.5 times. Although the last increase
> in the total amount of dedicated CPU, all of it was not used, we did not
> see any change in the duration of the query. (As a general point, in all
> tests, the execution times were increased or decreased between 0 and 20
> seconds, but in 5 minutes, these cases were insignificant)
>
> 2) Then we similarly increased the amount of memory allocated to Spark to
> 2 to 2.5 times its original value. In this case, in the last increase,
> the entire memory allocated to the spark was not used by query. But
> again, we did not see any change in the duration of the query.
>
> 3) In all these tests, we monitored the amount of network consumption and
> sending and receiving it in the whole cluster. We can run a query whose
> network consumption is 2 or almost 3 times the consumption of the query
> mentioned in this email, and this shows that we have not reached the
> maximum of the cluster network capacity in this query. Of course, it was
> strange to us why we need the network in a query that has no record and is
> completely node local, but we assumed that it probably needs it for a
> number of reasons, and with this assumption we were still very far from the
> network capacity.
>
> 4) In all these tests, we also checked the amount of writing and reading
> from the disk. In the same way as in the third case, we can write a query
> that is about 4 times the write and read rate of the query mentioned in the
> email, and our disks are much stronger. But the observation in this query
> shows that the write rate is almost zero (We were expecting it) and the
> read rate is running at a medium speed, which is very far from the amount
> of disk rate capacity and therefore cannot be a bottleneck.
>
> After all these tests and the query running time of 5 minutes, we did not
> know exactly what could be more restrictive, and it was strange to us that
> the simple query stated in the email needed such a run time (because with
> such execution time, heavier queries take longer).
>
> Does it make sense that the duration of the query is so long? Is there
> something we need to pay attention to or can we improve by changing it?
>
> Thanks,
> Amin Borjian
>


Re: Integration testing Framework Spark SQL Scala

2020-11-02 Thread Lars Albertsson
Hi,

Sorry for the very slow reply - I am far behind in my mailing list
subscriptions.

You'll find a few slides covering the topic in this presentation:
https://www.slideshare.net/lallea/test-strategies-for-data-processing-pipelines-67244458

Video here: https://vimeo.com/192429554

Regards,

Lars Albertsson
Data engineering entrepreneur
www.scling.com, www.mapflat.com
https://twitter.com/lalleal
+46 70 7687109

On Tue, Feb 25, 2020 at 7:46 PM Ruijing Li  wrote:
>
> Just wanted to follow up on this. If anyone has any advice, I’d be interested 
> in learning more!
>
> On Thu, Feb 20, 2020 at 6:09 PM Ruijing Li  wrote:
>>
>> Hi all,
>>
>> I’m interested in hearing the community’s thoughts on best practices to do 
>> integration testing for spark sql jobs. We run a lot of our jobs with cloud 
>> infrastructure and hdfs - this makes debugging a challenge for us, 
>> especially with problems that don’t occur from just initializing a 
>> sparksession locally or testing with spark-shell. Ideally, we’d like some 
>> sort of docker container emulating hdfs and spark cluster mode, that you can 
>> run locally.
>>
>> Any test framework, tips, or examples people can share? Thanks!
>> --
>> Cheers,
>> Ruijing Li
>
> --
> Cheers,
> Ruijing Li

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: DataSource API v2 & Spark-SQL

2020-08-03 Thread Lavelle, Shawn
Thanks for clarifying, Russell.  Is a Spark native catalog reference on the
roadmap for DSv2, or should I be trying to use something else?

~ Shawn

From: Russell Spitzer [mailto:russell.spit...@gmail.com]
Sent: Monday, August 3, 2020 8:27 AM
To: Lavelle, Shawn 
Cc: user 
Subject: Re: DataSource API v2 & Spark-SQL


That's a bad error message. Basically you can't make a spark native catalog 
reference for a dsv2 source. You have to use that Datasources catalog or use 
the programmatic API. Both dsv1 and dsv2 programattic apis work (plus or minus 
some options)

On Mon, Aug 3, 2020, 7:28 AM Lavelle, Shawn 
mailto:shawn.lave...@osii.com>> wrote:
Hello Spark community,
   I have a custom datasource in v1 API that I’m trying to port to v2 API, in 
Java.  Currently I have a DataSource registered via catalog.createTable(name, 
, schema, options map).  When trying to do this in data source API v2, 
I get an error saying my class (package) isn’t a valid data source Can you help 
me out?

Spark versions are 3.0.0 w/scala 2.12, artifacts are Spark-core, spark-sql, 
spark-hive, spark-hive-thriftserver, spark-catalyst

Here’s what the dataSource definition:  public class LogTableSource implements  
TableProvider,  SupportsRead,  DataSourceRegister, Serializable

I’m guessing that I am missing one of the required interfaces. Note, I did try 
this with using the LogTableSource below as “DefaultSource” but the behavior is 
the same.  Also, I keep reading about a DataSourceV2 Marker Interface, but it 
seems deprecated?

Also, I tried to add DataSourceV2ScanRelation but that won’t compile:
Output() in DataSourceV2ScanRelation cannot override Output() in QueryPlan 
return type Seq is not compatible with Seq

  I’m fairly stumped – everything I’ve read online says there’s a marker 
interface of some kind and yet I can’t find it in my package list.

  Looking forward to hearing from you,

~ Shawn





Shawn Lavelle

Software Development

4101 Arrowhead Drive
Medina, Minnesota 55340-9457
Phone: 763 551 0559
Email: shawn.lave...@osii.com<mailto:shawn.lave...@osii.com>
Website: www.osii.com<https://www.osii.com>


Re: DataSource API v2 & Spark-SQL

2020-08-03 Thread Russell Spitzer
That's a bad error message. Basically you can't make a Spark native catalog
reference for a DSv2 source. You have to use that DataSource's catalog or
use the programmatic API. Both DSv1 and DSv2 programmatic APIs work (plus
or minus some options).
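
As an illustration of the programmatic route from PySpark (the format name and
option below are hypothetical, standing in for whatever the custom source
registers):

# Load through the source directly instead of creating an entry in the
# built-in session catalog.
df = (spark.read
          .format("com.example.logtable")  # hypothetical short name / provider class
          .option("path", "/data/logs")    # hypothetical option understood by the source
          .load())

# A temp view still gives SQL access without touching the catalog.
df.createOrReplaceTempView("log_table")
spark.sql("SELECT * FROM log_table LIMIT 10").show()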

On Mon, Aug 3, 2020, 7:28 AM Lavelle, Shawn  wrote:

> Hello Spark community,
>
>I have a custom datasource in v1 API that I’m trying to port to v2 API,
> in Java.  Currently I have a DataSource registered via
> catalog.createTable(name, , schema, options map).  When trying to
> do this in data source API v2, I get an error saying my class (package)
> isn’t a valid data source Can you help me out?
>
>
>
> Spark versions are 3.0.0 w/scala 2.12, artifacts are Spark-core,
> spark-sql, spark-hive, spark-hive-thriftserver, spark-catalyst
>
>
>
> Here’s what the dataSource definition:  *public class LogTableSource
> implements  TableProvider,  SupportsRead,  DataSourceRegister, Serializable*
>
>
>
> I’m guessing that I am missing one of the required interfaces. Note, I did
> try this with using the LogTableSource below as “DefaultSource” but the
> behavior is the same.  Also, I keep reading about a DataSourceV2 Marker
> Interface, but it seems deprecated?
>
>
>
> Also, I tried to add *DataSourceV2ScanRelation* but that won’t compile:
>
> Output() in DataSourceV2ScanRelation cannot override Output() in QueryPlan
> return type Seq is not compatible with Seq
>
>
>
>   I’m fairly stumped – everything I’ve read online says there’s a marker
> interface of some kind and yet I can’t find it in my package list.
>
>
>
>   Looking forward to hearing from you,
>
>
>
> ~ Shawn
>
>
>
>
>
>
>
>
> [image: OSI]
> Shawn Lavelle
>
> Software Development
>
> 4101 Arrowhead Drive
> Medina, Minnesota 55340-9457
> Phone: 763 551 0559
> *Email:* shawn.lave...@osii.com
> *Website:* www.osii.com
>


Re: Integration testing Framework Spark SQL Scala

2020-02-25 Thread Ruijing Li
Just wanted to follow up on this. If anyone has any advice, I’d be
interested in learning more!

On Thu, Feb 20, 2020 at 6:09 PM Ruijing Li  wrote:

> Hi all,
>
> I’m interested in hearing the community’s thoughts on best practices to do
> integration testing for spark sql jobs. We run a lot of our jobs with cloud
> infrastructure and hdfs - this makes debugging a challenge for us,
> especially with problems that don’t occur from just initializing a
> sparksession locally or testing with spark-shell. Ideally, we’d like some
> sort of docker container emulating hdfs and spark cluster mode, that you
> can run locally.
>
> Any test framework, tips, or examples people can share? Thanks!
> --
> Cheers,
> Ruijing Li
>
-- 
Cheers,
Ruijing Li


Re: Using Percentile in Spark SQL

2019-11-11 Thread Jerry Vinokurov
I don't think the Spark configuration is what you want to focus on. It's
hard to say without knowing the specifics of the job or the data volume,
but you should be able to accomplish this with the percent_rank function in
SparkSQL and a smart partitioning of the data. If your data has a lot of
skew, you can end up with a situation in which some executors are waiting
around to do work while others are stuck with processing larger partitions,
so you'll need to take a look at the actual stats of your data and figure
out if there's a more efficient partitioning strategy that you can use.
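
A short sketch of that combination (the table, group key, and value column are
placeholders):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.table("events")  # placeholder input table

# Repartition by the window key so each group's rows land together, then rank
# values within the group; a quick count per key exposes any heavy skew.
w = Window.partitionBy("group_key").orderBy("value")
ranked = (df.repartition("group_key")
            .withColumn("pct_rank", F.percent_rank().over(w)))

ranked.groupBy("group_key").count().orderBy(F.desc("count")).show(20)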

On Mon, Nov 11, 2019 at 10:34 AM Tzahi File  wrote:

> Currently, I'm using the percentile approx function with Hive.
> I'm looking for a better way to run this function or another way to get
> the same result with spark, but faster and not using gigantic instances..
>
> I'm trying to optimize this job by changing the Spark configuration. If
> you have any ideas how to approach this, it would be great (like instance
> type, number of instances, number of executers etc.)
>
>
> On Mon, Nov 11, 2019 at 5:16 PM Patrick McCarthy 
> wrote:
>
>> Depending on your tolerance for error you could also use
>> percentile_approx().
>>
>> On Mon, Nov 11, 2019 at 10:14 AM Jerry Vinokurov 
>> wrote:
>>
>>> Do you mean that you are trying to compute the percent rank of some
>>> data? You can use the SparkSQL percent_rank function for that, but I don't
>>> think that's going to give you any improvement over calling the percentRank
>>> function on the data frame. Are you currently using a user-defined function
>>> for this task? Because I bet that's what's slowing you down.
>>>
>>> On Mon, Nov 11, 2019 at 9:46 AM Tzahi File 
>>> wrote:
>>>
 Hi,

 Currently, I'm using hive huge cluster(m5.24xl * 40 workers) to run a
 percentile function. I'm trying to improve this job by moving it to run
 with spark SQL.

 Any suggestions on how to use a percentile function in Spark?


 Thanks,
 --
 Tzahi File
 Data Engineer
 [image: ironSource] 

 email tzahi.f...@ironsrc.com
 mobile +972-546864835
 fax +972-77-5448273
 ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
 ironsrc.com 
 [image: linkedin] [image:
 twitter] [image: facebook]
 [image: googleplus]
 
 This email (including any attachments) is for the sole use of the
 intended recipient and may contain confidential information which may be
 protected by legal privilege. If you are not the intended recipient, or the
 employee or agent responsible for delivering it to the intended recipient,
 you are hereby notified that any use, dissemination, distribution or
 copying of this communication and/or its content is strictly prohibited. If
 you are not the intended recipient, please immediately notify us by reply
 email or by telephone, delete this email and destroy any copies. Thank you.

>>>
>>>
>>> --
>>> http://www.google.com/profiles/grapesmoker
>>>
>>
>>
>> --
>>
>>
>> *Patrick McCarthy  *
>>
>> Senior Data Scientist, Machine Learning Engineering
>>
>> Dstillery
>>
>> 470 Park Ave South, 17th Floor, NYC 10016
>>
>
>
> --
> Tzahi File
> Data Engineer
> [image: ironSource] 
>
> email tzahi.f...@ironsrc.com
> mobile +972-546864835
> fax +972-77-5448273
> ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
> ironsrc.com 
> [image: linkedin] [image:
> twitter] [image: facebook]
> [image: googleplus]
> 
> This email (including any attachments) is for the sole use of the intended
> recipient and may contain confidential information which may be protected
> by legal privilege. If you are not the intended recipient, or the employee
> or agent responsible for delivering it to the intended recipient, you are
> hereby notified that any use, dissemination, distribution or copying of
> this communication and/or its content is strictly prohibited. If you are
> not the intended recipient, please immediately notify us by reply email or
> by telephone, delete this email and destroy any copies. Thank you.
>


-- 
http://www.google.com/profiles/grapesmoker


Re: Using Percentile in Spark SQL

2019-11-11 Thread Tzahi File
Currently, I'm using the percentile_approx function with Hive.
I'm looking for a better way to run this function, or another way to get the
same result with Spark, but faster and without using gigantic instances.

I'm trying to optimize this job by changing the Spark configuration. If you
have any ideas on how to approach this, that would be great (like instance
type, number of instances, number of executors, etc.)


On Mon, Nov 11, 2019 at 5:16 PM Patrick McCarthy 
wrote:

> Depending on your tolerance for error you could also use
> percentile_approx().
>
> On Mon, Nov 11, 2019 at 10:14 AM Jerry Vinokurov 
> wrote:
>
>> Do you mean that you are trying to compute the percent rank of some data?
>> You can use the SparkSQL percent_rank function for that, but I don't think
>> that's going to give you any improvement over calling the percentRank
>> function on the data frame. Are you currently using a user-defined function
>> for this task? Because I bet that's what's slowing you down.
>>
>> On Mon, Nov 11, 2019 at 9:46 AM Tzahi File 
>> wrote:
>>
>>> Hi,
>>>
>>> Currently, I'm using hive huge cluster(m5.24xl * 40 workers) to run a
>>> percentile function. I'm trying to improve this job by moving it to run
>>> with spark SQL.
>>>
>>> Any suggestions on how to use a percentile function in Spark?
>>>
>>>
>>> Thanks,
>>> --
>>> Tzahi File
>>> Data Engineer
>>> [image: ironSource] 
>>>
>>> email tzahi.f...@ironsrc.com
>>> mobile +972-546864835
>>> fax +972-77-5448273
>>> ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
>>> ironsrc.com 
>>> [image: linkedin] [image:
>>> twitter] [image: facebook]
>>> [image: googleplus]
>>> 
>>> This email (including any attachments) is for the sole use of the
>>> intended recipient and may contain confidential information which may be
>>> protected by legal privilege. If you are not the intended recipient, or the
>>> employee or agent responsible for delivering it to the intended recipient,
>>> you are hereby notified that any use, dissemination, distribution or
>>> copying of this communication and/or its content is strictly prohibited. If
>>> you are not the intended recipient, please immediately notify us by reply
>>> email or by telephone, delete this email and destroy any copies. Thank you.
>>>
>>
>>
>> --
>> http://www.google.com/profiles/grapesmoker
>>
>
>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>


-- 
Tzahi File
Data Engineer
[image: ironSource] 

email tzahi.f...@ironsrc.com
mobile +972-546864835
fax +972-77-5448273
ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
ironsrc.com 
[image: linkedin] [image:
twitter] [image: facebook]
[image: googleplus]

This email (including any attachments) is for the sole use of the intended
recipient and may contain confidential information which may be protected
by legal privilege. If you are not the intended recipient, or the employee
or agent responsible for delivering it to the intended recipient, you are
hereby notified that any use, dissemination, distribution or copying of
this communication and/or its content is strictly prohibited. If you are
not the intended recipient, please immediately notify us by reply email or
by telephone, delete this email and destroy any copies. Thank you.


Re: Using Percentile in Spark SQL

2019-11-11 Thread Muthu Jayakumar
If you require higher precision, you may have to write a custom UDAF.
In my case, I ended up storing the data as a key-value ordered list of
histograms.

Thanks
Muthu

On Mon, Nov 11, 2019, 20:46 Patrick McCarthy
 wrote:

> Depending on your tolerance for error you could also use
> percentile_approx().
>
> On Mon, Nov 11, 2019 at 10:14 AM Jerry Vinokurov 
> wrote:
>
>> Do you mean that you are trying to compute the percent rank of some data?
>> You can use the SparkSQL percent_rank function for that, but I don't think
>> that's going to give you any improvement over calling the percentRank
>> function on the data frame. Are you currently using a user-defined function
>> for this task? Because I bet that's what's slowing you down.
>>
>> On Mon, Nov 11, 2019 at 9:46 AM Tzahi File 
>> wrote:
>>
>>> Hi,
>>>
>>> Currently, I'm using hive huge cluster(m5.24xl * 40 workers) to run a
>>> percentile function. I'm trying to improve this job by moving it to run
>>> with spark SQL.
>>>
>>> Any suggestions on how to use a percentile function in Spark?
>>>
>>>
>>> Thanks,
>>> --
>>> Tzahi File
>>> Data Engineer
>>>
>>
>>
>> --
>> http://www.google.com/profiles/grapesmoker
>>
>
>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>


Re: Using Percentile in Spark SQL

2019-11-11 Thread Patrick McCarthy
Depending on your tolerance for error you could also use
percentile_approx().
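
For illustration, a minimal sketch of percentile_approx in Spark SQL; the table and
column names are made up, and the third argument (accuracy) trades memory for error:

// Approximate percentile: a larger accuracy value means lower error but more memory.
// Assumes df.createOrReplaceTempView("events") has been run; names are hypothetical.
spark.sql("""
  SELECT key,
         percentile_approx(latency_ms, 0.95, 10000) AS p95_approx
  FROM events
  GROUP BY key
""").show()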

On Mon, Nov 11, 2019 at 10:14 AM Jerry Vinokurov 
wrote:

> Do you mean that you are trying to compute the percent rank of some data?
> You can use the SparkSQL percent_rank function for that, but I don't think
> that's going to give you any improvement over calling the percentRank
> function on the data frame. Are you currently using a user-defined function
> for this task? Because I bet that's what's slowing you down.
>
> On Mon, Nov 11, 2019 at 9:46 AM Tzahi File  wrote:
>
>> Hi,
>>
>> Currently, I'm using a huge Hive cluster (m5.24xl * 40 workers) to run a
>> percentile function. I'm trying to improve this job by moving it to run
>> with Spark SQL.
>>
>> Any suggestions on how to use a percentile function in Spark?
>>
>>
>> Thanks,
>> --
>> Tzahi File
>> Data Engineer
>>
>
>
> --
> http://www.google.com/profiles/grapesmoker
>


-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016


Re: Using Percentile in Spark SQL

2019-11-11 Thread Jerry Vinokurov
Do you mean that you are trying to compute the percent rank of some data?
You can use the SparkSQL percent_rank function for that, but I don't think
that's going to give you any improvement over calling the percentRank
function on the data frame. Are you currently using a user-defined function
for this task? Because I bet that's what's slowing you down.
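
(Note that percent_rank is a window function: it adds a 0-to-1 relative rank to every
row rather than collapsing a group down to a single percentile value. A minimal sketch
with made-up DataFrame and column names:)

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.percent_rank

// Hypothetical df with columns key and latency_ms; each row gets its relative rank
// within its key partition, ordered by latency_ms.
val w = Window.partitionBy("key").orderBy("latency_ms")
df.withColumn("pct_rank", percent_rank().over(w)).show()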

On Mon, Nov 11, 2019 at 9:46 AM Tzahi File  wrote:

> Hi,
>
> Currently, I'm using a huge Hive cluster (m5.24xl * 40 workers) to run a
> percentile function. I'm trying to improve this job by moving it to run
> with Spark SQL.
>
> Any suggestions on how to use a percentile function in Spark?
>
>
> Thanks,
> --
> Tzahi File
> Data Engineer
>


-- 
http://www.google.com/profiles/grapesmoker


Re: how to get spark-sql lineage

2019-05-16 Thread Arun Mahadevan
You can check out
https://github.com/hortonworks-spark/spark-atlas-connector/
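
For illustration, a minimal sketch of how the connector is typically wired in. The two
config keys are standard Spark settings; the listener class name is taken from that
project's documentation, so verify it against the connector version you actually build:

import org.apache.spark.sql.SparkSession

// spark.extraListeners / spark.sql.queryExecutionListeners are standard Spark confs;
// the connector jar and its atlas-application.properties must be on the classpath.
val spark = SparkSession.builder()
  .appName("etl-with-lineage")
  .config("spark.extraListeners", "com.hortonworks.spark.atlas.SparkAtlasEventTracker")
  .config("spark.sql.queryExecutionListeners",
          "com.hortonworks.spark.atlas.SparkAtlasEventTracker")
  .getOrCreate()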

On Wed, 15 May 2019 at 19:44, lk_spark  wrote:

> hi, all:
> When I use Spark, if I run some SQL to do ETL, how can I get
> lineage info? I found that CDH Spark has some configs about lineage:
> spark.lineage.enabled=true
> spark.lineage.log.dir=/var/log/spark2/lineage
> Do they also work for Apache Spark?
>
> 2019-05-16
> --
> lk_spark
>


Re: how to get spark-sql lineage

2019-05-16 Thread Gabor Somogyi
Hi,

spark.lineage.enabled is Cloudera specific and doesn't work with vanilla
Spark.

BR,
G


On Thu, May 16, 2019 at 4:44 AM lk_spark  wrote:

> hi, all:
> When I use Spark, if I run some SQL to do ETL, how can I get
> lineage info? I found that CDH Spark has some configs about lineage:
> spark.lineage.enabled=true
> spark.lineage.log.dir=/var/log/spark2/lineage
> Do they also work for Apache Spark?
>
> 2019-05-16
> --
> lk_spark
>


Re: [Spark SQL]: Does Spark SQL 2.3+ suppor UDT?

2018-11-26 Thread Suny Tyagi
Thanks and Regards,
Suny Tyagi
Phone No : 9885027192


On Mon, Nov 26, 2018 at 10:31 PM Suny Tyagi  wrote:

> Hi Team,
>
>
> I was going through this ticket
> https://issues.apache.org/jira/browse/SPARK-7768?jql=text%20~%20%22public%20udt%22
>  and
> could not understand whether Spark supports UDTs in version 2.3+ in any
> language (Scala, Python, Java, R)?
>
> I have a class something like this:
>
> class Test {
> string name;
> int age;
> }
>
>
> And my UDF method is:
>
> public Test UDFMethod(string name, int age) {
> Test ob = new Test();
> ob.name = name;
> ob.age = age;
> return ob;
> }
>
> Sample Spark query: `SELECT *, UDFMethod(name, age) FROM SomeTable;`
>
> Now UDFMethod(name, age) will return a Test object. So will this work
> in Spark SQL after using the SQLUserDefinedType annotation and extending the
> UserDefinedType class?
>
> As UserDefinedType is private in Spark 2.0, I just want to know whether UDTs
> are supported in Spark 2.3+. If yes, which is better to use: UserDefinedType or
> UDTRegistration? As of now both are private.
>
>
> Thanks,
>
> Suny Tyagi
>
>
>
>
> Thanks and Regards,
> Suny Tyagi
> Phone No : 9885027192
>
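
For reference, a common workaround while UserDefinedType/UDTRegistration remain
private: return a case class from a Scala UDF, which Spark exposes as a struct column
rather than a true UDT. A minimal sketch (it assumes the SomeTable from the question
exists with a string name column and an int age column):

import org.apache.spark.sql.functions.{col, udf}

case class Test(name: String, age: Int)

// A Scala UDF returning a case class yields a struct<name:string,age:int> column.
// This is not UDT support, just the usual struct-based workaround.
val makeTest = udf((name: String, age: Int) => Test(name, age))

spark.table("SomeTable")
  .withColumn("test", makeTest(col("name"), col("age")))
  .show()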


Re: [Spark SQL] why spark sql hash() are returns the same hash value though the keys/expr are not same

2018-09-28 Thread Thakrar, Jayesh
Not sure I get what you mean….

I ran the query that you had – and don’t get the same hash as you.


From: Gokula Krishnan D 
Date: Friday, September 28, 2018 at 10:40 AM
To: "Thakrar, Jayesh" 
Cc: user 
Subject: Re: [Spark SQL] why spark sql hash() are returns the same hash value 
though the keys/expr are not same

Hello Jayesh,

I have masked the input values with .


Thanks & Regards,
Gokula Krishnan (Gokul)


On Wed, Sep 26, 2018 at 2:20 PM Thakrar, Jayesh 
mailto:jthak...@conversantmedia.com>> wrote:
Cannot reproduce your situation.
Can you share Spark version?

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select hash('40514X'),hash('41751')").show()
+------------+-----------+
|hash(40514X)|hash(41751)|
+------------+-----------+
| -1898845883|  916273350|
+------------+-----------+


scala> spark.sql("select hash('14589'),hash('40004')").show()
+-----------+-----------+
|hash(14589)|hash(40004)|
+-----------+-----------+
|  777096871|-1593820563|
+-----------+-----------+


scala>

From: Gokula Krishnan D mailto:email2...@gmail.com>>
Date: Tuesday, September 25, 2018 at 8:57 PM
To: user mailto:user@spark.apache.org>>
Subject: [Spark SQL] why spark sql hash() are returns the same hash value 
though the keys/expr are not same

Hello All,

I am calculating the hash value of a few columns and determining whether it's an
Insert/Delete/Update record, but found a scenario which is a little weird, since
some of the records return the same hash value though the keys are totally
different.

For the instance,


scala> spark.sql("select hash('40514X'),hash('41751')").show()

+-----------+-----------+
|hash(40514)|hash(41751)|
+-----------+-----------+
|  976573657|  976573657|
+-----------+-----------+


scala> spark.sql("select hash('14589'),hash('40004')").show()

+-----------+-----------+
|hash(14589)|hash(40004)|
+-----------+-----------+
|  777096871|  777096871|
+-----------+-----------+
I do understand that hash() returns an integer; have these reached the max
value?

Thanks & Regards,
Gokula Krishnan (Gokul)
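
For reference, Spark's hash() is a 32-bit Murmur3 hash, so collisions are expected
once enough distinct keys flow through it (pigeonhole principle, not an overflow). If
the goal is insert/update/delete detection, a wider hash over the concatenated key
columns is a common choice. A minimal sketch with made-up table and column names
(newer Spark versions also offer xxhash64 for a 64-bit hash):

// sha2 and concat_ws are built-in Spark SQL functions; the separator keeps
// ('ab','c') and ('a','bc') from producing the same input string.
// Cast non-string key columns to string first if needed.
spark.sql("""
  SELECT id,
         sha2(concat_ws('||', key_col1, key_col2, key_col3), 256) AS row_fingerprint
  FROM some_table
""").show(false)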


Re: [Spark SQL] why spark sql hash() are returns the same hash value though the keys/expr are not same

2018-09-28 Thread Gokula Krishnan D
Hello Jayesh,

I have masked the input values with .


Thanks & Regards,
Gokula Krishnan* (Gokul)*


On Wed, Sep 26, 2018 at 2:20 PM Thakrar, Jayesh <
jthak...@conversantmedia.com> wrote:

> Cannot reproduce your situation.
>
> Can you share Spark version?
>
>
>
> Welcome to
>
>     __
>
>  / __/__  ___ _/ /__
>
> _\ \/ _ \/ _ `/ __/  '_/
>
>/___/ .__/\_,_/_/ /_/\_\   version 2.2.0
>
>   /_/
>
>
>
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.8.0_92)
>
> Type in expressions to have them evaluated.
>
> Type :help for more information.
>
>
>
> scala> spark.sql("select hash('40514X'),hash('41751')").show()
>
> ++---+
>
> |hash(40514X)|hash(41751)|
>
> ++---+
>
> | -1898845883|  916273350|
>
> ++---+
>
>
>
>
>
> scala> spark.sql("select hash('14589'),hash('40004')").show()
>
> +---+---+
>
> |hash(14589)|hash(40004)|
>
> +---+---+
>
> |  777096871|-1593820563|
>
> +---+---+
>
>
>
>
>
> scala>
>
>
>
> *From: *Gokula Krishnan D 
> *Date: *Tuesday, September 25, 2018 at 8:57 PM
> *To: *user 
> *Subject: *[Spark SQL] why spark sql hash() are returns the same hash
> value though the keys/expr are not same
>
>
>
> Hello All,
>
>
>
> I am calculating the hash value of a few columns and determining whether
> it's an Insert/Delete/Update record, but found a scenario which is a little
> weird, since some of the records return the same hash value though the keys
> are totally different.
>
>
>
> For the instance,
>
>
>
> scala> spark.sql("select hash('40514X'),hash('41751')").show()
>
> +---+---+
>
> |hash(40514)|hash(41751)|
>
> +---+---+
>
> |  976573657|  976573657|
>
> +---+---+
>
>
>
> scala> spark.sql("select hash('14589'),hash('40004')").show()
>
> +---+---+
>
> |hash(14589)|hash(40004)|
>
> +---+---+
>
> |  777096871|  777096871|
>
> +---+---+
>
> I do understand that hash() returns an integer; have these reached the max
> value?
>
>
>
> Thanks & Regards,
>
> Gokula Krishnan* (Gokul)*
>


Re: [Spark SQL] why spark sql hash() are returns the same hash value though the keys/expr are not same

2018-09-26 Thread Thakrar, Jayesh
Cannot reproduce your situation.
Can you share Spark version?

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select hash('40514X'),hash('41751')").show()
+------------+-----------+
|hash(40514X)|hash(41751)|
+------------+-----------+
| -1898845883|  916273350|
+------------+-----------+


scala> spark.sql("select hash('14589'),hash('40004')").show()
+-----------+-----------+
|hash(14589)|hash(40004)|
+-----------+-----------+
|  777096871|-1593820563|
+-----------+-----------+


scala>

From: Gokula Krishnan D 
Date: Tuesday, September 25, 2018 at 8:57 PM
To: user 
Subject: [Spark SQL] why spark sql hash() are returns the same hash value 
though the keys/expr are not same

Hello All,

I am calculating the hash value of a few columns and determining whether it's an
Insert/Delete/Update record, but found a scenario which is a little weird, since
some of the records return the same hash value though the keys are totally
different.

For the instance,


scala> spark.sql("select hash('40514X'),hash('41751')").show()

+-----------+-----------+
|hash(40514)|hash(41751)|
+-----------+-----------+
|  976573657|  976573657|
+-----------+-----------+


scala> spark.sql("select hash('14589'),hash('40004')").show()

+-----------+-----------+
|hash(14589)|hash(40004)|
+-----------+-----------+
|  777096871|  777096871|
+-----------+-----------+
I do understand that hash() returns an integer; have these reached the max
value?

Thanks & Regards,
Gokula Krishnan (Gokul)


Re: [SPARK-SQL] Does spark-sql have Authorization built in?

2017-09-18 Thread Arun Khetarpal
Ping. 

I did some digging around in the code base - I see that this is not present 
currently. Just looking for an acknowledgement

Regards,
Arun


> On 15-Sep-2017, at 8:43 PM, Arun Khetarpal  wrote:
> 
> Hi - 
> 
> Wanted to understand if spark sql has GRANT and REVOKE statements available? 
> Is anyone working on making that available? 
> 
> Regards,
> Arun


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [SPARK-SQL] Does spark-sql have Authorization built in?

2017-09-16 Thread Jörn Franke
It depends on the permissions the user has on the local file system or HDFS, so 
there is no need to have grant/revoke.

> On 15. Sep 2017, at 17:13, Arun Khetarpal  wrote:
> 
> Hi - 
> 
> Wanted to understand if spark sql has GRANT and REVOKE statements available? 
> Is anyone working on making that available? 
> 
> Regards,
> Arun
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [SPARK-SQL] Does spark-sql have Authorization built in?

2017-09-16 Thread Akhil Das
I guess no. I came across a test case where they are marked as Unsupported,
you can see it here.

However,
the one running inside Databricks has support for this.
https://docs.databricks.com/spark/latest/spark-sql/structured-data-access-controls.html

On Fri, Sep 15, 2017 at 10:13 PM, Arun Khetarpal 
wrote:

> Hi -
>
> Wanted to understand if spark sql has GRANT and REVOKE statements
> available?
> Is anyone working on making that available?
>
> Regards,
> Arun
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Cheers!


Re: Apache Drill vs Spark SQL

2017-04-07 Thread Pierce Lamb
Hi Kant,

If you are interested in using Spark alongside a database to serve real
time queries, there are many options. Almost every popular database has
built some sort of connector to Spark. I've listed a majority of them and
tried to delineate them in some way in this StackOverflow answer:

http://stackoverflow.com/a/39753976/3723346

As an employee of SnappyData ,
I'm biased toward its solution, in which Spark and the database are deeply
integrated and run on the same JVM. But there are many options depending on
your needs.

I'm not sure if the above link also answers your second question, but there
are two graph databases listed that connect to Spark as well.

Hope this helps,

Pierce

On Thu, Apr 6, 2017 at 10:34 PM, kant kodali  wrote:

> Hi All,
>
> I am very impressed with the work done on Spark SQL however when I have to
> pick something to serve real time queries I am in a dilemma for the
> following reasons.
>
> 1. Even though Spark SQL has logical plans, physical plans, runtime code
> generation and all that, it still doesn't look like the tool to serve
> real-time queries the way we normally do from a database. I tend to think this
> is because the queries have to go through job submission first. I don't want
> to call this overhead or anything, but this is what it seems to do. Comparing
> this, on the other hand we have the data that we want to serve sitting in a
> database where we simply issue a SQL query and get the response back, so for
> this use case what would be an appropriate tool? I tend to think it's Drill,
> but would like to hear if there are any interesting arguments.
>
> 2. I can see a case for Spark SQL for queries that need to be expressed in
> an iterative fashion. For example, doing a graph traversal such as BFS or DFS,
> or even simple pre-order, in-order, and post-order traversals on a BST. All of
> this would be very hard to express in a declarative syntax like SQL. I also
> tend to think ad-hoc distributed joins (by ad-hoc I mean one is not certain
> about their query patterns) are better off expressed in map-reduce style than
> in SQL, unless one knows their query patterns well enough ahead of time that
> the possibility of queries requiring redistribution is low. I am also sure
> there are plenty of other cases where Spark SQL will excel, but I wanted to
> see what is a good choice for simply serving the data.
>
> Any suggestions are appreciated.
>
> Thanks!
>
>


Re: Un-exploding / denormalizing Spark SQL help

2017-02-08 Thread Everett Anderson
On Wed, Feb 8, 2017 at 1:14 PM, ayan guha  wrote:

> Will a SQL solution be acceptable?
>

I'm very curious to see how it'd be done in raw SQL if you're up for it!

I think the 2 programmatic solutions so far are viable, though, too.

(By the way, thanks everyone for the great suggestions!)
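
In case it's useful, here is roughly how the same reshaping could look in raw Spark
SQL: the same dense_rank-plus-conditional-aggregation idea as the programmatic
versions, just written as a query over a temp view (a sketch, not tested against the
thread's exact data).

// 'data' is the DataFrame built at the end of the original message.
data.createOrReplaceTempView("input")

spark.sql("""
  SELECT id, name,
         MAX(CASE WHEN rnk = 1 THEN extra    END) AS extra1,
         MAX(CASE WHEN rnk = 1 THEN data     END) AS data1,
         MAX(CASE WHEN rnk = 1 THEN priority END) AS priority1,
         MAX(CASE WHEN rnk = 2 THEN extra    END) AS extra2,
         MAX(CASE WHEN rnk = 2 THEN data     END) AS data2,
         MAX(CASE WHEN rnk = 2 THEN priority END) AS priority2,
         MAX(CASE WHEN rnk = 3 THEN extra    END) AS extra3,
         MAX(CASE WHEN rnk = 3 THEN data     END) AS data3,
         MAX(CASE WHEN rnk = 3 THEN priority END) AS priority3
  FROM (
    SELECT *, DENSE_RANK() OVER (PARTITION BY id, name ORDER BY priority) AS rnk
    FROM input
  ) ranked
  GROUP BY id, name
""").show()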





>
> On Thu, 9 Feb 2017 at 4:01 am, Xiaomeng Wan  wrote:
>
>> You could also try pivot.
>>
>> On 7 February 2017 at 16:13, Everett Anderson 
>> wrote:
>>
>>
>>
>> On Tue, Feb 7, 2017 at 2:21 PM, Michael Armbrust 
>> wrote:
>>
>> I think the fastest way is likely to use a combination of conditionals
>> (when / otherwise), first (ignoring nulls), while grouping by the id.
>> This should get the answer with only a single shuffle.
>>
>> Here is an example
>> 
>> .
>>
>>
>> Very cool! Using the simpler aggregates feels cleaner.
>>
>>
>>
>> On Tue, Feb 7, 2017 at 5:07 PM, Jacek Laskowski  wrote:
>>
>> Hi Everett,
>>
>> That's pretty much what I'd do. Can't think of a way to beat your
>> solution. Why do you "feel vaguely uneasy about it"?
>>
>>
>> Maybe it felt like I was unnecessarily grouping-by twice, but probably
>> mostly that I hadn't used pivot before.
>>
>> Interestingly, the physical plans are not especially different between
>> these two solutions after the rank column is added. They both have two
>> SortAggregates that seem to be figuring out where to put results based on
>> the rank:
>>
>> My original one:
>>
>> == Physical Plan ==
>> *Project [id#279, name#280, 1#372.extra AS extra1#409, 1#372.data AS
>> data1#435, 1#372.priority AS priority1#462, 2#374.extra AS extra2#490,
>> 2#374.data AS data2#519, 2#374.priority AS priority2#549, 3#376.extra AS
>> extra3#580, 3#376.data AS data3#612, 3#376.priority AS priority3#645]
>> +- SortAggregate(key=[id#279,name#280], functions=[first(if
>> ((cast(rank#292 as double) = 1.0)) temp_struct#312 else null,
>> true),first(if ((cast(rank#292 as double) = 2.0)) temp_struct#312 else
>> null, true),first(if ((cast(rank#292 as double) = 3.0)) temp_struct#312
>> else null, true)])
>>+- SortAggregate(key=[id#279,name#280], functions=[partial_first(if
>> ((cast(rank#292 as double) = 1.0)) temp_struct#312 else null,
>> true),partial_first(if ((cast(rank#292 as double) = 2.0)) temp_struct#312
>> else null, true),partial_first(if ((cast(rank#292 as double) = 3.0))
>> temp_struct#312 else null, true)])
>>   +- *Project [id#279, name#280, rank#292, struct(extra#281,
>> data#282, priority#283) AS temp_struct#312]
>>  +- Window [denserank(priority#283) windowspecdefinition(id#279,
>> name#280, priority#283 ASC, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
>> ROW) AS rank#292], [id#279, name#280], [priority#283 ASC]
>> +- *Sort [id#279 ASC, name#280 ASC, priority#283 ASC], false,
>> 0
>>+- Exchange hashpartitioning(id#279, name#280, 200)
>>   +- Scan ExistingRDD[id#279,name#280,
>> extra#281,data#282,priority#283]
>>
>>
>> And modifying Michael's slightly to use a rank:
>>
>> import org.apache.spark.sql.functions._
>>
>> def getColumnWithRank(column: String, rank: Int) = {
>>   first(when(col("rank") === lit(rank), col(column)).otherwise(null),
>> ignoreNulls = true)
>> }
>>
>> val withRankColumn = data.withColumn("rank", 
>> functions.dense_rank().over(Window.partitionBy("id",
>> "name").orderBy("priority")))
>>
>> val modCollapsed = withRankColumn
>>   .groupBy($"id", $"name")
>>   .agg(
>> getColumnWithRank("data", 1) as 'data1,
>> getColumnWithRank("data", 2) as 'data2,
>> getColumnWithRank("data", 3) as 'data3,
>> getColumnWithRank("extra", 1) as 'extra1,
>> getColumnWithRank("extra", 2) as 'extra2,
>> getColumnWithRank("extra", 3) as 'extra3)
>>
>>
>> modCollapsed.explain
>>
>> == Physical Plan ==
>> SortAggregate(key=[id#279,name#280], functions=[first(CASE WHEN
>> (rank#965 = 1) THEN data#282 ELSE null END, true),first(CASE WHEN (rank#965
>> = 2) THEN data#282 ELSE null END, true),first(CASE WHEN (rank#965 = 3) THEN
>> data#282 ELSE null END, true),first(CASE WHEN (rank#965 = 1) THEN extra#281
>> ELSE null END, true),first(CASE WHEN (rank#965 = 2) THEN extra#281 ELSE
>> null END, true),first(CASE WHEN (rank#965 = 3) THEN extra#281 ELSE null
>> END, true)])
>> +- SortAggregate(key=[id#279,name#280], functions=[partial_first(CASE
>> WHEN (rank#965 = 1) THEN data#282 ELSE null END, true),partial_first(CASE
>> WHEN (rank#965 = 2) THEN data#282 ELSE null END, true),partial_first(CASE
>> WHEN (rank#965 = 3) THEN data#282 ELSE null END, true),partial_first(CASE
>> WHEN (rank#965 = 1) THEN extra#281 ELSE null END, true),partial_first(CASE
>> WHEN (rank#965 = 2) THEN extra#281 ELSE null END, true),partial_first(CASE
>> WHEN (rank#965 = 

Re: Un-exploding / denormalizing Spark SQL help

2017-02-08 Thread ayan guha
Will a SQL solution be acceptable?
On Thu, 9 Feb 2017 at 4:01 am, Xiaomeng Wan  wrote:

> You could also try pivot.
>
> On 7 February 2017 at 16:13, Everett Anderson 
> wrote:
>
>
>
> On Tue, Feb 7, 2017 at 2:21 PM, Michael Armbrust 
> wrote:
>
> I think the fastest way is likely to use a combination of conditionals
> (when / otherwise), first (ignoring nulls), while grouping by the id.
> This should get the answer with only a single shuffle.
>
> Here is an example
> 
> .
>
>
> Very cool! Using the simpler aggregates feels cleaner.
>
>
>
> On Tue, Feb 7, 2017 at 5:07 PM, Jacek Laskowski  wrote:
>
> Hi Everett,
>
> That's pretty much what I'd do. Can't think of a way to beat your
> solution. Why do you "feel vaguely uneasy about it"?
>
>
> Maybe it felt like I was unnecessarily grouping-by twice, but probably
> mostly that I hadn't used pivot before.
>
> Interestingly, the physical plans are not especially different between
> these two solutions after the rank column is added. They both have two
> SortAggregates that seem to be figuring out where to put results based on
> the rank:
>
> My original one:
>
> == Physical Plan ==
> *Project [id#279, name#280, 1#372.extra AS extra1#409, 1#372.data AS
> data1#435, 1#372.priority AS priority1#462, 2#374.extra AS extra2#490,
> 2#374.data AS data2#519, 2#374.priority AS priority2#549, 3#376.extra AS
> extra3#580, 3#376.data AS data3#612, 3#376.priority AS priority3#645]
> +- SortAggregate(key=[id#279,name#280], functions=[first(if
> ((cast(rank#292 as double) = 1.0)) temp_struct#312 else null,
> true),first(if ((cast(rank#292 as double) = 2.0)) temp_struct#312 else
> null, true),first(if ((cast(rank#292 as double) = 3.0)) temp_struct#312
> else null, true)])
>+- SortAggregate(key=[id#279,name#280], functions=[partial_first(if
> ((cast(rank#292 as double) = 1.0)) temp_struct#312 else null,
> true),partial_first(if ((cast(rank#292 as double) = 2.0)) temp_struct#312
> else null, true),partial_first(if ((cast(rank#292 as double) = 3.0))
> temp_struct#312 else null, true)])
>   +- *Project [id#279, name#280, rank#292, struct(extra#281, data#282,
> priority#283) AS temp_struct#312]
>  +- Window [denserank(priority#283) windowspecdefinition(id#279,
> name#280, priority#283 ASC, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
> ROW) AS rank#292], [id#279, name#280], [priority#283 ASC]
> +- *Sort [id#279 ASC, name#280 ASC, priority#283 ASC], false, 0
>+- Exchange hashpartitioning(id#279, name#280, 200)
>   +- Scan
> ExistingRDD[id#279,name#280,extra#281,data#282,priority#283]
>
>
> And modifying Michael's slightly to use a rank:
>
> import org.apache.spark.sql.functions._
>
> def getColumnWithRank(column: String, rank: Int) = {
>   first(when(col("rank") === lit(rank), col(column)).otherwise(null),
> ignoreNulls = true)
> }
>
> val withRankColumn = data.withColumn("rank",
> functions.dense_rank().over(Window.partitionBy("id",
> "name").orderBy("priority")))
>
> val modCollapsed = withRankColumn
>   .groupBy($"id", $"name")
>   .agg(
> getColumnWithRank("data", 1) as 'data1,
> getColumnWithRank("data", 2) as 'data2,
> getColumnWithRank("data", 3) as 'data3,
> getColumnWithRank("extra", 1) as 'extra1,
> getColumnWithRank("extra", 2) as 'extra2,
> getColumnWithRank("extra", 3) as 'extra3)
>
>
> modCollapsed.explain
>
> == Physical Plan ==
> SortAggregate(key=[id#279,name#280], functions=[first(CASE WHEN (rank#965
> = 1) THEN data#282 ELSE null END, true),first(CASE WHEN (rank#965 = 2) THEN
> data#282 ELSE null END, true),first(CASE WHEN (rank#965 = 3) THEN data#282
> ELSE null END, true),first(CASE WHEN (rank#965 = 1) THEN extra#281 ELSE
> null END, true),first(CASE WHEN (rank#965 = 2) THEN extra#281 ELSE null
> END, true),first(CASE WHEN (rank#965 = 3) THEN extra#281 ELSE null END,
> true)])
> +- SortAggregate(key=[id#279,name#280], functions=[partial_first(CASE WHEN
> (rank#965 = 1) THEN data#282 ELSE null END, true),partial_first(CASE WHEN
> (rank#965 = 2) THEN data#282 ELSE null END, true),partial_first(CASE WHEN
> (rank#965 = 3) THEN data#282 ELSE null END, true),partial_first(CASE WHEN
> (rank#965 = 1) THEN extra#281 ELSE null END, true),partial_first(CASE WHEN
> (rank#965 = 2) THEN extra#281 ELSE null END, true),partial_first(CASE WHEN
> (rank#965 = 3) THEN extra#281 ELSE null END, true)])
>+- *Project [id#279, name#280, extra#281, data#282, rank#965]
>   +- Window [denserank(priority#283) windowspecdefinition(id#279,
> name#280, priority#283 ASC, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
> ROW) AS rank#965], [id#279, name#280], [priority#283 ASC]
>  +- *Sort [id#279 ASC, name#280 ASC, priority#283 ASC], false, 0
>   

Re: Un-exploding / denormalizing Spark SQL help

2017-02-08 Thread Xiaomeng Wan
You could also try pivot.

On 7 February 2017 at 16:13, Everett Anderson 
wrote:

>
>
> On Tue, Feb 7, 2017 at 2:21 PM, Michael Armbrust 
> wrote:
>
>> I think the fastest way is likely to use a combination of conditionals
>> (when / otherwise), first (ignoring nulls), while grouping by the id.
>> This should get the answer with only a single shuffle.
>>
>> Here is an example
>> 
>> .
>>
>
> Very cool! Using the simpler aggregates feels cleaner.
>
>
>>
>> On Tue, Feb 7, 2017 at 5:07 PM, Jacek Laskowski  wrote:
>>
>>> Hi Everett,
>>>
>>> That's pretty much what I'd do. Can't think of a way to beat your
>>> solution. Why do you "feel vaguely uneasy about it"?
>>>
>>
> Maybe it felt like I was unnecessarily grouping-by twice, but probably
> mostly that I hadn't used pivot before.
>
> Interestingly, the physical plans are not especially different between
> these two solutions after the rank column is added. They both have two
> SortAggregates that seem to be figuring out where to put results based on
> the rank:
>
> My original one:
>
> == Physical Plan ==
> *Project [id#279, name#280, 1#372.extra AS extra1#409, 1#372.data AS
> data1#435, 1#372.priority AS priority1#462, 2#374.extra AS extra2#490,
> 2#374.data AS data2#519, 2#374.priority AS priority2#549, 3#376.extra AS
> extra3#580, 3#376.data AS data3#612, 3#376.priority AS priority3#645]
> +- SortAggregate(key=[id#279,name#280], functions=[first(if
> ((cast(rank#292 as double) = 1.0)) temp_struct#312 else null,
> true),first(if ((cast(rank#292 as double) = 2.0)) temp_struct#312 else
> null, true),first(if ((cast(rank#292 as double) = 3.0)) temp_struct#312
> else null, true)])
>+- SortAggregate(key=[id#279,name#280], functions=[partial_first(if
> ((cast(rank#292 as double) = 1.0)) temp_struct#312 else null,
> true),partial_first(if ((cast(rank#292 as double) = 2.0)) temp_struct#312
> else null, true),partial_first(if ((cast(rank#292 as double) = 3.0))
> temp_struct#312 else null, true)])
>   +- *Project [id#279, name#280, rank#292, struct(extra#281, data#282,
> priority#283) AS temp_struct#312]
>  +- Window [denserank(priority#283) windowspecdefinition(id#279,
> name#280, priority#283 ASC, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
> ROW) AS rank#292], [id#279, name#280], [priority#283 ASC]
> +- *Sort [id#279 ASC, name#280 ASC, priority#283 ASC], false, 0
>+- Exchange hashpartitioning(id#279, name#280, 200)
>   +- Scan ExistingRDD[id#279,name#280,
> extra#281,data#282,priority#283]
>
>
> And modifying Michael's slightly to use a rank:
>
> import org.apache.spark.sql.functions._
>
> def getColumnWithRank(column: String, rank: Int) = {
>   first(when(col("rank") === lit(rank), col(column)).otherwise(null),
> ignoreNulls = true)
> }
>
> val withRankColumn = data.withColumn("rank", 
> functions.dense_rank().over(Window.partitionBy("id",
> "name").orderBy("priority")))
>
> val modCollapsed = withRankColumn
>   .groupBy($"id", $"name")
>   .agg(
> getColumnWithRank("data", 1) as 'data1,
> getColumnWithRank("data", 2) as 'data2,
> getColumnWithRank("data", 3) as 'data3,
> getColumnWithRank("extra", 1) as 'extra1,
> getColumnWithRank("extra", 2) as 'extra2,
> getColumnWithRank("extra", 3) as 'extra3)
>
>
> modCollapsed.explain
>
> == Physical Plan ==
> SortAggregate(key=[id#279,name#280], functions=[first(CASE WHEN (rank#965
> = 1) THEN data#282 ELSE null END, true),first(CASE WHEN (rank#965 = 2) THEN
> data#282 ELSE null END, true),first(CASE WHEN (rank#965 = 3) THEN data#282
> ELSE null END, true),first(CASE WHEN (rank#965 = 1) THEN extra#281 ELSE
> null END, true),first(CASE WHEN (rank#965 = 2) THEN extra#281 ELSE null
> END, true),first(CASE WHEN (rank#965 = 3) THEN extra#281 ELSE null END,
> true)])
> +- SortAggregate(key=[id#279,name#280], functions=[partial_first(CASE
> WHEN (rank#965 = 1) THEN data#282 ELSE null END, true),partial_first(CASE
> WHEN (rank#965 = 2) THEN data#282 ELSE null END, true),partial_first(CASE
> WHEN (rank#965 = 3) THEN data#282 ELSE null END, true),partial_first(CASE
> WHEN (rank#965 = 1) THEN extra#281 ELSE null END, true),partial_first(CASE
> WHEN (rank#965 = 2) THEN extra#281 ELSE null END, true),partial_first(CASE
> WHEN (rank#965 = 3) THEN extra#281 ELSE null END, true)])
>+- *Project [id#279, name#280, extra#281, data#282, rank#965]
>   +- Window [denserank(priority#283) windowspecdefinition(id#279,
> name#280, priority#283 ASC, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
> ROW) AS rank#965], [id#279, name#280], [priority#283 ASC]
>  +- *Sort [id#279 ASC, name#280 ASC, priority#283 ASC], false, 0
> +- Exchange hashpartitioning(id#279, name#280, 200)
>+- Scan 

Re: Un-exploding / denormalizing Spark SQL help

2017-02-07 Thread Everett Anderson
On Tue, Feb 7, 2017 at 2:21 PM, Michael Armbrust 
wrote:

> I think the fastest way is likely to use a combination of conditionals
> (when / otherwise), first (ignoring nulls), while grouping by the id.
> This should get the answer with only a single shuffle.
>
> Here is an example
> 
> .
>

Very cool! Using the simpler aggregates feels cleaner.


>
> On Tue, Feb 7, 2017 at 5:07 PM, Jacek Laskowski  wrote:
>
>> Hi Everett,
>>
>> That's pretty much what I'd do. Can't think of a way to beat your
>> solution. Why do you "feel vaguely uneasy about it"?
>>
>
Maybe it felt like I was unnecessarily grouping-by twice, but probably
mostly that I hadn't used pivot before.

Interestingly, the physical plans are not especially different between
these two solutions after the rank column is added. They both have two
SortAggregates that seem to be figuring out where to put results based on
the rank:

My original one:

== Physical Plan ==
*Project [id#279, name#280, 1#372.extra AS extra1#409, 1#372.data AS
data1#435, 1#372.priority AS priority1#462, 2#374.extra AS extra2#490,
2#374.data AS data2#519, 2#374.priority AS priority2#549, 3#376.extra AS
extra3#580, 3#376.data AS data3#612, 3#376.priority AS priority3#645]
+- SortAggregate(key=[id#279,name#280], functions=[first(if ((cast(rank#292
as double) = 1.0)) temp_struct#312 else null, true),first(if
((cast(rank#292 as double) = 2.0)) temp_struct#312 else null,
true),first(if ((cast(rank#292 as double) = 3.0)) temp_struct#312 else
null, true)])
   +- SortAggregate(key=[id#279,name#280], functions=[partial_first(if
((cast(rank#292 as double) = 1.0)) temp_struct#312 else null,
true),partial_first(if ((cast(rank#292 as double) = 2.0)) temp_struct#312
else null, true),partial_first(if ((cast(rank#292 as double) = 3.0))
temp_struct#312 else null, true)])
  +- *Project [id#279, name#280, rank#292, struct(extra#281, data#282,
priority#283) AS temp_struct#312]
 +- Window [denserank(priority#283) windowspecdefinition(id#279,
name#280, priority#283 ASC, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
ROW) AS rank#292], [id#279, name#280], [priority#283 ASC]
+- *Sort [id#279 ASC, name#280 ASC, priority#283 ASC], false, 0
   +- Exchange hashpartitioning(id#279, name#280, 200)
  +- Scan
ExistingRDD[id#279,name#280,extra#281,data#282,priority#283]


And modifying Michael's slightly to use a rank:

import org.apache.spark.sql.functions._

def getColumnWithRank(column: String, rank: Int) = {
  first(when(col("rank") === lit(rank), col(column)).otherwise(null),
ignoreNulls = true)
}

val withRankColumn = data.withColumn("rank",
functions.dense_rank().over(Window.partitionBy("id",
"name").orderBy("priority")))

val modCollapsed = withRankColumn
  .groupBy($"id", $"name")
  .agg(
getColumnWithRank("data", 1) as 'data1,
getColumnWithRank("data", 2) as 'data2,
getColumnWithRank("data", 3) as 'data3,
getColumnWithRank("extra", 1) as 'extra1,
getColumnWithRank("extra", 2) as 'extra2,
getColumnWithRank("extra", 3) as 'extra3)


modCollapsed.explain

== Physical Plan ==
SortAggregate(key=[id#279,name#280], functions=[first(CASE WHEN (rank#965 =
1) THEN data#282 ELSE null END, true),first(CASE WHEN (rank#965 = 2) THEN
data#282 ELSE null END, true),first(CASE WHEN (rank#965 = 3) THEN data#282
ELSE null END, true),first(CASE WHEN (rank#965 = 1) THEN extra#281 ELSE
null END, true),first(CASE WHEN (rank#965 = 2) THEN extra#281 ELSE null
END, true),first(CASE WHEN (rank#965 = 3) THEN extra#281 ELSE null END,
true)])
+- SortAggregate(key=[id#279,name#280], functions=[partial_first(CASE WHEN
(rank#965 = 1) THEN data#282 ELSE null END, true),partial_first(CASE WHEN
(rank#965 = 2) THEN data#282 ELSE null END, true),partial_first(CASE WHEN
(rank#965 = 3) THEN data#282 ELSE null END, true),partial_first(CASE WHEN
(rank#965 = 1) THEN extra#281 ELSE null END, true),partial_first(CASE WHEN
(rank#965 = 2) THEN extra#281 ELSE null END, true),partial_first(CASE WHEN
(rank#965 = 3) THEN extra#281 ELSE null END, true)])
   +- *Project [id#279, name#280, extra#281, data#282, rank#965]
  +- Window [denserank(priority#283) windowspecdefinition(id#279,
name#280, priority#283 ASC, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
ROW) AS rank#965], [id#279, name#280], [priority#283 ASC]
 +- *Sort [id#279 ASC, name#280 ASC, priority#283 ASC], false, 0
+- Exchange hashpartitioning(id#279, name#280, 200)
   +- Scan
ExistingRDD[id#279,name#280,extra#281,data#282,priority#283]



>
>> I'd also check out the execution plan (with explain) to see how it's
>> gonna work at runtime. I may have seen groupBy + join be better than
>> window (there were more exchanges in play for windows I reckon).
>>
>> Pozdrawiam,
>> Jacek Laskowski

Re: Un-exploding / denormalizing Spark SQL help

2017-02-07 Thread Michael Armbrust
I think the fastest way is likely to use a combination of conditionals
(when / otherwise), first (ignoring nulls), while grouping by the id.  This
should get the answer with only a single shuffle.

Here is an example

.

On Tue, Feb 7, 2017 at 5:07 PM, Jacek Laskowski  wrote:

> Hi Everett,
>
> That's pretty much what I'd do. Can't think of a way to beat your
> solution. Why do you "feel vaguely uneasy about it"?
>
> I'd also check out the execution plan (with explain) to see how it's
> gonna work at runtime. I may have seen groupBy + join be better than
> window (there were more exchanges in play for windows I reckon).
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Feb 7, 2017 at 10:54 PM, Everett Anderson 
> wrote:
> >
> >
> > On Tue, Feb 7, 2017 at 12:50 PM, Jacek Laskowski 
> wrote:
> >>
> >> Hi,
> >>
> >> Could groupBy and withColumn or UDAF work perhaps? I think window could
> >> help here too.
> >
> >
> > This seems to work, but I do feel vaguely uneasy about it. :)
> >
> > // First add a 'rank' column which is priority order just in case
> priorities
> > aren't
> > // from 1 with no gaps.
> > val temp1 = data.withColumn("rank", functions.dense_rank()
> >.over(Window.partitionBy("id", "name").orderBy("priority")))
> >
> > +---++-+--+++
> > | id|name|extra|  data|priority|rank|
> > +---++-+--+++
> > |  1|Fred|8|value1|   1|   1|
> > |  1|Fred|8|value8|   2|   2|
> > |  1|Fred|8|value5|   3|   3|
> > |  2| Amy|9|value3|   1|   1|
> > |  2| Amy|9|value5|   2|   2|
> > +---++-+--+++
> >
> > // Now move all the columns we want to denormalize into a struct column
> to
> > keep them together.
> > val temp2 = temp1.withColumn("temp_struct", struct(temp1("extra"),
> > temp1("data"), temp1("priority")))
> >   .drop("extra", "data", "priority")
> >
> > +---++++
> > | id|name|rank| temp_struct|
> > +---++++
> > |  1|Fred|   1|[8,value1,1]|
> > |  1|Fred|   2|[8,value8,2]|
> > |  1|Fred|   3|[8,value5,3]|
> > |  2| Amy|   1|[9,value3,1]|
> > |  2| Amy|   2|[9,value5,2]|
> > +---++++
> >
> > // groupBy, again, but now pivot the rank column. We need an aggregate
> > function after pivot,
> > // so use first -- there will only ever be one element.
> > val temp3 = temp2.groupBy("id", "name")
> >   .pivot("rank", Seq("1", "2", "3"))
> >   .agg(functions.first("temp_struct"))
> >
> > +---+++++
> > | id|name|   1|   2|   3|
> > +---+++++
> > |  1|Fred|[8,value1,1]|[8,value8,2]|[8,value5,3]|
> > |  2| Amy|[9,value3,1]|[9,value5,2]|null|
> > +---+++++
> >
> > // Now just moving things out of the structs and clean up.
> > val output = temp3.withColumn("extra1", temp3("1").getField("extra"))
> >  .withColumn("data1", temp3("1").getField("data"))
> >  .withColumn("priority1", temp3("1").getField("priority"))
> >  .withColumn("extra2", temp3("2").getField("extra"))
> >  .withColumn("data2", temp3("2").getField("data"))
> >  .withColumn("priority2", temp3("2").getField("priority"))
> >  .withColumn("extra3", temp3("3").getField("extra"))
> >  .withColumn("data3", temp3("3").getField("data"))
> >  .withColumn("priority3", temp3("3").getField("priority"))
> >  .drop("1", "2", "3")
> >
> > +---++--+--+-+--+--+-+--
> +--+-+
> > | id|name|extra1| data1|priority1|extra2| data2|priority2|extra3|
> > data3|priority3|
> > +---++--+--+-+--+--+-+--
> +--+-+
> > |  1|Fred| 8|value1|1| 8|value8|2| 8|value5|
> > 3|
> > |  2| Amy| 9|value3|1| 9|value5|2|  null|  null|
> > null|
> > +---++--+--+-+--+--+-+--
> +--+-+
> >
> >
> >
> >
> >
> >
> >
> >>
> >>
> >> Jacek
> >>
> >> On 7 Feb 2017 8:02 p.m., "Everett Anderson" 
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I'm trying to un-explode or denormalize a table like
> >>>
> >>> +---++-+--++
> >>> |id |name|extra|data  |priority|
> >>> +---++-+--++
> >>> |1  |Fred|8|value1|1   |
> >>> |1  |Fred|8|value8|2   |
> >>> |1  |Fred|8|value5|3   |
> >>> |2  |Amy |9|value3|1   |
> >>> |2  |Amy |9|value5|2   |
> >>> +---++-+--++
> >>>
> >>> into something 

Re: Un-exploding / denormalizing Spark SQL help

2017-02-07 Thread Jacek Laskowski
Hi Everett,

That's pretty much what I'd do. Can't think of a way to beat your
solution. Why do you "feel vaguely uneasy about it"?

I'd also check out the execution plan (with explain) to see how it's
gonna work at runtime. I may have seen groupBy + join be better than
window (there were more exchanges in play for windows I reckon).

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Feb 7, 2017 at 10:54 PM, Everett Anderson  wrote:
>
>
> On Tue, Feb 7, 2017 at 12:50 PM, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> Could groupBy and withColumn or UDAF work perhaps? I think window could
>> help here too.
>
>
> This seems to work, but I do feel vaguely uneasy about it. :)
>
> // First add a 'rank' column which is priority order just in case priorities
> aren't
> // from 1 with no gaps.
> val temp1 = data.withColumn("rank", functions.dense_rank()
>.over(Window.partitionBy("id", "name").orderBy("priority")))
>
> +---++-+--+++
> | id|name|extra|  data|priority|rank|
> +---++-+--+++
> |  1|Fred|8|value1|   1|   1|
> |  1|Fred|8|value8|   2|   2|
> |  1|Fred|8|value5|   3|   3|
> |  2| Amy|9|value3|   1|   1|
> |  2| Amy|9|value5|   2|   2|
> +---++-+--+++
>
> // Now move all the columns we want to denormalize into a struct column to
> keep them together.
> val temp2 = temp1.withColumn("temp_struct", struct(temp1("extra"),
> temp1("data"), temp1("priority")))
>   .drop("extra", "data", "priority")
>
> +---++++
> | id|name|rank| temp_struct|
> +---++++
> |  1|Fred|   1|[8,value1,1]|
> |  1|Fred|   2|[8,value8,2]|
> |  1|Fred|   3|[8,value5,3]|
> |  2| Amy|   1|[9,value3,1]|
> |  2| Amy|   2|[9,value5,2]|
> +---++++
>
> // groupBy, again, but now pivot the rank column. We need an aggregate
> function after pivot,
> // so use first -- there will only ever be one element.
> val temp3 = temp2.groupBy("id", "name")
>   .pivot("rank", Seq("1", "2", "3"))
>   .agg(functions.first("temp_struct"))
>
> +---+++++
> | id|name|   1|   2|   3|
> +---+++++
> |  1|Fred|[8,value1,1]|[8,value8,2]|[8,value5,3]|
> |  2| Amy|[9,value3,1]|[9,value5,2]|null|
> +---+++++
>
> // Now just moving things out of the structs and clean up.
> val output = temp3.withColumn("extra1", temp3("1").getField("extra"))
>  .withColumn("data1", temp3("1").getField("data"))
>  .withColumn("priority1", temp3("1").getField("priority"))
>  .withColumn("extra2", temp3("2").getField("extra"))
>  .withColumn("data2", temp3("2").getField("data"))
>  .withColumn("priority2", temp3("2").getField("priority"))
>  .withColumn("extra3", temp3("3").getField("extra"))
>  .withColumn("data3", temp3("3").getField("data"))
>  .withColumn("priority3", temp3("3").getField("priority"))
>  .drop("1", "2", "3")
>
> +---++--+--+-+--+--+-+--+--+-+
> | id|name|extra1| data1|priority1|extra2| data2|priority2|extra3|
> data3|priority3|
> +---++--+--+-+--+--+-+--+--+-+
> |  1|Fred| 8|value1|1| 8|value8|2| 8|value5|
> 3|
> |  2| Amy| 9|value3|1| 9|value5|2|  null|  null|
> null|
> +---++--+--+-+--+--+-+--+--+-+
>
>
>
>
>
>
>
>>
>>
>> Jacek
>>
>> On 7 Feb 2017 8:02 p.m., "Everett Anderson" 
>> wrote:
>>>
>>> Hi,
>>>
>>> I'm trying to un-explode or denormalize a table like
>>>
>>> +---++-+--++
>>> |id |name|extra|data  |priority|
>>> +---++-+--++
>>> |1  |Fred|8|value1|1   |
>>> |1  |Fred|8|value8|2   |
>>> |1  |Fred|8|value5|3   |
>>> |2  |Amy |9|value3|1   |
>>> |2  |Amy |9|value5|2   |
>>> +---++-+--++
>>>
>>> into something that looks like
>>>
>>>
>>> +---++--+--+-+--+--+-+--+--+-+
>>> |id |name|extra1|data1 |priority1|extra2|data2 |priority2|extra3|data3
>>> |priority3|
>>>
>>> +---++--+--+-+--+--+-+--+--+-+
>>> |1  |Fred|8 |value1|1|8 |value8|2|8 |value5|3
>>> |
>>> |2  |Amy |9 |value3|1|9 |value5|2|null  |null
>>> |null |
>>>
>>> +---++--+--+-+--+--+-+--+--+-+
>>>
>>> If I were going the other direction, I'd create a new column with an
>>> array of structs, each with 'extra', 'data', and 'priority' fields and then
>>> 

Re: Un-exploding / denormalizing Spark SQL help

2017-02-07 Thread Everett Anderson
On Tue, Feb 7, 2017 at 12:50 PM, Jacek Laskowski  wrote:

> Hi,
>
> Could groupBy and withColumn or UDAF work perhaps? I think window could
> help here too.
>

This seems to work, but I do feel vaguely uneasy about it. :)

// First add a 'rank' column which is priority order just in case
priorities aren't
// from 1 with no gaps.
val temp1 = data.withColumn("rank", functions.dense_rank()
   .over(Window.partitionBy("id", "name").orderBy("priority")))

+---++-+--+++
| id|name|extra|  data|priority|rank|
+---++-+--+++
|  1|Fred|8|value1|   1|   1|
|  1|Fred|8|value8|   2|   2|
|  1|Fred|8|value5|   3|   3|
|  2| Amy|9|value3|   1|   1|
|  2| Amy|9|value5|   2|   2|
+---++-+--+++

// Now move all the columns we want to denormalize into a struct column to
keep them together.
val temp2 = temp1.withColumn("temp_struct", struct(temp1("extra"),
temp1("data"), temp1("priority")))
  .drop("extra", "data", "priority")

+---++++
| id|name|rank| temp_struct|
+---++++
|  1|Fred|   1|[8,value1,1]|
|  1|Fred|   2|[8,value8,2]|
|  1|Fred|   3|[8,value5,3]|
|  2| Amy|   1|[9,value3,1]|
|  2| Amy|   2|[9,value5,2]|
+---++++

// groupBy, again, but now pivot the rank column. We need an aggregate
function after pivot,
// so use first -- there will only ever be one element.
val temp3 = temp2.groupBy("id", "name")
  .pivot("rank", Seq("1", "2", "3"))
  .agg(functions.first("temp_struct"))

+---+++++
| id|name|   1|   2|   3|
+---+++++
|  1|Fred|[8,value1,1]|[8,value8,2]|[8,value5,3]|
|  2| Amy|[9,value3,1]|[9,value5,2]|null|
+---+++++

// Now just moving things out of the structs and clean up.
val output = temp3.withColumn("extra1", temp3("1").getField("extra"))
 .withColumn("data1", temp3("1").getField("data"))
 .withColumn("priority1", temp3("1").getField("priority"))
 .withColumn("extra2", temp3("2").getField("extra"))
 .withColumn("data2", temp3("2").getField("data"))
 .withColumn("priority2", temp3("2").getField("priority"))
 .withColumn("extra3", temp3("3").getField("extra"))
 .withColumn("data3", temp3("3").getField("data"))
 .withColumn("priority3", temp3("3").getField("priority"))
 .drop("1", "2", "3")

+---++--+--+-+--+--+-+--+--+-+
| id|name|extra1| data1|priority1|extra2| data2|priority2|extra3|
data3|priority3|
+---++--+--+-+--+--+-+--+--+-+
|  1|Fred| 8|value1|1| 8|value8|2| 8|value5|
 3|
|  2| Amy| 9|value3|1| 9|value5|2|  null|  null|
  null|
+---++--+--+-+--+--+-+--+--+-+








>
> Jacek
>
> On 7 Feb 2017 8:02 p.m., "Everett Anderson" 
> wrote:
>
>> Hi,
>>
>> I'm trying to un-explode or denormalize a table like
>>
>> +---++-+--++
>> |id |name|extra|data  |priority|
>> +---++-+--++
>> |1  |Fred|8|value1|1   |
>> |1  |Fred|8|value8|2   |
>> |1  |Fred|8|value5|3   |
>> |2  |Amy |9|value3|1   |
>> |2  |Amy |9|value5|2   |
>> +---++-+--++
>>
>> into something that looks like
>>
>> +---++--+--+-+--+--+-+--
>> +--+-+
>> |id |name|extra1|data1 |priority1|extra2|data2 |priority2|extra3|data3
>> |priority3|
>> +---++--+--+-+--+--+-+--
>> +--+-+
>> |1  |Fred|8 |value1|1|8 |value8|2|8 |value5|3
>>|
>> |2  |Amy |9 |value3|1|9 |value5|2|null  |null
>>  |null |
>> +---++--+--+-+--+--+-+--
>> +--+-+
>>
>> If I were going the other direction, I'd create a new column with an
>> array of structs, each with 'extra', 'data', and 'priority' fields and then
>> explode it.
>>
>> Going from the more normalized view, though, I'm having a harder time.
>>
>> I want to group or partition by (id, name) and order by priority, but
>> after that I can't figure out how to get multiple rows rotated into one.
>>
>> Any ideas?
>>
>> Here's the code to create the input table above:
>>
>> import org.apache.spark.sql.Row
>> import org.apache.spark.sql.Dataset
>> import org.apache.spark.sql.types._
>>
>> val rowsRDD = sc.parallelize(Seq(
>> Row(1, "Fred", 8, "value1", 1),
>> Row(1, "Fred", 8, "value8", 2),
>> Row(1, "Fred", 8, "value5", 3),
>> Row(2, "Amy", 9, "value3", 1),
>> Row(2, "Amy", 9, "value5", 2)))
>>
>> val schema = StructType(Seq(
>> StructField("id", IntegerType, nullable = true),
>> 

Re: Un-exploding / denormalizing Spark SQL help

2017-02-07 Thread Jacek Laskowski
Hi,

Could groupBy and withColumn or UDAF work perhaps? I think window could
help here too.

Jacek

On 7 Feb 2017 8:02 p.m., "Everett Anderson" 
wrote:

> Hi,
>
> I'm trying to un-explode or denormalize a table like
>
> +---++-+--++
> |id |name|extra|data  |priority|
> +---++-+--++
> |1  |Fred|8|value1|1   |
> |1  |Fred|8|value8|2   |
> |1  |Fred|8|value5|3   |
> |2  |Amy |9|value3|1   |
> |2  |Amy |9|value5|2   |
> +---++-+--++
>
> into something that looks like
>
> +---++--+--+-+--+--+-+--
> +--+-+
> |id |name|extra1|data1 |priority1|extra2|data2 |priority2|extra3|data3
> |priority3|
> +---++--+--+-+--+--+-+--
> +--+-+
> |1  |Fred|8 |value1|1|8 |value8|2|8 |value5|3
>|
> |2  |Amy |9 |value3|1|9 |value5|2|null  |null
>  |null |
> +---++--+--+-+--+--+-+--
> +--+-+
>
> If I were going the other direction, I'd create a new column with an array
> of structs, each with 'extra', 'data', and 'priority' fields and then
> explode it.
>
> Going from the more normalized view, though, I'm having a harder time.
>
> I want to group or partition by (id, name) and order by priority, but
> after that I can't figure out how to get multiple rows rotated into one.
>
> Any ideas?
>
> Here's the code to create the input table above:
>
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Dataset
> import org.apache.spark.sql.types._
>
> val rowsRDD = sc.parallelize(Seq(
> Row(1, "Fred", 8, "value1", 1),
> Row(1, "Fred", 8, "value8", 2),
> Row(1, "Fred", 8, "value5", 3),
> Row(2, "Amy", 9, "value3", 1),
> Row(2, "Amy", 9, "value5", 2)))
>
> val schema = StructType(Seq(
> StructField("id", IntegerType, nullable = true),
> StructField("name", StringType, nullable = true),
> StructField("extra", IntegerType, nullable = true),
> StructField("data", StringType, nullable = true),
> StructField("priority", IntegerType, nullable = true)))
>
> val data = sqlContext.createDataFrame(rowsRDD, schema)
>
>
>
>


Re: Tableau BI on Spark SQL

2017-01-30 Thread Todd Nist
Hi Mich,

You could look at http://www.exasol.com/.  It works very well with Tableau
without the need to extract the data.  Also in V6, it has the virtual
schemas which would allow you to access data in Spark, Hive, Oracle, or
other sources.

It may be outside of what you are looking for, but it works well for us.  We did
the extract route originally, but with the native Exasol connector it is
just as performant as the extract.

HTH.

-Todd


On Mon, Jan 30, 2017 at 10:15 PM, Jörn Franke  wrote:

> With a lot of data (TB) it is not that good, hence the extraction.
> Otherwise you have to wait every time you do drag and drop. With the
> extracts it is better.
>
> On 30 Jan 2017, at 22:59, Mich Talebzadeh 
> wrote:
>
> Thanks Jorn,
>
> So Tableau uses its own in-memory representation as I guessed. Now the
> question is: how is performance when accessing data in Oracle tables?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
> On 30 January 2017 at 21:51, Jörn Franke  wrote:
>
>> Depending on the size of the data i recommend to schedule regularly an
>> extract in tableau. There tableau converts it to an internal in-memory
>> representation outside of Spark (can also exist on disk if memory is too
>> small) and then use it within Tableau. Accessing the database directly is
>> not so efficient.
>> Additionally use always the newest version of tableau..
>>
>> On 30 Jan 2017, at 21:57, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> Has anyone tried using Tableau on Spark SQL?
>>
>> Specifically how does Tableau handle in-memory capabilities of Spark.
>>
>> As I understand, Tableau uses its own proprietary SQL against, say, Oracle.
>> That is well established. So for each product Tableau will try to use its
>> own version of SQL against that product, like Spark
>> or Hive.
>>
>> However, when I last tried Tableau on Hive, the mapping and performance
>> was not that good in comparison with the same tables and data in Hive.
>>
>> My approach has been to take Oracle 11.g sh schema
>> containing
>> star schema and create and ingest the same tables and data  into Hive
>> tables. Then run Tableau against these tables and do the performance
>> comparison. Given that Oracle is widely used with Tableau this test makes
>> sense?
>>
>> Thanks.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>>
>>
>


Re: Tableau BI on Spark SQL

2017-01-30 Thread Jörn Franke
With a lot of data (TB) it is not that good, hence the extraction. Otherwise 
you have to wait every time you do drag and drop. With the extracts it is 
better.

> On 30 Jan 2017, at 22:59, Mich Talebzadeh  wrote:
> 
> Thanks Jorn,
> 
> So Tableau uses its own in-memory representation as I guessed. Now the
> question is: how is performance when accessing data in Oracle tables?
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
>  
> 
>> On 30 January 2017 at 21:51, Jörn Franke  wrote:
>> Depending on the size of the data i recommend to schedule regularly an 
>> extract in tableau. There tableau converts it to an internal in-memory 
>> representation outside of Spark (can also exist on disk if memory is too 
>> small) and then use it within Tableau. Accessing the database directly is
>> not so efficient.
>> Additionally use always the newest version of tableau..
>> 
>>> On 30 Jan 2017, at 21:57, Mich Talebzadeh  wrote:
>>> 
>>> Hi,
>>> 
>>> Has anyone tried using Tableau on Spark SQL?
>>> 
>>> Specifically how does Tableau handle in-memory capabilities of Spark.
>>> 
>>> As I understand, Tableau uses its own proprietary SQL against, say, Oracle. That
>>> is well established. So for each product Tableau will try to use its own
>>> version of SQL against that product, like Spark
>>> or Hive.
>>>
>>> However, when I last tried Tableau on Hive, the mapping and performance was
>>> not that good in comparison with the same tables and data in Hive.
>>> 
>>> My approach has been to take Oracle 11.g sh schema containing star schema 
>>> and create and ingest the same tables and data  into Hive tables. Then run 
>>> Tableau against these tables and do the performance comparison. Given that 
>>> Oracle is widely used with Tableau this test makes sense?
>>> 
>>> Thanks.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>> 
>>>  
> 


Re: Tableau BI on Spark SQL

2017-01-30 Thread Mich Talebzadeh
Thanks Jorn,

So Tableau uses its own in-memory representation as I guessed. Now the
question is: how is performance when accessing data in Oracle tables?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 January 2017 at 21:51, Jörn Franke  wrote:

> Depending on the size of the data i recommend to schedule regularly an
> extract in tableau. There tableau converts it to an internal in-memory
> representation outside of Spark (can also exist on disk if memory is too
> small) and then use it within Tableau. Accessing directly  the database is
> not so efficient.
> Additionally use always the newest version of tableau..
>
> On 30 Jan 2017, at 21:57, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> Has anyone tried using Tableau on Spark SQL?
>
> Specifically how does Tableau handle in-memory capabilities of Spark.
>
> As I understand Tableau uses its own propriety SQL against say Oracle.
> That is well established. So for each product Tableau will try to use its
> own version of SQL against that product  like Spark
> or Hive.
>
> However, when I last tried Tableau on Hive, the mapping and performance
> was not that good in comparision with the same tables and data in Hive..
>
> My approach has been to take Oracle 11.g sh schema
> containing
> star schema and create and ingest the same tables and data  into Hive
> tables. Then run Tableau against these tables and do the performance
> comparison. Given that Oracle is widely used with Tableau this test makes
> sense?
>
> Thanks.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>


Re: Tableau BI on Spark SQL

2017-01-30 Thread Jörn Franke
Depending on the size of the data I recommend scheduling a regular extract
in Tableau. Tableau converts it to an internal in-memory representation outside
of Spark (it can also spill to disk if memory is too small) and then uses it
within Tableau. Accessing the database directly is not as efficient.
Additionally, always use the newest version of Tableau.

> On 30 Jan 2017, at 21:57, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> Has anyone tried using Tableau on Spark SQL?
> 
> Specifically how does Tableau handle in-memory capabilities of Spark.
> 
> As I understand it, Tableau uses its own proprietary SQL against, say, Oracle. That
> is well established. So for each product Tableau will try to use its own
> version of SQL against that product, like Spark
> or Hive.
> 
> However, when I last tried Tableau on Hive, the mapping and performance were
> not that good in comparison with the same tables and data in Hive.
> 
> My approach has been to take the Oracle 11g SH schema containing a star schema,
> create and ingest the same tables and data into Hive tables, and then run
> Tableau against these tables and do the performance comparison. Given that
> Oracle is widely used with Tableau, does this test make sense?
> 
> Thanks.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  


Re: help!!!----issue with spark-sql type cast form long to longwritable

2017-01-30 Thread Alex
Hi All,

If I modify the code as below, the Hive UDF works in spark-sql but it
gives different results. Please let me know the difference between the
two versions of the code below.

1) public Object get(Object name) {
     int pos = getPos((String) name);
     if (pos < 0) return null;
     String f = "string";
     Object obj = list.get(pos);
     Object result = null;
     if (obj == null) return null;
     // Extract the value through the field's ObjectInspector rather than
     // casting the raw object to a Hadoop Writable.
     ObjectInspector ins =
         ((StructField) colnames.get(pos)).getFieldObjectInspector();
     if (ins != null) f = ins.getTypeName();
     PrimitiveObjectInspector ins2 = (PrimitiveObjectInspector) ins;
     switch (ins2.getPrimitiveCategory()) {
       case DOUBLE:
         result = new Double(((DoubleObjectInspector) ins2).get(obj));
         break;
       case LONG:
         result = new Long(((LongObjectInspector) ins2).get(obj));
         break;
       case STRING:
         result = ((StringObjectInspector) ins2).getPrimitiveJavaObject(obj);
         break;
       default:
         result = obj;
     }
     return result;
   }






2) public Object get(Object name) {
     int pos = getPos((String) name);
     if (pos < 0) return null;
     String f = "string";
     Object obj = list.get(pos);
     if (obj == null) return null;
     ObjectInspector ins =
         ((StructField) colnames.get(pos)).getFieldObjectInspector();
     if (ins != null) f = ins.getTypeName();
     // Cast the raw object directly to the Hadoop Writable type implied by
     // the type name, instead of going through the ObjectInspector as in 1).
     switch (f) {
       case "double": return ((DoubleWritable) obj).get();
       case "bigint": return ((LongWritable) obj).get();
       case "string": return ((Text) obj).toString();
       default:       return obj;
     }
   }

On Tue, Jan 24, 2017 at 5:29 PM, Sirisha Cheruvu  wrote:

> Hi Team,
>
> I am trying to keep the below code in a get method, calling that get method in
> another Hive UDF,
> and running the Hive UDF using the HiveContext.sql procedure.
>
>
> switch (f) {
> case "double" :  return ((DoubleWritable)obj).get();
> case "bigint" :  return ((LongWritable)obj).get();
> case "string" :  return ((Text)obj).toString();
> default  :  return obj;
>   }
> }
>
> Surprisingly, only the LongWritable and Text conversions are throwing errors, but
> DoubleWritable is working.
> So I tried changing the code to:
>
> switch (f) {
> case "double" :  return ((DoubleWritable)obj).get();
> case "bigint" :  return ((DoubleWritable)obj).get();
> case "string" :  return ((Text)obj).toString();
> default  :  return obj;
>   }
> }
>
> Still it throws an error saying java.lang.Long cannot be converted
> to org.apache.hadoop.hive.serde2.io.DoubleWritable.
>
>
>
> It works fine on Hive but throws an error on spark-sql.
>
> I am importing the below packages.
> import java.util.*;
> import org.apache.hadoop.hive.serde2.objectinspector.*;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.hive.serde2.io.DoubleWritable;
>
> Please let me know why it causes an issue in Spark when it runs perfectly
> fine on Hive.
>


Re: help!!!----issue with spark-sql type cast form long to longwritable

2017-01-30 Thread Alex
How to debug Hive UDFs?!

On Jan 24, 2017 5:29 PM, "Sirisha Cheruvu"  wrote:

> Hi Team,
>
> I am trying to keep below code in get method and calling that get mthod in
> another hive UDF
> and running the hive UDF using Hive Context.sql procedure..
>
>
> switch (f) {
> case "double" :  return ((DoubleWritable)obj).get();
> case "bigint" :  return ((LongWritable)obj).get();
> case "string" :  return ((Text)obj).toString();
> default  :  return obj;
>   }
> }
>
> Suprisingly only LongWritable and Text convrsions are throwing error but
> DoubleWritable is working
> So I tried changing below code to
>
> switch (f) {
> case "double" :  return ((DoubleWritable)obj).get();
> case "bigint" :  return ((DoubleWritable)obj).get();
> case "string" :  return ((Text)obj).toString();
> default  :  return obj;
>   }
> }
>
> Still its throws error saying Java.Lang.Long cant be convrted
> to org.apache.hadoop.hive.serde2.io.DoubleWritable
>
>
>
> its working fine on hive but throwing error on spark-sql
>
> I am importing the below packages.
> import java.util.*;
> import org.apache.hadoop.hive.serde2.objectinspector.*;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.hive.serde2.io.DoubleWritable;
>
> .Please let me know why it is making issue in spark when perfectly running
> fine on hive
>


Re: help!!!----issue with spark-sql type cast form long to longwritable

2017-01-30 Thread Alex
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error:
java.lang.Double cannot be cast to
org.apache.hadoop.hive.serde2.io.DoubleWritable]

Getting this error while running the Hive UDF on Spark, although the UDF
works perfectly fine in Hive.


public Object get(Object name) {
  int pos = getPos((String) name);
  if (pos < 0) return null;
  String f = "string";
  Object obj = list.get(pos);
  if (obj == null) return null;
  ObjectInspector ins =
      ((StructField) colnames.get(pos)).getFieldObjectInspector();
  if (ins != null) f = ins.getTypeName();
  switch (f) {
    case "double": return ((DoubleWritable) obj).get();
    case "bigint": return ((LongWritable) obj).get();
    case "string": return ((Text) obj).toString();
    default:       return obj;
  }
}



On Tue, Jan 24, 2017 at 9:19 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> Could you show us the whole code to reproduce that?
>
> // maropu
>
> On Wed, Jan 25, 2017 at 12:02 AM, Deepak Sharma 
> wrote:
>
>> Can you try writing the UDF directly in spark and register it with spark
>> sql or hive context ?
>> Or do you want to reuse the existing UDF jar for hive in spark ?
>>
>> Thanks
>> Deepak
>>
>> On Jan 24, 2017 5:29 PM, "Sirisha Cheruvu"  wrote:
>>
>>> Hi Team,
>>>
>>> I am trying to keep below code in get method and calling that get mthod
>>> in another hive UDF
>>> and running the hive UDF using Hive Context.sql procedure..
>>>
>>>
>>> switch (f) {
>>> case "double" :  return ((DoubleWritable)obj).get();
>>> case "bigint" :  return ((LongWritable)obj).get();
>>> case "string" :  return ((Text)obj).toString();
>>> default  :  return obj;
>>>   }
>>> }
>>>
>>> Suprisingly only LongWritable and Text convrsions are throwing error but
>>> DoubleWritable is working
>>> So I tried changing below code to
>>>
>>> switch (f) {
>>> case "double" :  return ((DoubleWritable)obj).get();
>>> case "bigint" :  return ((DoubleWritable)obj).get();
>>> case "string" :  return ((Text)obj).toString();
>>> default  :  return obj;
>>>   }
>>> }
>>>
>>> Still its throws error saying Java.Lang.Long cant be convrted
>>> to org.apache.hadoop.hive.serde2.io.DoubleWritable
>>>
>>>
>>>
>>> its working fine on hive but throwing error on spark-sql
>>>
>>> I am importing the below packages.
>>> import java.util.*;
>>> import org.apache.hadoop.hive.serde2.objectinspector.*;
>>> import org.apache.hadoop.io.LongWritable;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.hive.serde2.io.DoubleWritable;
>>>
>>> .Please let me know why it is making issue in spark when perfectly
>>> running fine on hive
>>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: help!!!----issue with spark-sql type cast form long to longwritable

2017-01-24 Thread Takeshi Yamamuro
Hi,

Could you show us the whole code to reproduce that?

// maropu

On Wed, Jan 25, 2017 at 12:02 AM, Deepak Sharma 
wrote:

> Can you try writing the UDF directly in spark and register it with spark
> sql or hive context ?
> Or do you want to reuse the existing UDF jar for hive in spark ?
>
> Thanks
> Deepak
>
> On Jan 24, 2017 5:29 PM, "Sirisha Cheruvu"  wrote:
>
>> Hi Team,
>>
>> I am trying to keep below code in get method and calling that get mthod
>> in another hive UDF
>> and running the hive UDF using Hive Context.sql procedure..
>>
>>
>> switch (f) {
>> case "double" :  return ((DoubleWritable)obj).get();
>> case "bigint" :  return ((LongWritable)obj).get();
>> case "string" :  return ((Text)obj).toString();
>> default  :  return obj;
>>   }
>> }
>>
>> Suprisingly only LongWritable and Text convrsions are throwing error but
>> DoubleWritable is working
>> So I tried changing below code to
>>
>> switch (f) {
>> case "double" :  return ((DoubleWritable)obj).get();
>> case "bigint" :  return ((DoubleWritable)obj).get();
>> case "string" :  return ((Text)obj).toString();
>> default  :  return obj;
>>   }
>> }
>>
>> Still its throws error saying Java.Lang.Long cant be convrted
>> to org.apache.hadoop.hive.serde2.io.DoubleWritable
>>
>>
>>
>> its working fine on hive but throwing error on spark-sql
>>
>> I am importing the below packages.
>> import java.util.*;
>> import org.apache.hadoop.hive.serde2.objectinspector.*;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.hive.serde2.io.DoubleWritable;
>>
>> .Please let me know why it is making issue in spark when perfectly
>> running fine on hive
>>
>


-- 
---
Takeshi Yamamuro


Re: help!!!----issue with spark-sql type cast form long to longwritable

2017-01-24 Thread Deepak Sharma
Can you try writing the UDF directly in spark and register it with spark
sql or hive context ?
Or do you want to reuse the existing UDF jar for hive in spark ?
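
A minimal sketch of the first option, assuming a spark-shell session where
sqlContext is a HiveContext; "clean_str", "some_column" and "some_table" are
only placeholders:

// Register a native Spark UDF instead of going through the Hive Writable types.
sqlContext.udf.register("clean_str", (s: String) => if (s == null) null else s.trim)
sqlContext.sql("SELECT clean_str(some_column) FROM some_table").show()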

Thanks
Deepak

On Jan 24, 2017 5:29 PM, "Sirisha Cheruvu"  wrote:

> Hi Team,
>
> I am trying to keep the below code in a get method, calling that get method in
> another Hive UDF,
> and running the Hive UDF using the HiveContext.sql procedure.
>
>
> switch (f) {
> case "double" :  return ((DoubleWritable)obj).get();
> case "bigint" :  return ((LongWritable)obj).get();
> case "string" :  return ((Text)obj).toString();
> default  :  return obj;
>   }
> }
>
> Surprisingly, only the LongWritable and Text conversions are throwing errors, but
> DoubleWritable is working.
> So I tried changing the code to:
>
> switch (f) {
> case "double" :  return ((DoubleWritable)obj).get();
> case "bigint" :  return ((DoubleWritable)obj).get();
> case "string" :  return ((Text)obj).toString();
> default  :  return obj;
>   }
> }
>
> Still it throws an error saying java.lang.Long cannot be converted
> to org.apache.hadoop.hive.serde2.io.DoubleWritable.
>
>
>
> It works fine on Hive but throws an error on spark-sql.
>
> I am importing the below packages.
> import java.util.*;
> import org.apache.hadoop.hive.serde2.objectinspector.*;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.hive.serde2.io.DoubleWritable;
>
> Please let me know why it causes an issue in Spark when it runs perfectly
> fine on Hive.
>


Re: time to run Spark SQL query

2016-11-28 Thread ayan guha
They should take the same time if everything else is constant.
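
A quick way to check this yourself is to compare the plans Spark builds for
both forms; both go through the same Catalyst optimizer. A rough sketch,
assuming a spark-shell session where sqlContext is defined and using the
placeholder table/column names from the question:

val viaSql = sqlContext.sql("SELECT column_name FROM tablename WHERE column_name = 'x'")
val viaApi = sqlContext.table("tablename").select("column_name").where("column_name = 'x'")

viaSql.explain(true)   // prints the parsed, analyzed, optimized and physical plans
viaApi.explain(true)   // the optimized plan should come out equivalent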
On 28 Nov 2016 23:41, "Hitesh Goyal"  wrote:

> Hi team, I am using spark SQL for accessing the amazon S3 bucket data.
>
> If I run a sql query by using normal SQL syntax like below
>
> 1)  DataFrame d=sqlContext.sql(i.e. Select * from tablename where
> column_condition);
>
>
>
> Secondly, if I use dataframe functions for the same query like below :-
>
> 2)  dataframe.select(column_name).where(column_condition);
>
>
>
> Now there is a question arising in my mind that which query would take
> more time to execute if I run both on the same dataset.
>
> Or both would execute in the same time duration. Please suggest your
> answer.
>
> Regards,
>
> *Hitesh Goyal*
>
> Simpli5d Technologies
>
> Cont No.: 9996588220
>
>
>


Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread Russell Spitzer
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md

Has all the information about DataFrames / Spark SQL.
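
For example, something along these lines from spark-shell (a sketch based on
that doc; the keyspace and table names are placeholders, and the connector
package must be on the classpath):

val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()

df.registerTempTable("my_table")   // createOrReplaceTempView on Spark 2.x
sqlContext.sql("SELECT * FROM my_table LIMIT 10").show()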

On Fri, Nov 11, 2016 at 8:52 AM kant kodali  wrote:

> Wait I cannot create CassandraSQLContext from spark-shell. is this only
> for enterprise versions?
>
> Thanks!
>
> On Fri, Nov 11, 2016 at 8:14 AM, kant kodali  wrote:
>
>
> https://academy.datastax.com/courses/ds320-analytics-apache-spark/spark-sql-spark-sql-basics
>
> On Fri, Nov 11, 2016 at 8:11 AM, kant kodali  wrote:
>
> Hi,
>
> This is spark-cassandra-connector
>  but I am looking
> more for how to use SPARK SQL and expose as a JDBC server for Cassandra.
>
> Thanks!
>
>
> On Fri, Nov 11, 2016 at 8:07 AM, Yong Zhang  wrote:
>
> Read the document on https://github.com/datastax/spark-cassandra-connector
>
>
> Yong
>
>
>
> --
> *From:* kant kodali 
> *Sent:* Friday, November 11, 2016 11:04 AM
> *To:* user @spark
> *Subject:* How to use Spark SQL to connect to Cassandra from Spark-Shell?
>
> How to use Spark SQL to connect to Cassandra from Spark-Shell?
>
> Any examples ? I use Java 8.
>
> Thanks!
> kant
>
>
>
>
>


Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread kant kodali
Wait, I cannot create a CassandraSQLContext from spark-shell. Is this only for
enterprise versions?

Thanks!

On Fri, Nov 11, 2016 at 8:14 AM, kant kodali  wrote:

> https://academy.datastax.com/courses/ds320-analytics-
> apache-spark/spark-sql-spark-sql-basics
>
> On Fri, Nov 11, 2016 at 8:11 AM, kant kodali  wrote:
>
>> Hi,
>>
>> This is spark-cassandra-connector
>>  but I am looking
>> more for how to use SPARK SQL and expose as a JDBC server for Cassandra.
>>
>> Thanks!
>>
>>
>> On Fri, Nov 11, 2016 at 8:07 AM, Yong Zhang  wrote:
>>
>>> Read the document on https://github.com/datastax
>>> /spark-cassandra-connector
>>>
>>>
>>> Yong
>>>
>>>
>>>
>>> --
>>> *From:* kant kodali 
>>> *Sent:* Friday, November 11, 2016 11:04 AM
>>> *To:* user @spark
>>> *Subject:* How to use Spark SQL to connect to Cassandra from
>>> Spark-Shell?
>>>
>>> How to use Spark SQL to connect to Cassandra from Spark-Shell?
>>>
>>> Any examples ? I use Java 8.
>>>
>>> Thanks!
>>> kant
>>>
>>
>>
>


Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread kant kodali
https://academy.datastax.com/courses/ds320-analytics-apache-spark/spark-sql-spark-sql-basics

On Fri, Nov 11, 2016 at 8:11 AM, kant kodali  wrote:

> Hi,
>
> This is spark-cassandra-connector
>  but I am looking
> more for how to use SPARK SQL and expose as a JDBC server for Cassandra.
>
> Thanks!
>
>
> On Fri, Nov 11, 2016 at 8:07 AM, Yong Zhang  wrote:
>
>> Read the document on https://github.com/datastax
>> /spark-cassandra-connector
>>
>>
>> Yong
>>
>>
>>
>> --
>> *From:* kant kodali 
>> *Sent:* Friday, November 11, 2016 11:04 AM
>> *To:* user @spark
>> *Subject:* How to use Spark SQL to connect to Cassandra from Spark-Shell?
>>
>> How to use Spark SQL to connect to Cassandra from Spark-Shell?
>>
>> Any examples ? I use Java 8.
>>
>> Thanks!
>> kant
>>
>
>


Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread kant kodali
Hi,

This is the spark-cassandra-connector, but I am looking
more for how to use Spark SQL and expose it as a JDBC server for Cassandra.

Thanks!


On Fri, Nov 11, 2016 at 8:07 AM, Yong Zhang  wrote:

> Read the document on https://github.com/datastax/spark-cassandra-connector
>
>
> Yong
>
>
>
> --
> *From:* kant kodali 
> *Sent:* Friday, November 11, 2016 11:04 AM
> *To:* user @spark
> *Subject:* How to use Spark SQL to connect to Cassandra from Spark-Shell?
>
> How to use Spark SQL to connect to Cassandra from Spark-Shell?
>
> Any examples ? I use Java 8.
>
> Thanks!
> kant
>


Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread Yong Zhang
Read the document on https://github.com/datastax/spark-cassandra-connector


Yong




From: kant kodali 
Sent: Friday, November 11, 2016 11:04 AM
To: user @spark
Subject: How to use Spark SQL to connect to Cassandra from Spark-Shell?

How to use Spark SQL to connect to Cassandra from Spark-Shell?

Any examples ? I use Java 8.

Thanks!
kant


Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-02 Thread Michael David Pedersen
Awesome, thank you Michael for the detailed example!

I'll look into whether I can use this approach for my use case. If so, I
could avoid the overhead of repeatedly registering a temp table for one-off
queries, instead registering the table once and relying on the injected
strategy. I don't know how much of an impact this overhead has in practice,
though.

Cheers,
Michael


Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael Armbrust
registerTempTable is backed by an in-memory hash table that maps table name
(a string) to a logical query plan.  Fragments of that logical query plan
may or may not be cached (but calling register alone will not result in any
materialization of results).  In Spark 2.0 we renamed this function to
createOrReplaceTempView, since a traditional RDBMS view is a better analogy
here.
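
Roughly, in Spark 2.0 terms (df and the view name "events" are placeholders):

df.createOrReplaceTempView("events")        // only records a name -> plan mapping
spark.catalog.cacheTable("events")          // explicit caching is a separate step
spark.sql("SELECT count(*) FROM events").show()
spark.catalog.uncacheTable("events")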

If I was trying to augment the engine to make better use of HBase's
internal ordering, I'd probably use the experimental ability to inject
extra strategies into the query planner.  Essentially, you could look for
filters on top of BaseRelations (the internal class used to map DataSources
into the query plan) where there is a range filter on some prefix of the
table's key.  When this is detected, you could return an RDD that contains
the already filtered result talking directly to HBase, which would override
the default execution pathway.

I wrote up a (toy) example of using this API
,
which might be helpful.
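
The wiring for that looks roughly like the skeleton below; the actual
pattern-matching on the range filter and the custom scan are the hard part
and depend on internal APIs that vary between versions, so treat this as a
sketch only:

import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

object KeyRangeScanStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // Here: match a Filter with a range condition on a prefix of the key over
    // the HBase-backed relation, and return a physical plan that scans only
    // that key range.
    case _ => Nil   // returning Nil falls back to the default planner
  }
}

sqlContext.experimental.extraStrategies = Seq(KeyRangeScanStrategy)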

On Tue, Nov 1, 2016 at 4:11 AM, Mich Talebzadeh 
wrote:

> it would be great if we establish this.
>
> I know in Hive these temporary tables "CREATE TEMPRARY TABLE ..." are
> private to the session and are put in a hidden staging directory as below
>
> /user/hive/warehouse/.hive-staging_hive_2016-07-10_22-58-
> 47_319_5605745346163312826-10
>
> and removed when the session ends or table is dropped
>
> Not sure how Spark handles this.
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 1 November 2016 at 10:50, Michael David Pedersen  googlemail.com> wrote:
>
>> Thanks for the link, I hadn't come across this.
>>
>> According to https://forums.databricks.com/questions/400/what-is-the-diff
>>> erence-between-registertemptable-a.html
>>>
>>> and I quote
>>>
>>> "registerTempTable()
>>>
>>> registerTempTable() creates an in-memory table that is scoped to the
>>> cluster in which it was created. The data is stored using Hive's
>>> highly-optimized, in-memory columnar format."
>>>
>> But then the last post in the thread corrects this, saying:
>> "registerTempTable does not create a 'cached' in-memory table, but rather
>> an alias or a reference to the DataFrame. It's akin to a pointer in C/C++
>> or a reference in Java".
>>
>> So - probably need to dig into the sources to get more clarity on this.
>>
>> Cheers,
>> Michael
>>
>
>


Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Mich Talebzadeh
it would be great if we establish this.

I know in Hive these temporary tables "CREATE TEMPORARY TABLE ..." are
private to the session and are put in a hidden staging directory as below

/user/hive/warehouse/.hive-staging_hive_2016-07-10_22-58-47_319_5605745346163312826-10

and removed when the session ends or table is dropped

Not sure how Spark handles this.

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 November 2016 at 10:50, Michael David Pedersen <
michael.d.peder...@googlemail.com> wrote:

> Thanks for the link, I hadn't come across this.
>
> According to https://forums.databricks.com/questions/400/what-is-the-diff
>> erence-between-registertemptable-a.html
>>
>> and I quote
>>
>> "registerTempTable()
>>
>> registerTempTable() creates an in-memory table that is scoped to the
>> cluster in which it was created. The data is stored using Hive's
>> highly-optimized, in-memory columnar format."
>>
> But then the last post in the thread corrects this, saying:
> "registerTempTable does not create a 'cached' in-memory table, but rather
> an alias or a reference to the DataFrame. It's akin to a pointer in C/C++
> or a reference in Java".
>
> So - probably need to dig into the sources to get more clarity on this.
>
> Cheers,
> Michael
>


Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael David Pedersen
Thanks for the link, I hadn't come across this.

According to https://forums.databricks.com/questions/400/what-is-the-
> difference-between-registertemptable-a.html
>
> and I quote
>
> "registerTempTable()
>
> registerTempTable() creates an in-memory table that is scoped to the
> cluster in which it was created. The data is stored using Hive's
> highly-optimized, in-memory columnar format."
>
But then the last post in the thread corrects this, saying:
"registerTempTable does not create a 'cached' in-memory table, but rather
an alias or a reference to the DataFrame. It's akin to a pointer in C/C++
or a reference in Java".

So - probably need to dig into the sources to get more clarity on this.

Cheers,
Michael


Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Mich Talebzadeh
A bit of gray area here I am afraid, I was trying to experiment with it

According to
https://forums.databricks.com/questions/400/what-is-the-difference-between-registertemptable-a.html

and I quote

"registerTempTable()

registerTempTable() creates an in-memory table that is scoped to the
cluster in which it was created. The data is stored using Hive's
highly-optimized, in-memory columnar format."


So on the face of it tempTable is an in-memory table

HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 November 2016 at 10:01, Michael David Pedersen <
michael.d.peder...@googlemail.com> wrote:

> Hi again Mich,
>
> "But the thing is that I don't explicitly cache the tempTables ..".
>>
>> I believe tempTable is created in-memory and is already cached
>>
>
> That surprises me since there is a sqlContext.cacheTable method to
> explicitly cache a table in memory. Or am I missing something? This could
> explain why I'm seeing somewhat worse performance than I'd expect.
>
> Cheers,
> Michael
>


Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael David Pedersen
Hi again Mich,

"But the thing is that I don't explicitly cache the tempTables ..".
>
> I believe tempTable is created in-memory and is already cached
>

That surprises me since there is a sqlContext.cacheTable method to
explicitly cache a table in memory. Or am I missing something? This could
explain why I'm seeing somewhat worse performance than I'd expect.
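
For reference, the calls in question (the table name is a placeholder):

sqlContext.cacheTable("tmp_range")      // explicitly cache the registered table
sqlContext.isCached("tmp_range")        // check whether it is currently cached
sqlContext.uncacheTable("tmp_range")    // release it again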

Cheers,
Michael


Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Mich Talebzadeh
Well, I suppose one can drop the tempTable as below:

scala> df.registerTempTable("tmp")

scala> spark.sql("select count(1) from tmp").show
++
|count(1)|
++
|  904180|
++

scala> spark.sql("drop table if exists tmp")
res22: org.apache.spark.sql.DataFrame = []

Also your point

"But the thing is that I don't explicitly cache the tempTables ..".

I believe tempTable is created in-memory and is already cached

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 October 2016 at 14:16, Michael David Pedersen <
michael.d.peder...@googlemail.com> wrote:

> Hi Mich,
>
> Thank you again for your reply.
>
> As I see you are caching the table already sorted
>>
>> val keyValRDDSorted = keyValRDD.sortByKey().cache
>>
>> and the next stage is you are creating multiple tempTables (different
>> ranges) that cache a subset of rows already cached in RDD. The data stored
>> in tempTable is in Hive columnar format (I assume that means ORC format)
>>
>
> But the thing is that I don't explicitly cache the tempTables, and I don't
> really want to because I'll only run a single query on each tempTable. So I
> expect the SQL query processor to operate directly on the underlying
> key-value RDD, and my concern is that this may be inefficient.
>
>
>> Well that is all you can do.
>>
>
> Ok, thanks - that's really what I wanted to get confirmation of.
>
>
>> Bear in mind that these tempTables are immutable and I do not know any
>> way of dropping tempTable to free more memory.
>>
>
> I'm assuming there won't be any (significant) memory overhead of
> registering the temp tables as long as I don't explicitly cache them. Am I
> wrong? In any case I'll be calling sqlContext.dropTempTable once the query
> has completed, which according to the documentation should also free up
> memory.
>
> Cheers,
> Michael
>


Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Michael David Pedersen
Hi Mich,

Thank you again for your reply.

As I see you are caching the table already sorted
>
> val keyValRDDSorted = keyValRDD.sortByKey().cache
>
> and the next stage is you are creating multiple tempTables (different
> ranges) that cache a subset of rows already cached in RDD. The data stored
> in tempTable is in Hive columnar format (I assume that means ORC format)
>

But the thing is that I don't explicitly cache the tempTables, and I don't
really want to because I'll only run a single query on each tempTable. So I
expect the SQL query processor to operate directly on the underlying
key-value RDD, and my concern is that this may be inefficient.


> Well that is all you can do.
>

Ok, thanks - that's really what I wanted to get confirmation of.


> Bear in mind that these tempTables are immutable and I do not know any way
> of dropping tempTable to free more memory.
>

I'm assuming there won't be any (significant) memory overhead of
registering the temp tables as long as I don't explicitly cache them. Am I
wrong? In any case I'll be calling sqlContext.dropTempTable once the query
has completed, which according to the documentation should also free up
memory.

Cheers,
Michael


Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Mich Talebzadeh
Hi Michael,

As I see you are caching the table already sorted

val keyValRDDSorted = keyValRDD.sortByKey().cache

and the next stage is you are creating multiple tempTables (different
ranges) that cache a subset of rows already cached in RDD. The data stored
in tempTable is in Hive columnar format (I assume that means ORC format)

Well that is all you can do. Bear in mind that these tempTables are
immutable and I do not know any way of dropping tempTable to free more
memory.

Depending on the size of the main table, caching the whole table may
require a lot of memory, but you can see this in the UI Storage page.
An alternative is to use persist(StorageLevel.MEMORY_AND_DISK_SER) to get a
mix of cached and on-disk partitions.
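
Something like this, reusing the keyValRDD from your snippet above (a sketch
only):

import org.apache.spark.storage.StorageLevel

val keyValRDDSorted = keyValRDD.sortByKey()
  .persist(StorageLevel.MEMORY_AND_DISK_SER)   // spills partitions to disk when memory is short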

HTH







Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 October 2016 at 10:55, Michael David Pedersen <
michael.d.peder...@googlemail.com> wrote:

> Hi Mich,
>
> Thank you for your quick reply!
>
> What type of table is the underlying table? Is it Hbase, Hive ORC or what?
>>
>
> It is a custom datasource, but ultimately backed by HBase.
>
>
>> By Key you mean a UNIQUE ID or something similar and then you do multiple
>> scans on the tempTable which stores data using in-memory columnar format.
>>
>
> The key is a unique ID, yes. But note that I don't actually do multiple
> scans on the same temp table: I create a new temp table for every query I
> want to run, because each query will be based on a different key range. The
> caching is at the level of the full key-value RDD.
>
> If I did instead cache the temp table, I don't see a way of exploiting key
> ordering for key range filters?
>
>
>> That is the optimisation of tempTable storage as far as I know.
>>
>
> So it seems to me that my current solution won't be using this
> optimisation, as I'm caching the RDD rather than the temp table.
>
>
>> Have you tried it using predicate push-down on the underlying table
>> itself?
>>
>
> No, because I essentially want to load the entire table into memory before
> doing any queries. At that point I have nothing to push down.
>
> Cheers,
> Michael
>


Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Michael David Pedersen
Hi Mich,

Thank you for your quick reply!

What type of table is the underlying table? Is it Hbase, Hive ORC or what?
>

It is a custom datasource, but ultimately backed by HBase.


> By Key you mean a UNIQUE ID or something similar and then you do multiple
> scans on the tempTable which stores data using in-memory columnar format.
>

The key is a unique ID, yes. But note that I don't actually do multiple
scans on the same temp table: I create a new temp table for every query I
want to run, because each query will be based on a different key range. The
caching is at the level of the full key-value RDD.

If I did instead cache the temp table, I don't see a way of exploiting key
ordering for key range filters?


> That is the optimisation of tempTable storage as far as I know.
>

So it seems to me that my current solution won't be using this
optimisation, as I'm caching the RDD rather than the temp table.


> Have you tried it using predicate push-down on the underlying table itself?
>

No, because I essentially want to load the entire table into memory before
doing any queries. At that point I have nothing to push down.

Cheers,
Michael


Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Mich Talebzadeh
Hi Michael.

What type of table is the underlying table? Is it Hbase, Hive ORC or what?

By Key you mean a UNIQUE ID or something similar and then you do multiple
scans on the tempTable which stores data using in-memory columnar format.

That is the optimisation of tempTable storage as far as I know.

Have you tried it using predicate push-down on the underlying table itself?

HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 October 2016 at 10:06, Michael David Pedersen <
michael.d.peder...@googlemail.com> wrote:

> Hello,
>
> I've got a Spark SQL dataframe containing a "key" column. The queries I
> want to run start by filtering on the key range. My question in outline: is
> it possible to sort the dataset by key so as to do efficient key range
> filters, before subsequently running a more complex SQL query?
>
> I'm aware that such efficient filtering is possible for key-value RDDs,
> i.e. RDDs over Tuple2, using PairRDDFunctions. My workflow currently looks
> as follows:
>
> // Create a dataframe
> val df: DataFrame = sqlContext.sql("SELECT * FROM ...")
> val keyValRDD = df.rdd.map( (r: Row) => (r.getAs[String]("key"), r) )
>
> // Sort by key - and cache.
> val keyValRDDSorted = keyValRDD.sortByKey().cache
>
> // Define a function to run SQL query on a range.
> def queryRange(lower: String, upper: String, sql: String, tableName:
> String) = {
> val rangeRDD = keyValRDDSorted.filterByRange(lower, upper)
> val rangeDF = sqlContext.createDataFrame(rangeRDD.map{ _._2 },
> df.schema)
> rangeDF.createTempView(tableName)
> sqlContext.sql(sql)
> }
>
> // Invoke multiple times.
> queryRange(...)
> queryRange(...)
> ...
>
> This works, and is efficient in that only the partitions containing the
> relevant key range are processed. However, I understand that Spark SQL uses
> an optimised storage format as compared to plain RDDs. The above workflow
> can't take advantage of this, as it is the key-value RDD that is cached.
>
> So, my specific question: Is there a more efficient way of achieving the
> desired result?
>
> Any pointers would be much appreciated.
>
> Many thanks,
> Michael
>
> PS: This question was also asked on StackOverflow -
> http://stackoverflow.com/questions/40129411/efficient-
> filtering-on-spark-sql-dataframes-with-ordered-keys.
>


Re: PostgresSql queries vs spark sql

2016-10-23 Thread Selvam Raman
I found it. We can use pivot, which is similar to crosstab
in Postgres.

Thank you.
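
A rough Spark equivalent of the crosstab in "Example 2" below, assuming the
same data lives in a DataFrame ct with columns rowid, attribute and value:

import org.apache.spark.sql.functions.first

val wide = ct.groupBy("rowid")
  .pivot("attribute", Seq("att1", "att2", "att3", "att4"))  // listing the values avoids an extra scan
  .agg(first("value"))

wide.show()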
On Oct 17, 2016 10:00 PM, "Selvam Raman"  wrote:

> Hi,
>
> Please share some ideas if you have worked on this before.
> How can I implement the Postgres CROSSTAB function in Spark?
>
> Postgres Example
>
> Example 1:
>
> SELECT mthreport.*
>   FROM
>   *crosstab*('SELECT i.item_name::text As row_name, 
> to_char(if.action_date, ''mon'')::text As bucket,
>   SUM(if.num_used)::integer As bucketvalue
>   FROM inventory As i INNER JOIN inventory_flow As if
>   ON i.item_id = if.item_id
> AND action_date BETWEEN date ''2007-01-01'' and date ''2007-12-31 
> 23:59''
>   GROUP BY i.item_name, to_char(if.action_date, ''mon''), 
> date_part(''month'', if.action_date)
>   ORDER BY i.item_name',
>   'SELECT to_char(date ''2007-01-01'' + (n || '' month'')::interval, 
> ''mon'') As short_mname
>   FROM generate_series(0,11) n')
>   As mthreport(item_name text, jan integer, feb integer, mar 
> integer,
>   apr integer, may integer, jun integer, jul integer,
>   aug integer, sep integer, oct integer, nov integer,
>   dec integer)
>
> The output of the above crosstab looks as follows:
> [image: crosstab source_sql cat_sql example]
>
> Example 2:
>
> CREATE TABLE ct(id SERIAL, rowid TEXT, attribute TEXT, value TEXT);
> INSERT INTO ct(rowid, attribute, value) VALUES('test1','att1','val1');
> INSERT INTO ct(rowid, attribute, value) VALUES('test1','att2','val2');
> INSERT INTO ct(rowid, attribute, value) VALUES('test1','att3','val3');
> INSERT INTO ct(rowid, attribute, value) VALUES('test1','att4','val4');
> INSERT INTO ct(rowid, attribute, value) VALUES('test2','att1','val5');
> INSERT INTO ct(rowid, attribute, value) VALUES('test2','att2','val6');
> INSERT INTO ct(rowid, attribute, value) VALUES('test2','att3','val7');
> INSERT INTO ct(rowid, attribute, value) VALUES('test2','att4','val8');
>
> SELECT *
> FROM crosstab(
>   'select rowid, attribute, value
>from ct
>where attribute = ''att2'' or attribute = ''att3''
>order by 1,2')
> AS ct(row_name text, category_1 text, category_2 text, category_3 text);
>
>  row_name | category_1 | category_2 | category_3
> --+++
>  test1| val2   | val3   |
>  test2| val6   | val7   |
>
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-14 Thread Cody Koeninger
I can't be sure, no.

On Fri, Oct 14, 2016 at 3:06 AM, Julian Keppel
 wrote:
> Okay, thank you! Can you say, when this feature will be released?
>
> 2016-10-13 16:29 GMT+02:00 Cody Koeninger :
>>
>> As Sean said, it's unreleased.  If you want to try it out, build spark
>>
>> http://spark.apache.org/docs/latest/building-spark.html
>>
>> The easiest way to include the jar is probably to use mvn install to
>> put it in your local repository, then link it in your application's
>> mvn or sbt build file as described in the docs you linked.
>>
>>
>> On Thu, Oct 13, 2016 at 3:24 AM, JayKay 
>> wrote:
>> > I want to work with the Kafka integration for structured streaming. I
>> > use
>> > Spark version 2.0.0. and I start the spark-shell with:
>> >
>> > spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.0
>> >
>> > As described here:
>> >
>> > https://github.com/apache/spark/blob/master/docs/structured-streaming-kafka-integration.md
>> >
>> > But I get a unresolved dependency error ("unresolved dependency:
>> > org.apache.spark#spark-sql-kafka-0-10_2.11;2.0.0: not found"). So it
>> > seems
>> > not to be available via maven or spark-packages.
>> >
>> > How can I accesss this package? Or am I doing something wrong/missing?
>> >
>> > Thank you for you help.
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Want-to-test-spark-sql-kafka-but-get-unresolved-dependency-error-tp27891.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-14 Thread Julian Keppel
Okay, thank you! Can you say when this feature will be released?

2016-10-13 16:29 GMT+02:00 Cody Koeninger :

> As Sean said, it's unreleased.  If you want to try it out, build spark
>
> http://spark.apache.org/docs/latest/building-spark.html
>
> The easiest way to include the jar is probably to use mvn install to
> put it in your local repository, then link it in your application's
> mvn or sbt build file as described in the docs you linked.
>
>
> On Thu, Oct 13, 2016 at 3:24 AM, JayKay 
> wrote:
> > I want to work with the Kafka integration for structured streaming. I use
> > Spark version 2.0.0. and I start the spark-shell with:
> >
> > spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.0
> >
> > As described here:
> > https://github.com/apache/spark/blob/master/docs/
> structured-streaming-kafka-integration.md
> >
> > But I get a unresolved dependency error ("unresolved dependency:
> > org.apache.spark#spark-sql-kafka-0-10_2.11;2.0.0: not found"). So it
> seems
> > not to be available via maven or spark-packages.
> >
> > How can I accesss this package? Or am I doing something wrong/missing?
> >
> > Thank you for you help.
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Want-to-test-spark-sql-kafka-but-get-
> unresolved-dependency-error-tp27891.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>


Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-13 Thread Cody Koeninger
As Sean said, it's unreleased.  If you want to try it out, build spark

http://spark.apache.org/docs/latest/building-spark.html

The easiest way to include the jar is probably to use mvn install to
put it in your local repository, then link it in your application's
mvn or sbt build file as described in the docs you linked.
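
In sbt terms that would look roughly like the snippet below; the version
string is whatever your local build of master produced, shown here only as an
example:

// build.sbt
resolvers += Resolver.mavenLocal
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.1.0-SNAPSHOT"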


On Thu, Oct 13, 2016 at 3:24 AM, JayKay  wrote:
> I want to work with the Kafka integration for structured streaming. I use
> Spark version 2.0.0. and I start the spark-shell with:
>
> spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.0
>
> As described here:
> https://github.com/apache/spark/blob/master/docs/structured-streaming-kafka-integration.md
>
> But I get a unresolved dependency error ("unresolved dependency:
> org.apache.spark#spark-sql-kafka-0-10_2.11;2.0.0: not found"). So it seems
> not to be available via maven or spark-packages.
>
> How can I accesss this package? Or am I doing something wrong/missing?
>
> Thank you for you help.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Want-to-test-spark-sql-kafka-but-get-unresolved-dependency-error-tp27891.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: OOM when running Spark SQL by PySpark on Java 8

2016-10-13 Thread Shady Xu
All nodes of my YARN cluster are running on Java 7, but I submit the job
from a Java 8 client.

I realised I run the job in yarn-cluster mode, and that's why setting
'--driver-java-options' is effective. Now the problem is: why does submitting a
job from a Java 8 client to a Java 7 cluster cause a PermGen OOM?

2016-10-13 17:30 GMT+08:00 Sean Owen :

> You can specify it; it just doesn't do anything but cause a warning in
> Java 8. It won't work in general to have such a tiny PermGen. If it's
> working it means you're on Java 8 because it's ignored. You should set
> MaxPermSize if anything, not PermSize. However the error indicates you are
> not using Java 8 everywhere on your cluster, and that's a potentially
> bigger problem.
>
> On Thu, Oct 13, 2016 at 10:26 AM Shady Xu  wrote:
>
>> Solved the problem by specifying the PermGen size when submitting the job
>> (even to just a few MB).
>>
>> Seems Java 8 has removed the Permanent Generation space, thus
>> corresponding JVM arguments are ignored.  But I can still
>> use --driver-java-options "-XX:PermSize=80M -XX:MaxPermSize=100m" to
>> specify them when submitting the Spark job, which is wried. I don't know
>> whether it has anything to do with py4j as I am not familiar with it.
>>
>> 2016-10-13 17:00 GMT+08:00 Shady Xu :
>>
>> Hi,
>>
>> I have a problem when running Spark SQL by PySpark on Java 8. Below is
>> the log.
>>
>>
>> 16/10/13 16:46:40 INFO spark.SparkContext: Starting job: sql at 
>> NativeMethodAccessorImpl.java:-2
>> Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
>> PermGen space
>>  at java.lang.ClassLoader.defineClass1(Native Method)
>>  at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>>  at 
>> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>>  at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>>  at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>  at java.security.AccessController.doPrivileged(Native Method)
>>  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>  at 
>> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:857)
>>  at 
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1630)
>>  at 
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
>>  at 
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
>>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> Exception in thread "shuffle-server-2" java.lang.OutOfMemoryError: PermGen 
>> space
>> Exception in thread "shuffle-server-4" java.lang.OutOfMemoryError: PermGen 
>> space
>> Exception in thread "threadDeathWatcher-2-1" java.lang.OutOfMemoryError: 
>> PermGen space
>>
>>
>> I tried to increase the driver memory and didn't help. However, things are
>> ok when I run the same code after switching to Java 7. I also find it ok to
>> run the SparkPi example on Java 8. So I believe the problem stays with
>> PySpark rather than the Spark core.
>>
>>
>> I am using Spark 2.0.1 and run the program in YARN cluster mode. Any
>> ideas are appreciated.
>>
>>
>>


Re: OOM when running Spark SQL by PySpark on Java 8

2016-10-13 Thread Sean Owen
You can specify it; it just doesn't do anything but cause a warning in Java
8. It won't work in general to have such a tiny PermGen. If it's working it
means you're on Java 8 because it's ignored. You should set MaxPermSize if
anything, not PermSize. However the error indicates you are not using Java
8 everywhere on your cluster, and that's a potentially bigger problem.
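
If you do need to bump it on a Java 7 cluster, something along these lines
(the sizes are only illustrative):

spark-submit \
  --driver-java-options "-XX:MaxPermSize=256m" \
  --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=256m \
  your_job.py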

On Thu, Oct 13, 2016 at 10:26 AM Shady Xu  wrote:

> Solved the problem by specifying the PermGen size when submitting the job
> (even to just a few MB).
>
> Seems Java 8 has removed the Permanent Generation space, thus
> corresponding JVM arguments are ignored.  But I can still
> use --driver-java-options "-XX:PermSize=80M -XX:MaxPermSize=100m" to
> specify them when submitting the Spark job, which is wried. I don't know
> whether it has anything to do with py4j as I am not familiar with it.
>
> 2016-10-13 17:00 GMT+08:00 Shady Xu :
>
> Hi,
>
> I have a problem when running Spark SQL by PySpark on Java 8. Below is the
> log.
>
>
> 16/10/13 16:46:40 INFO spark.SparkContext: Starting job: sql at 
> NativeMethodAccessorImpl.java:-2
> Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
> PermGen space
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:857)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1630)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> Exception in thread "shuffle-server-2" java.lang.OutOfMemoryError: PermGen 
> space
> Exception in thread "shuffle-server-4" java.lang.OutOfMemoryError: PermGen 
> space
> Exception in thread "threadDeathWatcher-2-1" java.lang.OutOfMemoryError: 
> PermGen space
>
>
> I tried to increase the driver memory and didn't help. However, things are ok
> when I run the same code after switching to Java 7. I also find it ok to run
> the SparkPi example on Java 8. So I believe the problem stays with PySpark
> rather than the Spark core.
>
>
> I am using Spark 2.0.1 and run the program in YARN cluster mode. Any
> ideas are appreciated.
>
>
>


Re: OOM when running Spark SQL by PySpark on Java 8

2016-10-13 Thread Shady Xu
Solved the problem by specifying the PermGen size when submitting the job
(even to just a few MB).

It seems Java 8 has removed the Permanent Generation space, so the corresponding
JVM arguments are ignored. But I can still use --driver-java-options
"-XX:PermSize=80M -XX:MaxPermSize=100m" to specify them when submitting the
Spark job, which is weird. I don't know whether it has anything to do with
py4j, as I am not familiar with it.

2016-10-13 17:00 GMT+08:00 Shady Xu :

> Hi,
>
> I have a problem when running Spark SQL by PySpark on Java 8. Below is the
> log.
>
>
> 16/10/13 16:46:40 INFO spark.SparkContext: Starting job: sql at 
> NativeMethodAccessorImpl.java:-2
> Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
> PermGen space
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:857)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1630)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> Exception in thread "shuffle-server-2" java.lang.OutOfMemoryError: PermGen 
> space
> Exception in thread "shuffle-server-4" java.lang.OutOfMemoryError: PermGen 
> space
> Exception in thread "threadDeathWatcher-2-1" java.lang.OutOfMemoryError: 
> PermGen space
>
>
> I tried to increase the driver memory and didn't help. However, things are ok
> when I run the same code after switching to Java 7. I also find it ok to run
> the SparkPi example on Java 8. So I believe the problem stays with PySpark
> rather than the Spark core.
>
>
> I am using Spark 2.0.1 and run the program in YARN cluster mode. Any
> ideas are appreciated.
>
>


Re: OOM when running Spark SQL by PySpark on Java 8

2016-10-13 Thread Sean Owen
The error doesn't say you're out of memory, but says you're out of PermGen.
If you see this, you aren't running Java 8 AFAIK, because 8 has no PermGen.
But if you're running Java 7, and you go investigate what this error means,
you'll find you need to increase PermGen. This is mentioned in the Spark
docs too, as you need to increase this when building on Java 7.

On Thu, Oct 13, 2016 at 10:00 AM Shady Xu  wrote:

> Hi,
>
> I have a problem when running Spark SQL by PySpark on Java 8. Below is the
> log.
>
>
> 16/10/13 16:46:40 INFO spark.SparkContext: Starting job: sql at 
> NativeMethodAccessorImpl.java:-2
> Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
> PermGen space
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:857)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1630)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> Exception in thread "shuffle-server-2" java.lang.OutOfMemoryError: PermGen 
> space
> Exception in thread "shuffle-server-4" java.lang.OutOfMemoryError: PermGen 
> space
> Exception in thread "threadDeathWatcher-2-1" java.lang.OutOfMemoryError: 
> PermGen space
>
>
> I tried to increase the driver memory, but it didn't help. However, things are
> OK when I run the same code after switching to Java 7. I also find it OK to run
> the SparkPi example on Java 8. So I believe the problem lies with PySpark
> rather than with Spark core.
>
>
> I am using Spark 2.0.1 and run the program in YARN cluster mode. Any ideas
> are appreciated.
>
>


Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-13 Thread Mich Talebzadeh
add --jars /spark-streaming-kafka_2.10-1.5.1.jar

(may need to download the jar file or any newer version)


to spark-shell.

I also have spark-streaming-kafka-assembly_2.10-1.6.1.jar on the --jars
list.

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 13 October 2016 at 09:24, JayKay  wrote:

> I want to work with the Kafka integration for structured streaming. I use
> Spark version 2.0.0. and I start the spark-shell with:
>
> spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.0
>
> As described here:
> https://github.com/apache/spark/blob/master/docs/
> structured-streaming-kafka-integration.md
>
> But I get an unresolved dependency error ("unresolved dependency:
> org.apache.spark#spark-sql-kafka-0-10_2.11;2.0.0: not found"). So it seems
> not to be available via maven or spark-packages.
>
> How can I access this package? Or am I doing something wrong / missing something?
>
> Thank you for your help.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Want-to-test-spark-sql-kafka-but-get-
> unresolved-dependency-error-tp27891.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-13 Thread Sean Owen
I don't believe that's been released yet. It looks like it was merged into
branches about a week ago. You're looking at unreleased docs too - have a
look at http://spark.apache.org/docs/latest/ for the latest released docs.
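
Once a release that includes it is out, the spark-shell invocation from the
original mail should work with the released version substituted in, roughly
like this (the version below is only an illustration - check the released docs
for the exact coordinates):

  spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0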

On Thu, Oct 13, 2016 at 9:24 AM JayKay  wrote:

> I want to work with the Kafka integration for structured streaming. I use
> Spark version 2.0.0. and I start the spark-shell with:
>
> spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.0
>
> As described here:
>
> https://github.com/apache/spark/blob/master/docs/structured-streaming-kafka-integration.md
>
> But I get an unresolved dependency error ("unresolved dependency:
> org.apache.spark#spark-sql-kafka-0-10_2.11;2.0.0: not found"). So it seems
> not to be available via maven or spark-packages.
>
> How can I access this package? Or am I doing something wrong / missing something?
>
> Thank you for your help.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Want-to-test-spark-sql-kafka-but-get-unresolved-dependency-error-tp27891.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset

2016-10-07 Thread kant kodali
perfect! That fixes it all!
 





On Fri, Oct 7, 2016 1:29 AM, Denis Bolshakov bolshakov.de...@gmail.com
wrote:
You need to have spark-sql on the classpath; right now you are missing it.
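
For example, with sbt that would be something along the lines of (assuming
Spark 2.0.0 to match the jars listed below; with Maven, add the corresponding
spark-sql_2.11 artifact):

  libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"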


On 7 Oct 2016 at 11:12, "kant kodali"  wrote:
Here are the jar files on my classpath after doing a grep for spark jars.
org.apache.spark/spark-core_2.11/2.0.0/c4d04336c142f10eb7e172155f022f86b6d11dd3/spark-core_2.11-2.0.0.jar
org.apache.spark/spark-streaming_2.11/2.0.0/7227cbd39f5952b0ed3579bc78463bcc318ecd2b/spark-streaming_2.11-2.0.0.jar
com.datastax.spark/spark-cassandra-connector_2.11/2.0.0-M3/d38ac36dde076e3364f1024985754bce84bd39d/spark-cassandra-connector_2.11-2.0.0-M3.jar
org.apache.spark/spark-launcher_2.11/2.0.0/9c3e1bd84ccb099e86ea232f5acd8fec1a61e291/spark-launcher_2.11-2.0.0.jar
org.apache.spark/spark-network-common_2.11/2.0.0/b451dae899ee8138e96319528eed64f7e849dbe2/spark-network-common_2.11-2.0.0.jar
org.apache.spark/spark-network-shuffle_2.11/2.0.0/233c036e88761424212508b2a6a55633a3cf4ec8/spark-network-shuffle_2.11-2.0.0.jar
org.apache.spark/spark-unsafe_2.11/2.0.0/9f8682d4c83ce32f08fea067c2e22aaabca27d86/spark-unsafe_2.11-2.0.0.jar
org.apache.spark/spark-tags_2.11/2.0.0/7f84a46b1e60c1981e47cae05c462fed65217eff/spark-tags_2.11-2.0.0.jar
org.spark-project.spark/unused/1.0.0/205fe37a2fade6ce6dfcf8eff57ed21a4a1c22af/unused-1.0.0.jar











On Wed, Oct 5, 2016 4:03 PM, Jakob Odersky ja...@odersky.com
wrote:
Ok, that rules out a whole class of errors. Let's continue the diagnostic:
- How are you submitting the application to spark?
- How/which version of spark are you using within your build tool?
- Could you have dirty ivy or maven caches that use some locally-built version
of spark?

On Wed, Oct 5, 2016 at 3:35 PM, kant kodali   wrote:
I am running locally so they all are on one host

 





On Wed, Oct 5, 2016 3:12 PM, Jakob Odersky ja...@odersky.com
wrote:
Are all spark and scala versions the same? By "all" I mean the master, worker
and driver instances.

Re: java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset

2016-10-05 Thread kant kodali

I am running locally so they all are on one host






On Wed, Oct 5, 2016 3:12 PM, Jakob Odersky ja...@odersky.com
wrote:
Are all spark and scala versions the same? By "all" I mean the master, worker
and driver instances.
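
A quick way to check that from each shell or application, just as a suggestion
(not from the original thread):

  sc.version                            // Spark version of the running context
  scala.util.Properties.versionString   // Scala version of the driver/REPL JVM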

Re: ArrayType support in Spark SQL

2016-09-25 Thread Koert Kuipers
not pretty but this works:

import org.apache.spark.sql.functions.udf
df.withColumn("array", udf(() => Seq(1, 2, 3)).apply())
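
A slightly fuller, self-contained sketch of the same workaround (assuming a
Spark 2.x shell with a SparkSession named spark; on 1.x use sqlContext and its
implicits instead):

  import org.apache.spark.sql.functions.udf
  import spark.implicits._

  val df = Seq("a", "b").toDF("id")
  // wrap the constant Seq in a zero-argument UDF; calling .apply() yields a Column
  val withArray = df.withColumn("array", udf(() => Seq(1, 2, 3)).apply())
  withArray.show()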


On Sun, Sep 25, 2016 at 6:13 PM, Jason White 
wrote:

> It seems that `functions.lit` doesn't support ArrayTypes. To reproduce:
>
> org.apache.spark.sql.functions.lit(2 :: 1 :: Nil)
>
> java.lang.RuntimeException: Unsupported literal type class
> scala.collection.immutable.$colon$colon List(2, 1)
>   at
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(
> literals.scala:59)
>   at org.apache.spark.sql.functions$.lit(functions.scala:101)
>   ... 48 elided
>
> This is about the first thing I tried to do with ArrayTypes in Spark SQL.
> Is
> this usage supported, or on the roadmap?
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/ArrayType-support-
> in-Spark-SQL-tp19063.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Reporting errors from spark sql

2016-08-21 Thread Jacek Laskowski
Hi,

See 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala#L65
to learn how Spark SQL parses SQL texts. It could give you a way out.
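
Also, in Spark 2.x parse errors surface as
org.apache.spark.sql.catalyst.parser.ParseException, a subclass of
AnalysisException that carries the position as Options, so a rough sketch like
the following may already be enough (assuming a SparkSession named spark; the
fields can still be empty for some errors):

  import org.apache.spark.sql.catalyst.parser.ParseException

  try {
    spark.sql("SELEC 1")   // deliberately invalid SQL
  } catch {
    case e: ParseException =>
      // line and startPosition are Option[Int]
      println(s"line=${e.line.getOrElse(-1)}, column=${e.startPosition.getOrElse(-1)}")
      println(e.getMessage)
  }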

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Aug 18, 2016 at 3:14 PM, yael aharon  wrote:
> Hello,
> I am working on an SQL editor which is powered by spark SQL. When the SQL is
> not valid, I would like to provide the user with a line number and column
> number where the first error occurred. I am having a hard time finding a
> mechanism that will give me that information programmatically.
>
> Most of the time, if an erroneous SQL statement is used, I am getting a
> RuntimeException, where line number and column number are implicitly
> embedded within the text of the message, but it is really error prone to
> parse the message text and count the number of spaces prior to the '^'
> symbol...
>
> Sometimes, AnalysisException is used, but when I try to extract the line and
> startPosition from it, they are always empty.
>
> Any help would be greatly appreciated.
> thanks!

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: read parquetfile in spark-sql error

2016-07-25 Thread cj
Thank you, but I hope to read the parquet file as a table in spark-sql, not
in a Java (or Scala) program.


-- Original --
From:  "Kabeer Ahmed";<kabeer.ah...@outlook.com>;
Date:  Mon, Jul 25, 2016 10:36 PM
To:  "cj"<124411...@qq.com>; 
Cc:  "user"<user@spark.apache.org>; "lian.cs.zju"<lian.cs@gmail.com>; 
Subject:  Re: read parquetfile in spark-sql error



 I hope the below sample helps you:  
 
 val parquetDF = hiveContext.read.parquet("hdfs://.parquet") 
 parquetDF.registerTempTable("parquetTable")
 sql("SELECT * FROM parquetTable").collect().foreach(println)
 
 
 
 Kabeer.
 Sent from
 
  Nylas N1, the extensible, open source mail client.
 
  On Jul 25 2016, at 12:09 pm, cj <124411...@qq.com> wrote: 
  hi,all:
 
 
   I use spark1.6.1 as my work env.
   
   when I saved the following content as test1.sql file :
   
 CREATE TEMPORARY TABLE parquetTable
 USING org.apache.spark.sql.parquet
 OPTIONS (
   path "examples/src/main/resources/people.parquet"
 )

 SELECT * FROM parquetTable
 
 and use bin/spark-sql to run it 
(/home/bae/dataplatform/spark-1.6.1/bin/spark-sql  --properties-file 
./spark-dataplatform.conf -f test1.sql ),I encountered a grammar error.
 
 
 
 
  SET hive.support.sql11.reserved.keywords=false
 SET spark.sql.hive.version=1.2.1
 SET spark.sql.hive.version=1.2.1
 NoViableAltException(280@[192:1: tableName : (db= identifier DOT tab= 
identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) 
);])
 at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
 at org.antlr.runtime.DFA.predict(DFA.java:116)
 at 
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4747)
 at 
org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:45918)
 at 
org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveParser.java:5029)
 at 
org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2640)
 at 
org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1650)
 at 
org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
 at 
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
 at 
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
 at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:276)
 at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:303)
 at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
 at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
 at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
 at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
 at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
 at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
 at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
 at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
 at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
 at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
 at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
 at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:295)
 at 
org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:66)
 at 
org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:66)
 at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:290)
 at 
org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(Clie

Re: read parquetfile in spark-sql error

2016-07-25 Thread Takeshi Yamamuro
Hi,

Seems your query is not consistent with the HQL syntax.
You'd be better off re-checking the definitions:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
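
One more thing worth trying, just a guess based on the trace falling through
to the Hive parser: terminate each statement in the -f script with a semicolon
so the CREATE and the SELECT are parsed as separate statements, e.g.:

  CREATE TEMPORARY TABLE parquetTable
  USING org.apache.spark.sql.parquet
  OPTIONS (
    path "examples/src/main/resources/people.parquet"
  );

  SELECT * FROM parquetTable;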

// maropu

On Mon, Jul 25, 2016 at 11:36 PM, Kabeer Ahmed 
wrote:

> I hope the below sample helps you:
>
> val parquetDF = hiveContext.read.parquet("hdfs://.parquet")
> parquetDF.registerTempTable("parquetTable")
> sql("SELECT * FROM parquetTable").collect().foreach(println)
>
> Kabeer.
> Sent from
>  Nylas N1
> ,
> the extensible, open source mail client.
> On Jul 25 2016, at 12:09 pm, cj <124411...@qq.com> wrote:
>
>> hi,all:
>>
>>   I use spark1.6.1 as my work env.
>>
>>   when I saved the following content as test1.sql file :
>>
>>
>> CREATE TEMPORARY TABLE parquetTable
>>
>> USING org.apache.spark.sql.parquet
>> OPTIONS (
>>   path "examples/src/main/resources/people.parquet")
>> SELECT * FROM parquetTable
>>
>>
>> and use bin/spark-sql to run it
>> (/home/bae/dataplatform/spark-1.6.1/bin/spark-sql  --properties-file
>> ./spark-dataplatform.conf -f test1.sql ),I encountered a grammar error.
>>
>>
>> SET hive.support.sql11.reserved.keywords=false
>> SET spark.sql.hive.version=1.2.1
>> SET spark.sql.hive.version=1.2.1
>> NoViableAltException(280@[192:1: tableName : (db= identifier DOT tab=
>> identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME
>> $tab) );])
>> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
>> at org.antlr.runtime.DFA.predict(DFA.java:116)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4747)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:45918)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveParser.java:5029)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2640)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1650)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
>> at
>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
>> at
>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
>> at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:276)
>> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:303)
>> at
>> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
>> at
>> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
>> at
>> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>> at
>> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>> at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>> at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>> at
>> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>> at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>> at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>> at
>> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>> at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>> at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>> at
>> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>> at
>> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>> at
>> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>> at
>> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
>> at
>> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
>> at
>> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
>> at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:295)
>> at
>> org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:66)
>> at
>> org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:66)
>> at
>> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:290)
>> at
>> 

Re: read parquetfile in spark-sql error

2016-07-25 Thread Kabeer Ahmed
I hope the below sample helps you:

val parquetDF = hiveContext.read.parquet("hdfs://.parquet")
parquetDF.registerTempTable("parquetTable")
sql("SELECT * FROM parquetTable").collect().foreach(println)

Kabeer.
Sent from
 Nylas 
N1,
 the extensible, open source mail client.

On Jul 25 2016, at 12:09 pm, cj <124411...@qq.com> wrote:
hi,all:

  I use spark1.6.1 as my work env.

  when I saved the following content as test1.sql file :


CREATE TEMPORARY TABLE parquetTable

USING org.apache.spark.sql.parquet
OPTIONS (
  path "examples/src/main/resources/people.parquet"
)

SELECT * FROM parquetTable

and use bin/spark-sql to run it 
(/home/bae/dataplatform/spark-1.6.1/bin/spark-sql  --properties-file 
./spark-dataplatform.conf -f test1.sql ),I encountered a grammar error.


SET hive.support.sql11.reserved.keywords=false
SET spark.sql.hive.version=1.2.1
SET spark.sql.hive.version=1.2.1
NoViableAltException(280@[192:1: tableName : (db= identifier DOT tab= 
identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) 
);])
at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
at org.antlr.runtime.DFA.predict(DFA.java:116)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4747)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:45918)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveParser.java:5029)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2640)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1650)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
at 
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
at 
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:276)
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:303)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:295)
at 
org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:66)
at 
org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:66)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:290)
at 
org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:237)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:236)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:279)
at org.apache.spark.sql.hive.HiveQLDialect.parse(HiveContext.scala:65)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at 

Re: XML Processing using Spark SQL

2016-05-12 Thread Mail.com
Hi Arun,

Could you try using StAX or JAXB?
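
For instance, a rough StAX-based sketch (untested against the real export; the
file name, case class and attribute list are placeholders based on the snippet
in your mail). It collects the attributes of each Record element on the driver
and then turns them into a DataFrame:

  import java.io.FileInputStream
  import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
  import scala.collection.mutable.ArrayBuffer

  case class HealthRecord(recordType: String, sourceName: String,
                          sourceVersion: String, value: String)

  val reader = XMLInputFactory.newInstance()
    .createXMLStreamReader(new FileInputStream("export.xml"))
  val records = ArrayBuffer[HealthRecord]()
  while (reader.hasNext) {
    if (reader.next() == XMLStreamConstants.START_ELEMENT &&
        reader.getLocalName == "Record") {
      def attr(name: String) = Option(reader.getAttributeValue(null, name)).getOrElse("")
      records += HealthRecord(attr("type"), attr("sourceName"),
                              attr("sourceVersion"), attr("value"))
    }
  }
  reader.close()

  // assuming a Spark 2.x shell with a SparkSession named spark; on 1.6 use sqlContext
  import spark.implicits._
  val df = records.toDF()
  df.createOrReplaceTempView("health_records")
  spark.sql("SELECT recordType, count(*) FROM health_records GROUP BY recordType").show()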

Thanks,
Pradeep

> On May 12, 2016, at 8:35 PM, Hyukjin Kwon  wrote:
> 
> Hi Arunkumar,
> 
> 
> I guess your records are self-closing ones.
> 
> There is an issue open here, https://github.com/databricks/spark-xml/issues/92
> 
> This is about XmlInputFormat.scala and it seems a bit tricky to handle the
> case, so I have left it open until now.
> 
> 
> Thanks!
> 
> 
> 2016-05-13 5:03 GMT+09:00 Arunkumar Chandrasekar :
>> Hello,
>> 
>> Greetings.
>> 
>> I'm trying to process an XML file exported from the Health Kit application using
>> Spark SQL for learning purposes. The sample record data looks like the below:
>> 
>>  > sourceVersion="9.3" device="HKDevice: 0x7896, name:iPhone, 
>> manufacturer:Apple, model:iPhone, hardware:iPhone7,2, software:9.3" 
>> unit="count" creationDate="2016-04-23 19:31:33 +0530" startDate="2016-04-23 
>> 19:00:20 +0530" endDate="2016-04-23 19:01:41 +0530" value="31"/>
>> 
>>  > sourceVersion="9.3.1" device="HKDevice: 0x85746, name:iPhone, 
>> manufacturer:Apple, model:iPhone, hardware:iPhone7,2, software:9.3.1" 
>> unit="count" creationDate="2016-04-24 05:45:00 +0530" startDate="2016-04-24 
>> 05:25:04 +0530" endDate="2016-04-24 05:25:24 +0530" value="10"/>.
>> 
>> I want to have the column name of my table as the field value like type, 
>> sourceName, sourceVersion and the row entries as their respective values 
>> like HKQuantityTypeIdentifierStepCount, Vizhi, 9.3.1,..
>> 
>> I took a look at the Spark-XML, but didn't get any information in my case 
>> (my xml is not well formed with the tags). Is there any other option to 
>> convert the record that I have mentioned above into a schema format for 
>> playing with Spark SQL?
>> 
>> Thanks in Advance.
>> 
>> Thank You,
>> Arun Chandrasekar
>> chan.arunku...@gmail.com
> 


Re: XML Processing using Spark SQL

2016-05-12 Thread Hyukjin Kwon
Hi Arunkumar,


I guess your records are self-closing ones.

There is an issue open here,
https://github.com/databricks/spark-xml/issues/92

This is about XmlInputFormat.scala and it seems a bit tricky to handle the
case, so I have left it open until now.


Thanks!


2016-05-13 5:03 GMT+09:00 Arunkumar Chandrasekar :

> Hello,
>
> Greetings.
>
> I'm trying to process an XML file exported from the Health Kit application
> using Spark SQL for learning purposes. The sample record data looks like the
> below:
>
>   sourceVersion="9.3" device="HKDevice: 0x7896, name:iPhone,
> manufacturer:Apple, model:iPhone, hardware:iPhone7,2, software:9.3"
> unit="count" creationDate="2016-04-23 19:31:33 +0530" startDate="2016-04-23
> 19:00:20 +0530" endDate="2016-04-23 19:01:41 +0530" value="31"/>
>
>   sourceVersion="9.3.1" device="HKDevice: 0x85746, name:iPhone,
> manufacturer:Apple, model:iPhone, hardware:iPhone7,2, software:9.3.1"
> unit="count" creationDate="2016-04-24 05:45:00 +0530" startDate="2016-04-24
> 05:25:04 +0530" endDate="2016-04-24 05:25:24 +0530" value="10"/>.
>
> I want to have the column name of my table as the field value like type,
> sourceName, sourceVersion and the row entries as their respective values
> like HKQuantityTypeIdentifierStepCount, Vizhi, 9.3.1,..
>
> I took a look at the Spark-XML ,
> but didn't get any information in my case (my xml is not well formed with
> the tags). Is there any other option to convert the record that I have
> mentioned above into a schema format for playing with Spark SQL?
>
> Thanks in Advance.
>
> *Thank You*,
> Arun Chandrasekar
> chan.arunku...@gmail.com
>


Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
t AND
> s.ad =
> >>>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >=
> '2016-01-03'").count()
> >>>> res9: Long = 23809
> >>>>
> >>>>
> >>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> d.account
> >>>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
> >>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
> s.ad =
> >>>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >=
> '2016-01-03'").count()
> >>>> res10: Long = 23809
> >>>>
> >>>>
> >>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> d.account
> >>>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
> >>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
> s.ad =
> >>>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >=
> '2016-01-03'").count()
> >>>> res11: Long = 23809
> >>>>
> >>>>
> >>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> d.account
> >>>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
> >>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
> s.ad =
> >>>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >=
> '2016-01-03'").count()
> >>>> res12: Long = 23809
> >>>>
> >>>>
> >>>>
> >>>> From my results above, we notice that the counts of distinct values
> based
> >>>> on the join criteria and filter criteria for each individual table is
> >>>> located at res6 and res7.  My question is why is the outer join
> producing
> >>>> less rows than the smallest table; if there are no matches it should
> still
> >>>> bring in that row as part of the outer join.  For the full and right
> outer
> >>>> join I am expecting to see a minimum of res6 rows, but I get less, is
> there
> >>>> something specific that I am missing here?  I am expecting that the
> full
> >>>> outer join would give me the union of the two table sets so I am
> expecting
> >>>> at least 42533 rows not 23809.
> >>>>
> >>>>
> >>>> Gourav,
> >>>>
> >>>> I just ran this result set on a new session with slightly newer
> data...
> >>>> still seeing those results.
> >>>>
> >>>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> KP
> >>>>
> >>>>
> >>>> On Mon, May 2, 2016 at 11:16 PM, Davies Liu <dav...@databricks.com>
> >>>> wrote:
> >>>>>
> >>>>> as @Gourav said, all the join with different join type show the same
> >>>>> results,
> >>>>> which meant that all the rows from left could match at least one row
> >>>>> from right,
> >>>>> all the rows from right could match at least one row from left, even
> >>>>> the number of row from left does not equal that of right.
> >>>>>
> >>>>> This is correct result.
> >>>>>
> >>>>> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <kpe...@gmail.com> wrote:
> >>>>>> Yong,
> >>>>>>
> >>>>>> Sorry, let explain my deduction; it is going be difficult to get a
> >>>>>> sample
> >>>>>> data out since the dataset I am using is proprietary.
> >>>>>>
> >>>>>> From the above set queries (ones mentioned in above comments) both
> >>>>>> inner and
> >>>>>> outer join are producing the same counts.  They are basically
> pulling
> >>>>>> out
> >>>>>> selected columns from the query, but there is no roll up happening
> or
> >>>>>> anything that would possible make it suspicious that there is any
> >>>>>> difference
> >>>>>> besides the type of joins.  The tables are matched based on thr

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Michael Segel
d) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>>> res11: Long = 23809
>>>> 
>>>> 
>>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>>>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
>>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad 
>>>> =
>>>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>>> res12: Long = 23809
>>>> 
>>>> 
>>>> 
>>>> From my results above, we notice that the counts of distinct values based
>>>> on the join criteria and filter criteria for each individual table is
>>>> located at res6 and res7.  My question is why is the outer join producing
>>>> less rows than the smallest table; if there are no matches it should still
>>>> bring in that row as part of the outer join.  For the full and right outer
>>>> join I am expecting to see a minimum of res6 rows, but I get less, is there
>>>> something specific that I am missing here?  I am expecting that the full
>>>> outer join would give me the union of the two table sets so I am expecting
>>>> at least 42533 rows not 23809.
>>>> 
>>>> 
>>>> Gourav,
>>>> 
>>>> I just ran this result set on a new session with slightly newer data...
>>>> still seeing those results.
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> KP
>>>> 
>>>> 
>>>> On Mon, May 2, 2016 at 11:16 PM, Davies Liu <dav...@databricks.com>
>>>> wrote:
>>>>> 
>>>>> as @Gourav said, all the join with different join type show the same
>>>>> results,
>>>>> which meant that all the rows from left could match at least one row
>>>>> from right,
>>>>> all the rows from right could match at least one row from left, even
>>>>> the number of row from left does not equal that of right.
>>>>> 
>>>>> This is correct result.
>>>>> 
>>>>> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <kpe...@gmail.com> wrote:
>>>>>> Yong,
>>>>>> 
>>>>>> Sorry, let explain my deduction; it is going be difficult to get a
>>>>>> sample
>>>>>> data out since the dataset I am using is proprietary.
>>>>>> 
>>>>>> From the above set queries (ones mentioned in above comments) both
>>>>>> inner and
>>>>>> outer join are producing the same counts.  They are basically pulling
>>>>>> out
>>>>>> selected columns from the query, but there is no roll up happening or
>>>>>> anything that would possible make it suspicious that there is any
>>>>>> difference
>>>>>> besides the type of joins.  The tables are matched based on three keys
>>>>>> that
>>>>>> are present in both tables (ad, account, and date), on top of this
>>>>>> they are
>>>>>> filtered by date being above 2016-01-03.  Since all the joins are
>>>>>> producing
>>>>>> the same counts, the natural suspicions is that the tables are
>>>>>> identical,
>>>>>> but I when I run the following two queries:
>>>>>> 
>>>>>> scala> sqlContext.sql("select * from swig_pin_promo_lt where date
>>>>>>> ='2016-01-03'").count
>>>>>> 
>>>>>> res14: Long = 34158
>>>>>> 
>>>>>> scala> sqlContext.sql("select * from dps_pin_promo_lt where date
>>>>>>> ='2016-01-03'").count
>>>>>> 
>>>>>> res15: Long = 42693
>>>>>> 
>>>>>> 
>>>>>> The above two queries filter out the data based on date used by the
>>>>>> joins of
>>>>>> 2016-01-03 and you can see the row count between the two tables are
>>>>>> different, which is why I am suspecting something is wrong with the
>>>>>> outer
>>>>>> joins in spark sql, because in this situation the right and outer
>>>>>> joins may
>>>>>> produce the same

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
> keys
> >>>> > that
> >>>> > are present in both tables (ad, account, and date), on top of this
> >>>> > they are
> >>>> > filtered by date being above 2016-01-03.  Since all the joins are
> >>>> > producing
> >>>> > the same counts, the natural suspicions is that the tables are
> >>>> > identical,
> >>>> > but I when I run the following two queries:
> >>>> >
> >>>> > scala> sqlContext.sql("select * from swig_pin_promo_lt where date
> >>>> >>='2016-01-03'").count
> >>>> >
> >>>> > res14: Long = 34158
> >>>> >
> >>>> > scala> sqlContext.sql("select * from dps_pin_promo_lt where date
> >>>> >>='2016-01-03'").count
> >>>> >
> >>>> > res15: Long = 42693
> >>>> >
> >>>> >
> >>>> > The above two queries filter out the data based on date used by the
> >>>> > joins of
> >>>> > 2016-01-03 and you can see the row count between the two tables are
> >>>> > different, which is why I am suspecting something is wrong with the
> >>>> > outer
> >>>> > joins in spark sql, because in this situation the right and outer
> >>>> > joins may
> >>>> > produce the same results, but it should not be equal to the left
> join
> >>>> > and
> >>>> > definitely not the inner join; unless I am missing something.
> >>>> >
> >>>> >
> >>>> > Side note: In my haste response above I posted the wrong counts for
> >>>> > dps.count, the real value is res16: Long = 42694
> >>>> >
> >>>> >
> >>>> > Thanks,
> >>>> >
> >>>> >
> >>>> > KP
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> > On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <java8...@hotmail.com>
> >>>> > wrote:
> >>>> >>
> >>>> >> We are still not sure what is the problem, if you cannot show us
> with
> >>>> >> some
> >>>> >> example data.
> >>>> >>
> >>>> >> For dps with 42632 rows, and swig with 42034 rows, if dps full
> outer
> >>>> >> join
> >>>> >> with swig on 3 columns; with additional filters, get the same
> >>>> >> resultSet row
> >>>> >> count as dps lefter outer join with swig on 3 columns, with
> >>>> >> additional
> >>>> >> filters, again get the the same resultSet row count as dps right
> >>>> >> outer join
> >>>> >> with swig on 3 columns, with same additional filters.
> >>>> >>
> >>>> >> Without knowing your data, I cannot see the reason that has to be a
> >>>> >> bug in
> >>>> >> the spark.
> >>>> >>
> >>>> >> Am I misunderstanding your bug?
> >>>> >>
> >>>> >> Yong
> >>>> >>
> >>>> >> 
> >>>> >> From: kpe...@gmail.com
> >>>> >> Date: Mon, 2 May 2016 12:11:18 -0700
> >>>> >> Subject: Re: Weird results with Spark SQL Outer joins
> >>>> >> To: gourav.sengu...@gmail.com
> >>>> >> CC: user@spark.apache.org
> >>>> >>
> >>>> >>
> >>>> >> Gourav,
> >>>> >>
> >>>> >> I wish that was case, but I have done a select count on each of the
> >>>> >> two
> >>>> >> tables individually and they return back different number of rows:
> >>>> >>
> >>>> >>
> >>>> >> dps.registerTempTable("dps_pin_promo_lt")
> >>>> >> swig.registerTempTable("swig_pin_promo_lt")
> >>>> >>
> >>>> >>
> >>>> >> dps.count()
> >>>> >> RESULT: 42632
> >>>> >>
> >>>> >>
> >>>> >> swig.count()
> >>>> >> RESULT: 42034
> >>>> >>
> >>>> >> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta
> >>>> >> <gourav.sengu...@gmail.com> wrote:
> >>>> >>
> >>>> >> This shows that both the tables have matching records and no
> >>>> >> mismatches.
> >>>> >> Therefore obviously you have the same results irrespective of
> whether
> >>>> >> you
> >>>> >> use right or left join.
> >>>> >>
> >>>> >> I think that there is no problem here, unless I am missing
> something.
> >>>> >>
> >>>> >> Regards,
> >>>> >> Gourav
> >>>> >>
> >>>> >> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kpe...@gmail.com> wrote:
> >>>> >>
> >>>> >> Also, the results of the inner query produced the same results:
> >>>> >> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> >>>> >> d.account
> >>>> >> AS
> >>>> >> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >>>> >> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
> >>>> >> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account
> AND
> >>>> >> s.ad
> >>>> >> =
> >>>> >> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
> >>>> >> '2016-01-03'").count()
> >>>> >> RESULT:23747
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >> --
> >>>> >> View this message in context:
> >>>> >>
> >>>> >>
> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
> >>>> >> Sent from the Apache Spark User List mailing list archive at
> >>>> >> Nabble.com.
> >>>> >>
> >>>> >>
> -
> >>>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >>>> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >
> >>>
> >>>
> >>
> >
> >
> >
> > --
> > Cesar Flores
>


Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Davies Liu
n the smallest table; if there are no matches it should still
>>> bring in that row as part of the outer join.  For the full and right outer
>>> join I am expecting to see a minimum of res6 rows, but I get less, is there
>>> something specific that I am missing here?  I am expecting that the full
>>> outer join would give me the union of the two table sets so I am expecting
>>> at least 42533 rows not 23809.
>>>
>>>
>>> Gourav,
>>>
>>> I just ran this result set on a new session with slightly newer data...
>>> still seeing those results.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> KP
>>>
>>>
>>> On Mon, May 2, 2016 at 11:16 PM, Davies Liu <dav...@databricks.com>
>>> wrote:
>>>>
>>>> as @Gourav said, all the join with different join type show the same
>>>> results,
>>>> which meant that all the rows from left could match at least one row
>>>> from right,
>>>> all the rows from right could match at least one row from left, even
>>>> the number of row from left does not equal that of right.
>>>>
>>>> This is correct result.
>>>>
>>>> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <kpe...@gmail.com> wrote:
>>>> > Yong,
>>>> >
>>>> > Sorry, let explain my deduction; it is going be difficult to get a
>>>> > sample
>>>> > data out since the dataset I am using is proprietary.
>>>> >
>>>> > From the above set queries (ones mentioned in above comments) both
>>>> > inner and
>>>> > outer join are producing the same counts.  They are basically pulling
>>>> > out
>>>> > selected columns from the query, but there is no roll up happening or
>>>> > anything that would possible make it suspicious that there is any
>>>> > difference
>>>> > besides the type of joins.  The tables are matched based on three keys
>>>> > that
>>>> > are present in both tables (ad, account, and date), on top of this
>>>> > they are
>>>> > filtered by date being above 2016-01-03.  Since all the joins are
>>>> > producing
>>>> > the same counts, the natural suspicions is that the tables are
>>>> > identical,
>>>> > but I when I run the following two queries:
>>>> >
>>>> > scala> sqlContext.sql("select * from swig_pin_promo_lt where date
>>>> >>='2016-01-03'").count
>>>> >
>>>> > res14: Long = 34158
>>>> >
>>>> > scala> sqlContext.sql("select * from dps_pin_promo_lt where date
>>>> >>='2016-01-03'").count
>>>> >
>>>> > res15: Long = 42693
>>>> >
>>>> >
>>>> > The above two queries filter out the data based on date used by the
>>>> > joins of
>>>> > 2016-01-03 and you can see the row count between the two tables are
>>>> > different, which is why I am suspecting something is wrong with the
>>>> > outer
>>>> > joins in spark sql, because in this situation the right and outer
>>>> > joins may
>>>> > produce the same results, but it should not be equal to the left join
>>>> > and
>>>> > definitely not the inner join; unless I am missing something.
>>>> >
>>>> >
>>>> > Side note: In my haste response above I posted the wrong counts for
>>>> > dps.count, the real value is res16: Long = 42694
>>>> >
>>>> >
>>>> > Thanks,
>>>> >
>>>> >
>>>> > KP
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <java8...@hotmail.com>
>>>> > wrote:
>>>> >>
>>>> >> We are still not sure what is the problem, if you cannot show us with
>>>> >> some
>>>> >> example data.
>>>> >>
>>>> >> For dps with 42632 rows, and swig with 42034 rows, if dps full outer
>>>> >> join
>>>> >> with swig on 3 columns; with additional filters, get the same
>>>> >> resultSet row
>>>> >> co

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Cesar Flores
n, May 2, 2016 at 11:16 PM, Davies Liu <dav...@databricks.com>
>> wrote:
>>
>>> as @Gourav said, all the join with different join type show the same
>>> results,
>>> which meant that all the rows from left could match at least one row
>>> from right,
>>> all the rows from right could match at least one row from left, even
>>> the number of row from left does not equal that of right.
>>>
>>> This is correct result.
>>>
>>> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <kpe...@gmail.com> wrote:
>>> > Yong,
>>> >
>>> > Sorry, let explain my deduction; it is going be difficult to get a
>>> sample
>>> > data out since the dataset I am using is proprietary.
>>> >
>>> > From the above set queries (ones mentioned in above comments) both
>>> inner and
>>> > outer join are producing the same counts.  They are basically pulling
>>> out
>>> > selected columns from the query, but there is no roll up happening or
>>> > anything that would possible make it suspicious that there is any
>>> difference
>>> > besides the type of joins.  The tables are matched based on three keys
>>> that
>>> > are present in both tables (ad, account, and date), on top of this
>>> they are
>>> > filtered by date being above 2016-01-03.  Since all the joins are
>>> producing
>>> > the same counts, the natural suspicions is that the tables are
>>> identical,
>>> > but I when I run the following two queries:
>>> >
>>> > scala> sqlContext.sql("select * from swig_pin_promo_lt where date
>>> >>='2016-01-03'").count
>>> >
>>> > res14: Long = 34158
>>> >
>>> > scala> sqlContext.sql("select * from dps_pin_promo_lt where date
>>> >>='2016-01-03'").count
>>> >
>>> > res15: Long = 42693
>>> >
>>> >
>>> > The above two queries filter out the data based on date used by the
>>> joins of
>>> > 2016-01-03 and you can see the row count between the two tables are
>>> > different, which is why I am suspecting something is wrong with the
>>> outer
>>> > joins in spark sql, because in this situation the right and outer
>>> joins may
>>> > produce the same results, but it should not be equal to the left join
>>> and
>>> > definitely not the inner join; unless I am missing something.
>>> >
>>> >
>>> > Side note: In my haste response above I posted the wrong counts for
>>> > dps.count, the real value is res16: Long = 42694
>>> >
>>> >
>>> > Thanks,
>>> >
>>> >
>>> > KP
>>> >
>>> >
>>> >
>>> >
>>> > On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <java8...@hotmail.com>
>>> wrote:
>>> >>
>>> >> We are still not sure what is the problem, if you cannot show us with
>>> some
>>> >> example data.
>>> >>
>>> >> For dps with 42632 rows, and swig with 42034 rows, if dps full outer
>>> join
>>> >> with swig on 3 columns; with additional filters, get the same
>>> resultSet row
>>> >> count as dps lefter outer join with swig on 3 columns, with additional
>>> >> filters, again get the the same resultSet row count as dps right
>>> outer join
>>> >> with swig on 3 columns, with same additional filters.
>>> >>
>>> >> Without knowing your data, I cannot see the reason that has to be a
>>> bug in
>>> >> the spark.
>>> >>
>>> >> Am I misunderstanding your bug?
>>> >>
>>> >> Yong
>>> >>
>>> >> 
>>> >> From: kpe...@gmail.com
>>> >> Date: Mon, 2 May 2016 12:11:18 -0700
>>> >> Subject: Re: Weird results with Spark SQL Outer joins
>>> >> To: gourav.sengu...@gmail.com
>>> >> CC: user@spark.apache.org
>>> >>
>>> >>
>>> >> Gourav,
>>> >>
>>> >> I wish that was case, but I have done a select count on each of the
>>> two
>>> >> tables individually and they return back different number of rows:
>>> >>
>>> >>
>>> >> dps.registerTempTable("dps_pin_promo_lt")
>>> >> swig.registerTempTable("swig_pin_promo_lt")
>>> >>
>>> >>
>>> >> dps.count()
>>> >> RESULT: 42632
>>> >>
>>> >>
>>> >> swig.count()
>>> >> RESULT: 42034
>>> >>
>>> >> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta
>>> >> <gourav.sengu...@gmail.com> wrote:
>>> >>
>>> >> This shows that both the tables have matching records and no
>>> mismatches.
>>> >> Therefore obviously you have the same results irrespective of whether
>>> you
>>> >> use right or left join.
>>> >>
>>> >> I think that there is no problem here, unless I am missing something.
>>> >>
>>> >> Regards,
>>> >> Gourav
>>> >>
>>> >> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kpe...@gmail.com> wrote:
>>> >>
>>> >> Also, the results of the inner query produced the same results:
>>> >> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
>>> d.account
>>> >> AS
>>> >> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>>> >> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
>>> >> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>>> s.ad
>>> >> =
>>> >> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>>> '2016-01-03'").count()
>>> >> RESULT:23747
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> View this message in context:
>>> >>
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>>> >> Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> >> For additional commands, e-mail: user-h...@spark.apache.org
>>> >>
>>> >>
>>> >>
>>> >
>>>
>>
>>
>


-- 
Cesar Flores


Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Gourav Sengupta
gt;> >
>> > From the above set queries (ones mentioned in above comments) both
>> inner and
>> > outer join are producing the same counts.  They are basically pulling
>> out
>> > selected columns from the query, but there is no roll up happening or
>> > anything that would possible make it suspicious that there is any
>> difference
>> > besides the type of joins.  The tables are matched based on three keys
>> that
>> > are present in both tables (ad, account, and date), on top of this they
>> are
>> > filtered by date being above 2016-01-03.  Since all the joins are
>> producing
>> > the same counts, the natural suspicions is that the tables are
>> identical,
>> > but I when I run the following two queries:
>> >
>> > scala> sqlContext.sql("select * from swig_pin_promo_lt where date
>> >>='2016-01-03'").count
>> >
>> > res14: Long = 34158
>> >
>> > scala> sqlContext.sql("select * from dps_pin_promo_lt where date
>> >>='2016-01-03'").count
>> >
>> > res15: Long = 42693
>> >
>> >
>> > The above two queries filter out the data based on date used by the
>> joins of
>> > 2016-01-03 and you can see the row count between the two tables are
>> > different, which is why I am suspecting something is wrong with the
>> outer
>> > joins in spark sql, because in this situation the right and outer joins
>> may
>> > produce the same results, but it should not be equal to the left join
>> and
>> > definitely not the inner join; unless I am missing something.
>> >
>> >
>> > Side note: In my haste response above I posted the wrong counts for
>> > dps.count, the real value is res16: Long = 42694
>> >
>> >
>> > Thanks,
>> >
>> >
>> > KP
>> >
>> >
>> >
>> >
>> > On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <java8...@hotmail.com>
>> wrote:
>> >>
>> >> We are still not sure what is the problem, if you cannot show us with
>> some
>> >> example data.
>> >>
>> >> For dps with 42632 rows, and swig with 42034 rows, if dps full outer
>> join
>> >> with swig on 3 columns; with additional filters, get the same
>> resultSet row
>> >> count as dps lefter outer join with swig on 3 columns, with additional
>> >> filters, again get the the same resultSet row count as dps right outer
>> join
>> >> with swig on 3 columns, with same additional filters.
>> >>
>> >> Without knowing your data, I cannot see the reason that has to be a
>> bug in
>> >> the spark.
>> >>
>> >> Am I misunderstanding your bug?
>> >>
>> >> Yong
>> >>
>> >> 
>> >> From: kpe...@gmail.com
>> >> Date: Mon, 2 May 2016 12:11:18 -0700
>> >> Subject: Re: Weird results with Spark SQL Outer joins
>> >> To: gourav.sengu...@gmail.com
>> >> CC: user@spark.apache.org
>> >>
>> >>
>> >> Gourav,
>> >>
>> >> I wish that was case, but I have done a select count on each of the two
>> >> tables individually and they return back different number of rows:
>> >>
>> >>
>> >> dps.registerTempTable("dps_pin_promo_lt")
>> >> swig.registerTempTable("swig_pin_promo_lt")
>> >>
>> >>
>> >> dps.count()
>> >> RESULT: 42632
>> >>
>> >>
>> >> swig.count()
>> >> RESULT: 42034
>> >>
>> >> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta
>> >> <gourav.sengu...@gmail.com> wrote:
>> >>
>> >> This shows that both the tables have matching records and no
>> mismatches.
>> >> Therefore obviously you have the same results irrespective of whether
>> you
>> >> use right or left join.
>> >>
>> >> I think that there is no problem here, unless I am missing something.
>> >>
>> >> Regards,
>> >> Gourav
>> >>
>> >> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kpe...@gmail.com> wrote:
>> >>
>> >> Also, the results of the inner query produced the same results:
>> >> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
>> d.account
>> >> AS
>> >> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> >> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
>> >> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>> s.ad
>> >> =
>> >> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>> '2016-01-03'").count()
>> >> RESULT:23747
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>> >> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>> >>
>> >> -
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org
>> >>
>> >>
>> >>
>> >
>>
>
>


Re: parquet table in spark-sql

2016-05-03 Thread Sandeep Nemuri
We don't need any delimiters for Parquet file format.


On Tue, May 3, 2016 at 5:31 AM, Varadharajan Mukundan 
wrote:

> Hi,
>
> Yes, it is not needed. Delimiters are needed only for text files.
>
> On Tue, May 3, 2016 at 12:49 PM, 喜之郎 <251922...@qq.com> wrote:
>
>> hi, I want to ask a question about parquet tables in spark-sql.
>>
>> I think that parquet has schema information in its own file,
>> so you don't need to define a row separator and column separator in the
>> create-table DDL, like this:
>>
>> total_duration  BigInt)
>> ROW FORMAT DELIMITED
>>   FIELDS TERMINATED BY ','
>>   LINES TERMINATED BY '\n'
>>
>> can anyone give me a answer? thanks
>>
>>
>>
>
>
> --
> Thanks,
> M. Varadharajan
>
> 
>
> "Experience is what you get when you didn't get what you wanted"
>-By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
>



-- 
*  Regards*
*  Sandeep Nemuri*


Re: parquet table in spark-sql

2016-05-03 Thread Varadharajan Mukundan
Hi,

Yes, it is not needed. Delimiters are needed only for text files.
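
For example, a parquet-backed table only needs the columns and the storage
format; a minimal sketch along these lines (table and column names loosely
taken from your snippet):

  CREATE TABLE call_stats (
    total_duration BIGINT
  )
  STORED AS PARQUET;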

On Tue, May 3, 2016 at 12:49 PM, 喜之郎 <251922...@qq.com> wrote:

> hi, I want to ask a question about parquet tables in spark-sql.
>
> I think that parquet has schema information in its own file,
> so you don't need to define a row separator and column separator in the
> create-table DDL, like this:
>
> total_duration  BigInt)
> ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
>   LINES TERMINATED BY '\n'
>
> can anyone give me a answer? thanks
>
>
>


-- 
Thanks,
M. Varadharajan



"Experience is what you get when you didn't get what you wanted"
   -By Prof. Randy Pausch in "The Last Lecture"

My Journal :- http://varadharajan.in


Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
o_lt where date
> >>='2016-01-03'").count
> >
> > res14: Long = 34158
> >
> > scala> sqlContext.sql("select * from dps_pin_promo_lt where date
> >>='2016-01-03'").count
> >
> > res15: Long = 42693
> >
> >
> > The above two queries filter out the data based on date used by the
> joins of
> > 2016-01-03 and you can see the row count between the two tables are
> > different, which is why I am suspecting something is wrong with the outer
> > joins in spark sql, because in this situation the right and outer joins
> may
> > produce the same results, but it should not be equal to the left join and
> > definitely not the inner join; unless I am missing something.
> >
> >
> > Side note: In my haste response above I posted the wrong counts for
> > dps.count, the real value is res16: Long = 42694
> >
> >
> > Thanks,
> >
> >
> > KP
> >
> >
> >
> >
> > On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <java8...@hotmail.com>
> wrote:
> >>
> >> We are still not sure what is the problem, if you cannot show us with
> some
> >> example data.
> >>
> >> For dps with 42632 rows, and swig with 42034 rows, if dps full outer
> join
> >> with swig on 3 columns; with additional filters, get the same resultSet
> row
> >> count as dps lefter outer join with swig on 3 columns, with additional
> >> filters, again get the the same resultSet row count as dps right outer
> join
> >> with swig on 3 columns, with same additional filters.
> >>
> >> Without knowing your data, I cannot see the reason that has to be a bug
> in
> >> the spark.
> >>
> >> Am I misunderstanding your bug?
> >>
> >> Yong
> >>
> >> 
> >> From: kpe...@gmail.com
> >> Date: Mon, 2 May 2016 12:11:18 -0700
> >> Subject: Re: Weird results with Spark SQL Outer joins
> >> To: gourav.sengu...@gmail.com
> >> CC: user@spark.apache.org
> >>
> >>
> >> Gourav,
> >>
> >> I wish that was case, but I have done a select count on each of the two
> >> tables individually and they return back different number of rows:
> >>
> >>
> >> dps.registerTempTable("dps_pin_promo_lt")
> >> swig.registerTempTable("swig_pin_promo_lt")
> >>
> >>
> >> dps.count()
> >> RESULT: 42632
> >>
> >>
> >> swig.count()
> >> RESULT: 42034
> >>
> >> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta
> >> <gourav.sengu...@gmail.com> wrote:
> >>
> >> This shows that both the tables have matching records and no mismatches.
> >> Therefore obviously you have the same results irrespective of whether
> you
> >> use right or left join.
> >>
> >> I think that there is no problem here, unless I am missing something.
> >>
> >> Regards,
> >> Gourav
> >>
> >> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kpe...@gmail.com> wrote:
> >>
> >> Also, the results of the inner query produced the same results:
> >> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> d.account
> >> AS
> >> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
> >> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
> s.ad
> >> =
> >> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
> '2016-01-03'").count()
> >> RESULT:23747
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >>
> >>
> >
>


Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Davies Liu
As @Gourav said, all the joins with different join types show the same results,
which means that every row from the left matched at least one row from the right,
and every row from the right matched at least one row from the left, even though
the number of rows on the left does not equal the number on the right.

This is the correct result.
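
A minimal sketch of that situation (hypothetical toy data, using the same Scala
shell API as in this thread): when every key on each side has at least one match
on the other side, the inner, left outer, right outer and full outer joins all
return the same count, even though the two tables have different row counts.

// hypothetical tables: every key in each table also appears in the other
val a = sqlContext.createDataFrame(Seq((1, "a1"), (1, "a2"), (2, "a3"))).toDF("k", "va")
val b = sqlContext.createDataFrame(Seq((1, "b1"), (2, "b2"), (2, "b3"), (2, "b4"))).toDF("k", "vb")
a.registerTempTable("ta")
b.registerTempTable("tb")

a.count()                                                                     // 3
b.count()                                                                     // 4
sqlContext.sql("SELECT * FROM ta JOIN tb ON ta.k = tb.k").count()             // 5
sqlContext.sql("SELECT * FROM ta LEFT OUTER JOIN tb ON ta.k = tb.k").count()  // 5
sqlContext.sql("SELECT * FROM ta RIGHT OUTER JOIN tb ON ta.k = tb.k").count() // 5
sqlContext.sql("SELECT * FROM ta FULL OUTER JOIN tb ON ta.k = tb.k").count()  // 5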

On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <kpe...@gmail.com> wrote:
> Yong,
>
> Sorry, let explain my deduction; it is going be difficult to get a sample
> data out since the dataset I am using is proprietary.
>
> From the above set queries (ones mentioned in above comments) both inner and
> outer join are producing the same counts.  They are basically pulling out
> selected columns from the query, but there is no roll up happening or
> anything that would possible make it suspicious that there is any difference
> besides the type of joins.  The tables are matched based on three keys that
> are present in both tables (ad, account, and date), on top of this they are
> filtered by date being above 2016-01-03.  Since all the joins are producing
> the same counts, the natural suspicions is that the tables are identical,
> but I when I run the following two queries:
>
> scala> sqlContext.sql("select * from swig_pin_promo_lt where date
>>='2016-01-03'").count
>
> res14: Long = 34158
>
> scala> sqlContext.sql("select * from dps_pin_promo_lt where date
>>='2016-01-03'").count
>
> res15: Long = 42693
>
>
> The above two queries filter out the data based on date used by the joins of
> 2016-01-03 and you can see the row count between the two tables are
> different, which is why I am suspecting something is wrong with the outer
> joins in spark sql, because in this situation the right and outer joins may
> produce the same results, but it should not be equal to the left join and
> definitely not the inner join; unless I am missing something.
>
>
> Side note: In my haste response above I posted the wrong counts for
> dps.count, the real value is res16: Long = 42694
>
>
> Thanks,
>
>
> KP
>
>
>
>
> On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <java8...@hotmail.com> wrote:
>>
>> We are still not sure what is the problem, if you cannot show us with some
>> example data.
>>
>> For dps with 42632 rows, and swig with 42034 rows, if dps full outer join
>> with swig on 3 columns; with additional filters, get the same resultSet row
>> count as dps lefter outer join with swig on 3 columns, with additional
>> filters, again get the the same resultSet row count as dps right outer join
>> with swig on 3 columns, with same additional filters.
>>
>> Without knowing your data, I cannot see the reason that has to be a bug in
>> the spark.
>>
>> Am I misunderstanding your bug?
>>
>> Yong
>>
>> 
>> From: kpe...@gmail.com
>> Date: Mon, 2 May 2016 12:11:18 -0700
>> Subject: Re: Weird results with Spark SQL Outer joins
>> To: gourav.sengu...@gmail.com
>> CC: user@spark.apache.org
>>
>>
>> Gourav,
>>
>> I wish that was case, but I have done a select count on each of the two
>> tables individually and they return back different number of rows:
>>
>>
>> dps.registerTempTable("dps_pin_promo_lt")
>> swig.registerTempTable("swig_pin_promo_lt")
>>
>>
>> dps.count()
>> RESULT: 42632
>>
>>
>> swig.count()
>> RESULT: 42034
>>
>> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta
>> <gourav.sengu...@gmail.com> wrote:
>>
>> This shows that both the tables have matching records and no mismatches.
>> Therefore obviously you have the same results irrespective of whether you
>> use right or left join.
>>
>> I think that there is no problem here, unless I am missing something.
>>
>> Regards,
>> Gourav
>>
>> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kpe...@gmail.com> wrote:
>>
>> Also, the results of the inner query produced the same results:
>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>> AS
>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
>> =
>> d.ad) WHERE s.date >= '2016-01-03'AND d.date >= '2016-01-03'").count()
>> RESULT:23747
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
Hi,

it's 4:40 in the morning here, so I might not be getting things right.

But there is a very high chance of getting spurious results if you have
created that variable more than once in the IPython or pyspark shell,
cached it, and are reusing it. Please close the sessions, create the
variable only once, then add the table (after ensuring that you have
dropped it first), and then check it.

I had faced this issue once and realized it was because of the above
reasons.


Regards,
Gourav
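
A hypothetical sketch of that suggestion, reusing the temp-table names from
this thread and the Spark 1.x SQLContext API (the original DataFrame
construction isn't shown in the thread, so it is omitted here):

// in a fresh shell, build each DataFrame exactly once, then:
dps.unpersist()                                // discard any stale cached copy
sqlContext.dropTempTable("dps_pin_promo_lt")   // drop the old registration first
dps.registerTempTable("dps_pin_promo_lt")      // register the freshly built DataFrame

swig.unpersist()
sqlContext.dropTempTable("swig_pin_promo_lt")
swig.registerTempTable("swig_pin_promo_lt")

// then re-run the counts
sqlContext.sql("select * from swig_pin_promo_lt where date >= '2016-01-03'").count()
sqlContext.sql("select * from dps_pin_promo_lt where date >= '2016-01-03'").count()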

On Mon, May 2, 2016 at 10:13 PM, Kevin Peng <kpe...@gmail.com> wrote:

> Yong,
>
> Sorry, let explain my deduction; it is going be difficult to get a sample
> data out since the dataset I am using is proprietary.
>
> From the above set queries (ones mentioned in above comments) both inner
> and outer join are producing the same counts.  They are basically pulling
> out selected columns from the query, but there is no roll up happening or
> anything that would possible make it suspicious that there is any
> difference besides the type of joins.  The tables are matched based on
> three keys that are present in both tables (ad, account, and date), on top
> of this they are filtered by date being above 2016-01-03.  Since all the
> joins are producing the same counts, the natural suspicions is that the
> tables are identical, but I when I run the following two queries:
>
> scala> sqlContext.sql("select * from swig_pin_promo_lt where date
> >='2016-01-03'").count
>
> res14: Long = 34158
>
> scala> sqlContext.sql("select * from dps_pin_promo_lt where date
> >='2016-01-03'").count
>
> res15: Long = 42693
>
>
> The above two queries filter out the data based on date used by the joins
> of 2016-01-03 and you can see the row count between the two tables are
> different, which is why I am suspecting something is wrong with the outer
> joins in spark sql, because in this situation the right and outer joins may
> produce the same results, but it should not be equal to the left join and
> definitely not the inner join; unless I am missing something.
>
>
> Side note: In my haste response above I posted the wrong counts for
> dps.count, the real value is res16: Long = 42694
>
>
> Thanks,
>
>
> KP
>
>
>
> On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <java8...@hotmail.com> wrote:
>
>> We are still not sure what is the problem, if you cannot show us with
>> some example data.
>>
>> For dps with 42632 rows, and swig with 42034 rows, if dps full outer join
>> with swig on 3 columns; with additional filters, get the same resultSet row
>> count as dps lefter outer join with swig on 3 columns, with additional
>> filters, again get the the same resultSet row count as dps right outer join
>> with swig on 3 columns, with same additional filters.
>>
>> Without knowing your data, I cannot see the reason that has to be a bug
>> in the spark.
>>
>> Am I misunderstanding your bug?
>>
>> Yong
>>
>> --
>> From: kpe...@gmail.com
>> Date: Mon, 2 May 2016 12:11:18 -0700
>> Subject: Re: Weird results with Spark SQL Outer joins
>> To: gourav.sengu...@gmail.com
>> CC: user@spark.apache.org
>>
>>
>> Gourav,
>>
>> I wish that was case, but I have done a select count on each of the two
>> tables individually and they return back different number of rows:
>>
>>
>> dps.registerTempTable("dps_pin_promo_lt")
>> swig.registerTempTable("swig_pin_promo_lt")
>>
>>
>> dps.count()
>> RESULT: 42632
>>
>>
>> swig.count()
>> RESULT: 42034
>>
>> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>> This shows that both the tables have matching records and no mismatches.
>> Therefore obviously you have the same results irrespective of whether you
>> use right or left join.
>>
>> I think that there is no problem here, unless I am missing something.
>>
>> Regards,
>> Gourav
>>
>> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kpe...@gmail.com> wrote:
>>
>> Also, the results of the inner query produced the same results:
>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>> AS
>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>> s.ad =
>> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>> '2016-01-03'").count()
>> RESULT:23747
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>>
>>
>


Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Yong,

Sorry, let me explain my deduction; it is going to be difficult to get a
sample of the data out since the dataset I am using is proprietary.

From the above set of queries (the ones mentioned in the comments above), both
the inner and the outer joins are producing the same counts.  They are
basically pulling out selected columns from the query, but there is no roll-up
happening or anything else that would possibly make it suspicious that there is
any difference besides the type of join.  The tables are matched on three keys
that are present in both tables (ad, account, and date), and on top of this
they are filtered by date being on or after 2016-01-03.  Since all the joins
are producing the same counts, the natural suspicion is that the tables are
identical, but when I run the following two queries:

scala> sqlContext.sql("select * from swig_pin_promo_lt where date
>='2016-01-03'").count

res14: Long = 34158

scala> sqlContext.sql("select * from dps_pin_promo_lt where date
>='2016-01-03'").count

res15: Long = 42693


The above two queries filter the data on the same date used by the joins,
2016-01-03, and you can see that the row counts of the two tables are
different, which is why I suspect something is wrong with the outer joins in
Spark SQL: in this situation the right outer and full outer joins may produce
the same results, but they should not be equal to the left join and definitely
not to the inner join; unless I am missing something.


Side note: in my hasty response above I posted the wrong count for
dps.count; the real value is res16: Long = 42694.


Thanks,


KP



On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <java8...@hotmail.com> wrote:

> We are still not sure what is the problem, if you cannot show us with some
> example data.
>
> For dps with 42632 rows, and swig with 42034 rows, if dps full outer join
> with swig on 3 columns; with additional filters, get the same resultSet row
> count as dps lefter outer join with swig on 3 columns, with additional
> filters, again get the the same resultSet row count as dps right outer join
> with swig on 3 columns, with same additional filters.
>
> Without knowing your data, I cannot see the reason that has to be a bug in
> the spark.
>
> Am I misunderstanding your bug?
>
> Yong
>
> ------
> From: kpe...@gmail.com
> Date: Mon, 2 May 2016 12:11:18 -0700
> Subject: Re: Weird results with Spark SQL Outer joins
> To: gourav.sengu...@gmail.com
> CC: user@spark.apache.org
>
>
> Gourav,
>
> I wish that was case, but I have done a select count on each of the two
> tables individually and they return back different number of rows:
>
>
> dps.registerTempTable("dps_pin_promo_lt")
> swig.registerTempTable("swig_pin_promo_lt")
>
>
> dps.count()
> RESULT: 42632
>
>
> swig.count()
> RESULT: 42034
>
> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
> This shows that both the tables have matching records and no mismatches.
> Therefore obviously you have the same results irrespective of whether you
> use right or left join.
>
> I think that there is no problem here, unless I am missing something.
>
> Regards,
> Gourav
>
> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kpe...@gmail.com> wrote:
>
> Also, the results of the inner query produced the same results:
> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
> AS
> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
> =
> d.ad) WHERE s.date >= '2016-01-03'AND d.date >= '2016-01-03'").count()
> RESULT:23747
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>


RE: Weird results with Spark SQL Outer joins

2016-05-02 Thread Yong Zhang
We are still not sure what the problem is if you cannot show us some
example data.

For dps with 42632 rows and swig with 42034 rows: dps full outer join with
swig on 3 columns, with additional filters, gets the same resultSet row count
as dps left outer join with swig on 3 columns, with additional filters, which
again gets the same resultSet row count as dps right outer join with swig on 3
columns, with the same additional filters.

Without knowing your data, I cannot see why that has to be a bug in Spark.

Am I misunderstanding your bug?

Yong
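
To make the comparison concrete, a sketch of the three outer-join variants
described above, built from the inner-join query quoted later in this message
by changing only the join type (this is an illustration of the comparison, not
code from the original thread; the thread reports that all of these counts
matched the inner-join count):

Seq("FULL OUTER", "LEFT OUTER", "RIGHT OUTER").foreach { jt =>
  val c = sqlContext.sql(s"""
    SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc,
           s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend,
           d.spend_in_dollar AS d_spend
    FROM swig_pin_promo_lt s $jt JOIN dps_pin_promo_lt d
      ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad)
    WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'""").count()
  println(s"$jt JOIN count: $c")   // each reported as 23747 in the thread
}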

From: kpe...@gmail.com
Date: Mon, 2 May 2016 12:11:18 -0700
Subject: Re: Weird results with Spark SQL Outer joins
To: gourav.sengu...@gmail.com
CC: user@spark.apache.org

Gourav,
I wish that was case, but I have done a select count on each of the two tables
individually and they return back different number of rows:

dps.registerTempTable("dps_pin_promo_lt")
swig.registerTempTable("swig_pin_promo_lt")

dps.count()
RESULT: 42632

swig.count()
RESULT: 42034

On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:
This shows that both the tables have matching records and no mismatches.
Therefore obviously you have the same results irrespective of whether you use
right or left join.
I think that there is no problem here, unless I am missing something.
Regards,
Gourav
On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kpe...@gmail.com> wrote:
Also, the results of the inner query produced the same results:
sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account AS
d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad =
d.ad) WHERE s.date >= '2016-01-03'AND d.date >= '2016-01-03'").count()
RESULT:23747

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav,

I wish that was the case, but I have done a select count on each of the two
tables individually and they return different numbers of rows:

dps.registerTempTable("dps_pin_promo_lt")
swig.registerTempTable("swig_pin_promo_lt")

dps.count()
RESULT: 42632

swig.count()
RESULT: 42034
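
One way to test directly whether every row on each side has a match (which is
what the identical join counts would suggest) is to count the unmatched rows on
each side; a hedged sketch using a left outer join plus an IS NULL filter on
the same three keys (not from the original thread, just an illustration):

// swig rows with no matching dps row
sqlContext.sql("""SELECT * FROM swig_pin_promo_lt s
  LEFT OUTER JOIN dps_pin_promo_lt d
    ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad)
  WHERE d.date IS NULL""").count()

// dps rows with no matching swig row
sqlContext.sql("""SELECT * FROM dps_pin_promo_lt d
  LEFT OUTER JOIN swig_pin_promo_lt s
    ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad)
  WHERE s.date IS NULL""").count()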

On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta 
wrote:

> This shows that both the tables have matching records and no mismatches.
> Therefore obviously you have the same results irrespective of whether you
> use right or left join.
>
> I think that there is no problem here, unless I am missing something.
>
> Regards,
> Gourav
>
> On Mon, May 2, 2016 at 7:48 PM, kpeng1  wrote:
>
>> Also, the results of the inner query produced the same results:
>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>> AS
>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>> s.ad =
>> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>> '2016-01-03'").count()
>> RESULT:23747
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

