Re: [SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-12 Thread Jacek Laskowski
Hi,

You could use QueryExecutionListener or Spark listeners to intercept query
execution events and extract whatever is required. That's what web UI does
(as it's simply a bunch of SparkListeners --> https://youtu.be/mVP9sZ6K__Y
;-)).
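A minimal sketch of the REST side of the question: the history server's SQL endpoint lists each SQL execution with its plan nodes, and WholeStageCodegen nodes carry their id in the node name. The /sql path, the details parameter and the JSON field names are assumptions that hold for recent Spark 3.x history servers and may be absent or different on older ones; the base URL is copied from the question below.

import requests

history = "http://node:18088/api/v1"                      # assumed history-server base URL
app_id = "application_1663600377480_62091"

executions = requests.get(f"{history}/applications/{app_id}/sql",
                          params={"details": "true"}).json()

for ex in executions:
    # WholeStageCodegen nodes are named e.g. "WholeStageCodegen (2)";
    # newer servers may also expose a wholeStageCodegenId field per node.
    ids = [n.get("wholeStageCodegenId") or n.get("nodeName")
           for n in ex.get("nodes", [])
           if "WholeStageCodegen" in n.get("nodeName", "")]
    print(ex.get("id"), ids)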

Pozdrawiam,
Jacek Laskowski

"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>


On Fri, Apr 7, 2023 at 12:23 PM Chenghao Lyu  wrote:

> Hi,
>
> The detailed stage page shows the involved WholeStageCodegen Ids in its
> DAG visualization from the Spark UI when running a SparkSQL. (e.g., under
> the link
> node:18088/history/application_1663600377480_62091/stages/stage/?id=1=0).
>
> However, I have trouble extracting the WholeStageCodegen ids from the DAG
> visualization via the RESTAPIs. Is there any other way to get the
> WholeStageCodegen Ids information for each stage automatically?
>
> Cheers,
> Chenghao
>


Re: [SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-11 Thread Chitral Verma
try explain codegen on your DF and then parse the string
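A short PySpark sketch of that suggestion (the mode argument needs Spark 3.0+, an active SparkSession named spark is assumed, and the regex is a guess at the plan-text format, where codegen'd operators are printed as "*(id) Operator"):

import re

df = spark.range(100).selectExpr("id % 3 AS k").groupBy("k").count()
df.createOrReplaceTempView("t")                                        # hypothetical view name

plan = spark.sql("EXPLAIN CODEGEN SELECT k, count(*) FROM t GROUP BY k") \
            .collect()[0][0]                                           # one row, one string column

codegen_ids = sorted(set(int(m) for m in re.findall(r"\*\((\d+)\)", plan)))
print(codegen_ids)

df.explain(mode="codegen")   # same text, printed straight to stdout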

On Fri, 7 Apr, 2023, 3:53 pm Chenghao Lyu,  wrote:

> Hi,
>
> The detailed stage page shows the involved WholeStageCodegen Ids in its
> DAG visualization from the Spark UI when running a SparkSQL. (e.g., under
> the link
> node:18088/history/application_1663600377480_62091/stages/stage/?id=1=0).
>
> However, I have trouble extracting the WholeStageCodegen ids from the DAG
> visualization via the RESTAPIs. Is there any other way to get the
> WholeStageCodegen Ids information for each stage automatically?
>
> Cheers,
> Chenghao
>


[SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-07 Thread Chenghao Lyu
Hi,

The detailed stage page shows the involved WholeStageCodegen Ids in its DAG 
visualization from the Spark UI when running a SparkSQL. (e.g., under the link 
node:18088/history/application_1663600377480_62091/stages/stage/?id=1=0).

However, I have trouble extracting the WholeStageCodegen ids from the DAG 
visualization via the RESTAPIs. Is there any other way to get the 
WholeStageCodegen Ids information for each stage automatically?

Cheers,
Chenghao


Re: Adding OpenSearch as a secondary index provider to SparkSQL

2023-03-24 Thread Mich Talebzadeh
Hi,

Are you talking about intelligent index scan here?

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 24 Mar 2023 at 07:03, Anirudha Jadhav  wrote:

> Hello community, wanted your opinion on this implementation demo.
>
> / support for Materialized views, skipping indices and covered indices
> with bloom filter optimizations with opensearch via SparkSQL
>
> https://github.com/opensearch-project/sql/discussions/1465
> ( see video with voice over )
>
> Ani
> --
> Anirudha P. Jadhav
>


Adding OpenSearch as a secondary index provider to SparkSQL

2023-03-24 Thread Anirudha Jadhav
Hello community, wanted your opinion on this implementation demo.

/ support for Materialized views, skipping indices and covered indices with
bloom filter optimizations with opensearch via SparkSQL

https://github.com/opensearch-project/sql/discussions/1465
( see video with voice over )

Ani
-- 
Anirudha P. Jadhav


Re: [New Project] sparksql-ml : Distributed Machine Learning using SparkSQL.

2023-02-27 Thread Russell Jurney
I think it is awesome. Brilliant interface that is missing from Spark.
Would you integrate with something like MLFlow?

Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
<http://facebook.com/jurney> datasyndrome.com Book a time on Calendly
<https://calendly.com/rjurney_personal/30min>


On Mon, Feb 27, 2023 at 10:16 AM Chitral Verma 
wrote:

> Hi All,
> I worked on this idea a few years back as a pet project to bridge
> *SparkSQL* and *SparkML* and empower anyone to implement production
> grade, distributed machine learning over Apache Spark as long as they have
> SQL skills.
>
> In principle the idea works exactly like Google's BigQueryML but at a much
> wider scope with no vendor lock-in on basically every source that's
> supported by Spark in cloud or on-prem.
>
> *Training* a ML model can look like,
>
> FIT 'LogisticRegression' ESTIMATOR WITH PARAMS(maxIter = 3) TO (
> SELECT * FROM mlDataset) AND OVERWRITE AT LOCATION '/path/to/lr-model';
>
> *Prediction* a ML model can look like,
>
> PREDICT FOR (SELECT * FROM mlTestDataset) USING MODEL STORED AT LOCATION 
> '/path/to/lr-model'
>
> *Feature Preprocessing* can look like,
>
> TRANSFORM (SELECT * FROM dataset) using 'StopWordsRemover' TRANSFORMER WITH
> PARAMS (inputCol='raw', outputCol='filtered') AND WRITE AT LOCATION 
> '/path/to/test-transformer'
>
>
> But a lot more can be done with this library.
>
> I was wondering if any of you find this interesting and would like to
> contribute to the project here,
>
> https://github.com/chitralverma/sparksql-ml
>
>
> Regards,
> Chitral Verma
>


Fwd: [New Project] sparksql-ml : Distributed Machine Learning using SparkSQL.

2023-02-27 Thread Chitral Verma
Hi All,
I worked on this idea a few years back as a pet project to bridge *SparkSQL*
and *SparkML* and empower anyone to implement production grade, distributed
machine learning over Apache Spark as long as they have SQL skills.

In principle the idea works exactly like Google's BigQueryML but at a much
wider scope with no vendor lock-in on basically every source that's
supported by Spark in cloud or on-prem.

*Training* a ML model can look like,

FIT 'LogisticRegression' ESTIMATOR WITH PARAMS(maxIter = 3) TO (
SELECT * FROM mlDataset) AND OVERWRITE AT LOCATION '/path/to/lr-model';

*Prediction* a ML model can look like,

PREDICT FOR (SELECT * FROM mlTestDataset) USING MODEL STORED AT
LOCATION '/path/to/lr-model'

*Feature Preprocessing* can look like,

TRANSFORM (SELECT * FROM dataset) using 'StopWordsRemover' TRANSFORMER WITH
PARAMS (inputCol='raw', outputCol='filtered') AND WRITE AT LOCATION
'/path/to/test-transformer'
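Roughly, the three statements above map onto plain PySpark ML calls like these (a sketch; dataset and column names are hypothetical, an active SparkSession named spark is assumed, and the library may wire things differently):

from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
from pyspark.ml.feature import StopWordsRemover

# FIT ... ESTIMATOR WITH PARAMS(maxIter = 3) ... AND OVERWRITE AT LOCATION ...
train = spark.table("mlDataset")                     # expects 'features'/'label' columns
model = LogisticRegression(maxIter=3).fit(train)
model.write().overwrite().save("/path/to/lr-model")

# PREDICT FOR (...) USING MODEL STORED AT LOCATION ...
test = spark.table("mlTestDataset")
LogisticRegressionModel.load("/path/to/lr-model").transform(test).show()

# TRANSFORM (...) USING 'StopWordsRemover' TRANSFORMER WITH PARAMS(...) AND WRITE AT LOCATION ...
remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
filtered = remover.transform(spark.table("dataset"))
remover.write().overwrite().save("/path/to/test-transformer")   # or persist `filtered` instead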


But a lot more can be done with this library.

I was wondering if any of you find this interesting and would like to
contribute to the project here,

https://github.com/chitralverma/sparksql-ml


Regards,
Chitral Verma


Re: [pyspark/sparksql]: How to overcome redundant/repetitive code? Is a for loop over an sql statement with a variable a bad idea?

2023-01-06 Thread Sean Owen
Right, nothing wrong with a for loop here. Seems like just the right thing.
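A minimal sketch of what such a loop can look like (the table-name pattern, column list and output sink are hypothetical, and an active SparkSession named spark is assumed):

import sys

models = sys.argv[1].split(",")        # e.g. spark-submit withloops.py model1,model2,model3

qry = "SELECT * FROM {table} WHERE model = '{model}'"

for m in models:
    (spark.sql(qry.format(table=f"{m}_table", model=m))
          .write
          .mode("overwrite")
          .parquet(f"/tmp/output/{m}"))   # hypothetical sink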

On Fri, Jan 6, 2023, 3:20 PM Joris Billen 
wrote:

> Hello Community,
> I am working in pyspark with sparksql and have a very similar very complex
> list of dataframes that Ill have to execute several times for all the
> “models” I have.
> Suppose the code is exactly the same for all models, only the table it
> reads from and some values in the where statements will have the modelname
> in it.
> My question is how to prevent repetitive code.
> So instead of doing somethg like this (this is pseudocode, in reality it
> makes use of lots of complex dataframes) which also would require me to
> change the code every time I change it in the future:
>
> *dfmodel1=sqlContext.sql("SELECT  FROM model1_table WHERE model ='model1' ").write()*
> *dfmodel2=sqlContext.sql("SELECT  FROM model2_table WHERE model ='model2' ").write()*
> *dfmodel3=sqlContext.sql("SELECT  FROM model3_table WHERE model ='model3' ").write()*
>
>
> For loops in spark sound like a bad idea (but that is mainly in terms of
> data, maybe nothing against looping over sql statements). Is it allowed
> to do something like this?
>
>
> *spark-submit withloops.py ["model1","model2","model3"]*
>
> *code withloops.py*
> *models=sys.argv[1]*
> *qry="""SELECT  FROM {} WHERE model ='{}'"""*
> *for i in models:*
> *  FROM_TABLE=table_model*
> *  sqlContext.sql(qry.format(i,table_model )).write()*
>
>
>
> I was trying to look up about refactoring in pyspark to prevent redundant
> code but didnt find any relevant links.
>
>
>
> Thanks for input!
>


[pyspark/sparksql]: How to overcome redundant/repetitive code? Is a for loop over an sql statement with a variable a bad idea?

2023-01-06 Thread Joris Billen
Hello Community,
I am working in pyspark with sparksql and have a very similar, very complex list 
of dataframes that I'll have to execute several times, once for each of the "models" I 
have.
Suppose the code is exactly the same for all models; only the table it reads 
from and some values in the where statements will have the model name in it.
My question is how to prevent repetitive code.
So instead of doing something like this (this is pseudocode; in reality it makes 
use of lots of complex dataframes), which would also require me to change the 
code every time I change it in the future:

dfmodel1=sqlContext.sql("SELECT  FROM model1_table WHERE model ='model1' ").write()
dfmodel2=sqlContext.sql("SELECT  FROM model2_table WHERE model ='model2' ").write()
dfmodel3=sqlContext.sql("SELECT  FROM model3_table WHERE model ='model3' ").write()


For loops in spark sound like a bad idea (but that is mainly in terms of data, 
maybe nothing against looping over sql statements). Is it allowed to do 
something like this?


spark-submit withloops.py ["model1","model2","model3"]

code withloops.py
models=sys.argv[1]
qry="""SELECT  FROM {} WHERE model ='{}'"""
for i in models:
  FROM_TABLE=table_model
  sqlContext.sql(qry.format(i,table_model )).write()



I was trying to look up how to refactor in pyspark to prevent redundant code 
but didn't find any relevant links.



Thanks for input!


Re: Can we upload a csv dataset into Hive using SparkSQL?

2022-12-13 Thread Artemis User
Your DDL statement doesn't look right.  You may want to check the Spark 
SQL Reference online for how to create table in Hive format 
(https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-hiveformat.html). 
You should be able to populate the table directly using CREATE by 
providing a location parameter.


If creating a table from CSV is your only objective, you may want to 
consider using beeline since it is more efficient and probably supports 
more standard SQL functions...
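An alternative sketch that skips the LOAD DATA step entirely: read the CSV with Spark and save it through the metastore (the Windows path is copied from the message below; the header and schema-inference options are assumptions about the file):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv-to-hive")
         .enableHiveSupport()
         .getOrCreate())

(spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("file:///C:/Users/Me/Documents/examples/src/main/resources/data.csv")
      .write
      .mode("overwrite")
      .saveAsTable("csvFile"))     # stored via the Hive metastore (Parquet by default;
                                   # add .format("hive") for a Hive-SerDe table)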


On 12/10/22 7:07 AM, sam smith wrote:

Hello,

I want to create a table in Hive and then load a CSV file content into 
it all by means of Spark SQL.
I saw in the docs the example with the .txt file BUT can we do instead 
something like the following to accomplish what i want? :


String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
SparkSession spark = SparkSession
    .builder()
    .appName("Java Spark Hive Example")
    .config("spark.sql.warehouse.dir", warehouseLocation)
    .enableHiveSupport()
    .getOrCreate();
spark.sql("CREATE TABLE IF NOT EXISTS csvFile USING hive");
spark.sql("LOAD DATA LOCAL INPATH 'C:/Users/Me/Documents/examples/src/main/resources/data.csv' INTO TABLE csvFile");


Can we upload a csv dataset into Hive using SparkSQL?

2022-12-10 Thread sam smith
Hello,

I want to create a table in Hive and then load a CSV file's content into it,
all by means of Spark SQL.
I saw in the docs the example with the .txt file, BUT can we instead do
something like the following to accomplish what I want?

String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate();
spark.sql("CREATE TABLE IF NOT EXISTS csvFile USING hive");
spark.sql("LOAD DATA LOCAL INPATH 'C:/Users/Me/Documents/examples/src/main/resources/data.csv' INTO TABLE csvFile");


Re: external table with parquet files: problem querying in sparksql since data is stored as integer while hive schema expects a timestamp

2022-07-24 Thread Gourav Sengupta
Hi,

please try to query the table directly by loading the hive metastore (we
can do that quite easily in AWS EMR, but we can do things quite easily with
everything in AWS), rather than querying the s3 location directly.


Regards,
Gourav

On Wed, Jul 20, 2022 at 9:51 PM Joris Billen 
wrote:

> Hi,
> below sounds like something that someone will have experienced...
> I have external tables of parquet files with a hive table defined on top
> of the data. I dont manage/know the details of how the data lands.
> For some tables no issues when querying through spark.
> But for others there is an issue: looks like the datatype for hive is
> timestamp, but the parquet file contains an integer number =microseconds:
> if I access the table in spark sql I get:
>
> *Unable to create Parquet converter for data type “timestamp” whose
> Parquet type is optional int64 airac_wef (TIMESTAMP(MICROS,false))*
>
> OR
>
> *Parquet column cannot be converted in file
> abfs://somelocation/table/partition=484/00_0. Column: [atimefield],
> Expected: timestamp, Found: INT64*
>
>
>
> Anyone encountered this? I tried several sorts of CAST but no success yet.
> I see similar problems on forums (like this:
> https://stackoverflow.com/questions/59096125/spark-2-4-parquet-column-cannot-be-converted-in-file-column-impressions-exp
> ) but no solution.
>
>
> Thanks for input!
>


external table with parquet files: problem querying in sparksql since data is stored as integer while hive schema expects a timestamp

2022-07-20 Thread Joris Billen
Hi,
below sounds like something that someone will have experienced...
I have external tables of parquet files with a Hive table defined on top of the 
data. I don't manage/know the details of how the data lands.
For some tables there are no issues when querying through Spark.
But for others there is an issue: it looks like the datatype for Hive is 
timestamp, but the parquet file contains an integer number (= microseconds); if I 
access the table in Spark SQL I get:

Unable to create Parquet converter for data type “timestamp” whose Parquet type 
is optional int64 airac_wef (TIMESTAMP(MICROS,false))

OR

Parquet column cannot be converted in file 
abfs://somelocation/table/partition=484/00_0. Column: [atimefield], 
Expected: timestamp, Found: INT64



Anyone encountered this? I tried several sorts of CAST but no success yet. I 
see similar problems on forums (like this: 
https://stackoverflow.com/questions/59096125/spark-2-4-parquet-column-cannot-be-converted-in-file-column-impressions-exp
 ) but no solution.


Thanks for input!
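One hedged workaround while the metastore schema and the files disagree: read the parquet directly, so the schema comes from the files, and convert the microsecond integers by hand (the path and column name are copied from the error messages above; an active SparkSession named spark is assumed):

from pyspark.sql import functions as F

raw = spark.read.parquet("abfs://somelocation/table")      # partition column is discovered from the path

fixed = raw.withColumn(
    "atimefield",
    (F.col("atimefield") / 1000000).cast("timestamp")      # microseconds -> seconds -> timestamp
)
fixed.createOrReplaceTempView("the_table_fixed")           # hypothetical view name for further SQL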


Re: Using Avro file format with SparkSQL

2022-02-17 Thread Artemis User

Please try these two corrections:

1. The --packages isn't the right command line argument for
   spark-submit.  Please use --conf spark.jars.packages=your-package to
   specify Maven packages or define your configuration parameters in
   the spark-defaults.conf file
2. Please check the version number of your spark-avro jar file in
   MavenCentral and see if that version is indeed available and
   compatible with Spark 3.2.  The version we are currently using for
   Spark 3.2 is spark-avro_2.12-3.1.1.jar, not 3.2.0.

BTW, you do have to include the spark-avro lib as a custom jar file.  
The Spark 3.2 distribution includes only the avro libs, not the 
spark-avro lib.  Hope this helps...
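A hedged sketch of the spark.jars.packages route from code (the --packages flag discussed later in the thread resolves the same coordinates); the version below is copied from the original post and should be checked against the running Spark/Scala build, and the cluster needs network access to a Maven repository:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("avro-demo")
         .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.2.0")
         .getOrCreate())

df = spark.range(5).withColumnRenamed("id", "value")
df.write.format("avro").mode("overwrite").save("/tmp/avro-demo")   # hypothetical path
spark.read.format("avro").load("/tmp/avro-demo").show()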


-- ND


On 2/9/22 10:25 PM, Karanika, Anna wrote:

Hello,

I have been trying to use spark SQL’s operations that are related to 
the Avro file format,
e.g., stored as, save, load, in a Java class but they keep failing 
with the following stack trace:


Exception in thread "main" org.apache.spark.sql.AnalysisException: 
 Failed to find data source: avro. Avro is built-in but external data 
source module since Spark 2.4. Please deploy the application as per 
the deployment section of "Apache Avro Data Source Guide".
        at 
org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1032)
        at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
        at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
        at 
org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
        at 
org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
        at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)

        at xsys.fileformats.SparkSQLvsAvro.main(SparkSQLvsAvro.java:57)
        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
        at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

        at java.base/java.lang.reflect.Method.invoke(Method.java:564)
        at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
        at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at 
org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at 
org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
        at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)

        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

For context, I am invoking spark-submit and adding arguments 
--packages org.apache.spark:spark-avro_2.12:3.2.0.

Yet, Spark responds as if the dependency was not added.
I am running spark-v3.2.0 (Scala 2.12).

On the other hand, everything works great with spark-shell or spark-sql.

I would appreciate any advice or feedback to get this running.

Thank you,
Anna



RE: Re: Using Avro file format with SparkSQL

2022-02-14 Thread Morven Huang
Hi Steve, 

You’re correct about the '--packages' option, seems my memory does not serve me 
well :) 

On 2022/02/15 07:04:27 Stephen Coy wrote:
> Hi Morven,
> 
> We use —packages for all of our spark jobs. Spark downloads the specified jar 
> and all of its dependencies from a Maven repository.
> 
> This means we never have to build fat or uber jars.
> 
> It does mean that the Apache Ivy configuration has to be set up correctly 
> though.
> 
> Cheers,
> 
> Steve C
> 
> > On 15 Feb 2022, at 5:58 pm, Morven Huang  wrote:
> >
> > I wrote a toy spark job and ran it within my IDE, same error if I don’t add 
> > spark-avro to my pom.xml. After putting spark-avro dependency to my 
> > pom.xml, everything works fine.
> >
> > Another thing is, if my memory serves me right, the spark-submit options 
> > for extra jars is ‘--jars’ , not ‘--packages’.
> >
> > Regards,
> >
> > Morven Huang
> >
> >
> > On 2022/02/10 03:25:28 "Karanika, Anna" wrote:
> >> Hello,
> >>
> >> I have been trying to use spark SQL’s operations that are related to the 
> >> Avro file format,
> >> e.g., stored as, save, load, in a Java class but they keep failing with 
> >> the following stack trace:
> >>
> >> Exception in thread "main" org.apache.spark.sql.AnalysisException:  Failed 
> >> to find data source: avro. Avro is built-in but external data source 
> >> module since Spark 2.4. Please deploy the application as per the 
> >> deployment section of "Apache Avro Data Source Guide".
> >>at 
> >> org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1032)
> >>at 
> >> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
> >>at 
> >> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
> >>at 
> >> org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
> >>at 
> >> org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
> >>at 
> >> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
> >>at xsys.fileformats.SparkSQLvsAvro.main(SparkSQLvsAvro.java:57)
> >>at 
> >> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> >> Method)
> >>at 
> >> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
> >>at 
> >> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>at java.base/java.lang.reflect.Method.invoke(Method.java:564)
> >>at 
> >> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> >>at 
> >> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
> >>at 
> >> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> >>at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> >>at 
> >> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> >>at 
> >> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
> >>at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
> >>at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> >>
> >> For context, I am invoking spark-submit and adding arguments --packages 
> >> org.apache.spark:spark-avro_2.12:3.2.0.
> >> Yet, Spark responds as if the dependency was not added.
> >> I am running spark-v3.2.0 (Scala 2.12).
> >>
> >> On the other hand, everything works great with spark-shell or spark-sql.
> >>
> >> I would appreciate any advice or feedback to get this running.
> >>
> >> Thank you,
> >> Anna
> >>
> >>
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
> 
> This email contains confidential information of and is the copyright of 
> Infomedia. It must not be forwarded, amended or disclosed without consent of 
> the sender. If you received this message by mistake, please advise the sender 
> and delete all copies. Security of transmission on the internet cannot be 
> guaranteed, could be infected, intercepted, or corrupted and you should 
> ensure you have suitable antivirus protection in place. By sending us your or 
> any third party personal details, you consent to (or confirm you have 
> obtained consent from such third parties) to Infomedia’s privacy policy. 
> http://www.infomedia.com.au/privacy-policy/
> 



Re: Using Avro file format with SparkSQL

2022-02-14 Thread Stephen Coy
Hi Morven,

We use —packages for all of our spark jobs. Spark downloads the specified jar 
and all of its dependencies from a Maven repository.

This means we never have to build fat or uber jars.

It does mean that the Apache Ivy configuration has to be set up correctly 
though.

Cheers,

Steve C

> On 15 Feb 2022, at 5:58 pm, Morven Huang  wrote:
>
> I wrote a toy spark job and ran it within my IDE, same error if I don’t add 
> spark-avro to my pom.xml. After putting spark-avro dependency to my pom.xml, 
> everything works fine.
>
> Another thing is, if my memory serves me right, the spark-submit options for 
> extra jars is ‘--jars’ , not ‘--packages’.
>
> Regards,
>
> Morven Huang
>
>
> On 2022/02/10 03:25:28 "Karanika, Anna" wrote:
>> Hello,
>>
>> I have been trying to use spark SQL’s operations that are related to the 
>> Avro file format,
>> e.g., stored as, save, load, in a Java class but they keep failing with the 
>> following stack trace:
>>
>> Exception in thread "main" org.apache.spark.sql.AnalysisException:  Failed 
>> to find data source: avro. Avro is built-in but external data source module 
>> since Spark 2.4. Please deploy the application as per the deployment section 
>> of "Apache Avro Data Source Guide".
>>at 
>> org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1032)
>>at 
>> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
>>at 
>> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
>>at 
>> org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
>>at 
>> org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
>>at 
>> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
>>at xsys.fileformats.SparkSQLvsAvro.main(SparkSQLvsAvro.java:57)
>>at 
>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
>> Method)
>>at 
>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
>>at 
>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>at java.base/java.lang.reflect.Method.invoke(Method.java:564)
>>at 
>> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>>at 
>> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>>at 
>> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>>at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>>at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>>at 
>> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>>at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>>at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> For context, I am invoking spark-submit and adding arguments --packages 
>> org.apache.spark:spark-avro_2.12:3.2.0.
>> Yet, Spark responds as if the dependency was not added.
>> I am running spark-v3.2.0 (Scala 2.12).
>>
>> On the other hand, everything works great with spark-shell or spark-sql.
>>
>> I would appreciate any advice or feedback to get this running.
>>
>> Thank you,
>> Anna
>>
>>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

This email contains confidential information of and is the copyright of 
Infomedia. It must not be forwarded, amended or disclosed without consent of 
the sender. If you received this message by mistake, please advise the sender 
and delete all copies. Security of transmission on the internet cannot be 
guaranteed, could be infected, intercepted, or corrupted and you should ensure 
you have suitable antivirus protection in place. By sending us your or any 
third party personal details, you consent to (or confirm you have obtained 
consent from such third parties) to Infomedia’s privacy policy. 
http://www.infomedia.com.au/privacy-policy/


RE: Using Avro file format with SparkSQL

2022-02-14 Thread Morven Huang
I wrote a toy spark job and ran it within my IDE, same error if I don’t add 
spark-avro to my pom.xml. After putting spark-avro dependency to my pom.xml, 
everything works fine.

Another thing is, if my memory serves me right, the spark-submit options for 
extra jars is ‘--jars’ , not ‘--packages’. 

Regards, 

Morven Huang


On 2022/02/10 03:25:28 "Karanika, Anna" wrote:
> Hello,
> 
> I have been trying to use spark SQL’s operations that are related to the Avro 
> file format,
> e.g., stored as, save, load, in a Java class but they keep failing with the 
> following stack trace:
> 
> Exception in thread "main" org.apache.spark.sql.AnalysisException:  Failed to 
> find data source: avro. Avro is built-in but external data source module 
> since Spark 2.4. Please deploy the application as per the deployment section 
> of "Apache Avro Data Source Guide".
> at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1032)
> at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
> at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
> at 
> org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
> at 
> org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
> at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
> at xsys.fileformats.SparkSQLvsAvro.main(SparkSQLvsAvro.java:57)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:564)
> at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
> at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 
> For context, I am invoking spark-submit and adding arguments --packages 
> org.apache.spark:spark-avro_2.12:3.2.0.
> Yet, Spark responds as if the dependency was not added.
> I am running spark-v3.2.0 (Scala 2.12).
> 
> On the other hand, everything works great with spark-shell or spark-sql.
> 
> I would appreciate any advice or feedback to get this running.
> 
> Thank you,
> Anna
> 
> 



Re: Using Avro file format with SparkSQL

2022-02-11 Thread Gourav Sengupta
Hi Anna,

Avro libraries should be inbuilt in SPARK in case I am not wrong. Any
particular reason why you are using a deprecated or soon to be deprecated
version of SPARK?

SPARK 3.2.1 is fantastic.

Please do let us know about your set up if possible.


Regards,
Gourav Sengupta

On Thu, Feb 10, 2022 at 3:35 AM Karanika, Anna  wrote:

> Hello,
>
> I have been trying to use spark SQL’s operations that are related to the
> Avro file format,
> e.g., stored as, save, load, in a Java class but they keep failing with
> the following stack trace:
>
> Exception in thread "main" org.apache.spark.sql.AnalysisException:  Failed
> to find data source: avro. Avro is built-in but external data source module
> since Spark 2.4. Please deploy the application as per the deployment
> section of "Apache Avro Data Source Guide".
> at
> org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1032)
> at
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
> at
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
> at
> org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
> at
> org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
> at
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
> at xsys.fileformats.SparkSQLvsAvro.main(SparkSQLvsAvro.java:57)
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
> at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:564)
> at
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> at org.apache.spark.deploy.SparkSubmit.org
> $apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
> at
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
> at
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> For context, I am invoking spark-submit and adding arguments --packages
> org.apache.spark:spark-avro_2.12:3.2.0.
> Yet, Spark responds as if the dependency was not added.
> I am running spark-v3.2.0 (Scala 2.12).
>
> On the other hand, everything works great with spark-shell or spark-sql.
>
> I would appreciate any advice or feedback to get this running.
>
> Thank you,
> Anna
>
>


Re: Using Avro file format with SparkSQL

2022-02-09 Thread frakass

Have you added the dependency in the build.sbt?
Can you 'sbt package' the source successfully?

regards
frakass

On 2022/2/10 11:25, Karanika, Anna wrote:
For context, I am invoking spark-submit and adding arguments --packages 
org.apache.spark:spark-avro_2.12:3.2.0.





Using Avro file format with SparkSQL

2022-02-09 Thread Karanika, Anna
Hello,

I have been trying to use spark SQL’s operations that are related to the Avro 
file format,
e.g., stored as, save, load, in a Java class but they keep failing with the 
following stack trace:

Exception in thread "main" org.apache.spark.sql.AnalysisException:  Failed to 
find data source: avro. Avro is built-in but external data source module since 
Spark 2.4. Please deploy the application as per the deployment section of 
"Apache Avro Data Source Guide".
at 
org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1032)
at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
at 
org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
at 
org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
at xsys.fileformats.SparkSQLvsAvro.main(SparkSQLvsAvro.java:57)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

For context, I am invoking spark-submit and adding arguments --packages 
org.apache.spark:spark-avro_2.12:3.2.0.
Yet, Spark responds as if the dependency was not added.
I am running spark-v3.2.0 (Scala 2.12).

On the other hand, everything works great with spark-shell or spark-sql.

I would appreciate any advice or feedback to get this running.

Thank you,
Anna



Re: SparkSQL vs Dataframe vs Dataset

2021-12-06 Thread yonghua
From my experience, SQL is easy for people who already know SQL syntax. With 
the correct indexing SQL is also fast. But within programs a dataframe is much 
faster and more convenient for loading large data structures from external sources.
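Both SQL and the DataFrame API compile through the same Catalyst optimizer, so for most queries they yield the same plan; the typed Dataset API exists only in Scala/Java. A tiny sketch of the equivalence (an active SparkSession named spark is assumed):

from pyspark.sql import functions as F

df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 5)], ["k", "v"])
df.createOrReplaceTempView("t")

spark.sql("SELECT k, SUM(v) AS total FROM t GROUP BY k").show()      # SQL style
df.groupBy("k").agg(F.sum("v").alias("total")).show()                # DataFrame style, same result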
 
From: "rajat kumar"
To: "user @spark"
Sent: Monday, 6 December 2021 21:49
Subject: SparkSQL vs Dataframe vs Dataset
 

Hi Users,

Is there any use case when we need to use SQL vs Dataframe vs Dataset?

Is there any recommended approach or any advantage/performance gain over others?

Thanks
Rajat
 




SparkSQL vs Dataframe vs Dataset

2021-12-06 Thread rajat kumar
Hi Users,

Is there any use case when we need to use SQL vs Dataframe vs Dataset?

Is there any recommended approach or any advantage/performance gain over
others?

Thanks
Rajat


Re: [SparkSQL] Full Join Return Null Value For Function-Based Column

2021-01-18 Thread 刘 欢
Sorry, I know the reason. closed
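The reason isn't spelled out above; the usual culprit is that SQL equality never matches NULL to NULL, so the NULL-name group from each side of a FULL JOIN comes through unmatched, padded with nulls from the other side. A small sketch with hypothetical data (an active SparkSession named spark is assumed):

a = spark.createDataFrame([("tom",), ("jerry",), (None,)], ["name"])
b = spark.createDataFrame([("tom",), ("jerry",), (None,)], ["name"])

a.join(b, a["name"] == b["name"], "full").show()
# tom and jerry pair up, but each side's NULL row stays unmatched and shows up
# with nulls filled in for the other side, i.e. rows that look entirely null.

a.join(b, a["name"].eqNullSafe(b["name"]), "full").show()
# null-safe equality (<=>) makes the two NULL rows match, if that is what's wanted.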

From: 刘 欢
Date: Monday, January 18, 2021, 1:39 PM
To: "user@spark.apache.org"
Subject: [SparkSQL] Full Join Return Null Value For Function-Based Column

Hi All:
Here I got two tables:

Table A

name  | num
------+------
tom   | 2
jerry | 3
jerry | 4
null  | null






Table B

name  | score
------+------
tom   | 12
jerry | 10
jerry | 8
null  | null






When I use spark.sql() to get the result from A and B with this SQL:


select
  a.name as aName,
  a.date,
  b.name as bName
from
(
select
  name,
  date_format(now(),'yyyy-MM-dd') AS date
from
  A
group by
  name
) a
FULL JOIN
(
select
  name
from
  B
group by
  name
) b
ON a.name = b.name

I got results containing an all-NULL-value row, like:

aName | date | bName
------+------+------
null  | null | null
…     | …    | …

Can anyone explain why the all-null row appears?






[SparkSQL] Full Join Return Null Value For Function-Based Column

2021-01-17 Thread 刘 欢
Hi All:
Here I got two tables:

Table A

name  | num
------+------
tom   | 2
jerry | 3
jerry | 4
null  | null






Table B

name  | score
------+------
tom   | 12
jerry | 10
jerry | 8
null  | null






When I use spark.sql() to get the result from A and B with this SQL:


select
  a.name as aName,
  a.date,
  b.name as bName
from
(
select
  name,
  date_format(now(),'yyyy-MM-dd') AS date
from
  A
group by
  name
) a
FULL JOIN
(
select
  name
from
  B
group by
  name
) b
ON a.name = b.name

I got results containing an all-NULL-value row, like:

aName | date | bName
------+------+------
null  | null | null
…     | …    | …

Can anyone explain why the all-null row appears?






sparksql 2.4.0 java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log

2020-07-09 Thread Ivan Petrov
Hi there!
I'm seeing this exception in Spark Driver log.
Executor log stays empty. No exceptions, nothing.
8 tasks out of 402 failed with this exception.
What is the right way to debug it?
Thank you.

I see that

spark/jars -> minlog-1.3.0.jar


is in driver classpath at least...


java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log
at
com.esotericsoftware.kryo.serializers.FieldSerializerConfig.(FieldSerializerConfig.java:47)
at com.esotericsoftware.kryo.Kryo.(Kryo.java:154)
at com.esotericsoftware.kryo.Kryo.(Kryo.java:161)
at com.twitter.chill.KryoBase.(KryoBase.scala:34)
at
com.twitter.chill.EmptyScalaKryoInstantiator.newKryo(ScalaKryoInstantiator.scala:36)
at
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:97)
at
org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:324)
at
org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:309)
at
org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:218)
at
org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:305)
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:235)
at scala.Option.getOrElse(Option.scala:121)
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1371)
at
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
at
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
at
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
at
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:89)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
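No resolution is recorded in this thread; one hedged way to push the debugging further is to check whether every executor host ships the same minlog jar the driver sees (a sketch; the install paths are guesses, and an active SparkSession named spark is assumed):

import socket

def probe(_):
    import glob
    hits = (glob.glob("/opt/spark/jars/*minlog*")
            + glob.glob("/usr/lib/spark/jars/*minlog*"))
    host = socket.gethostname()
    return [(host, p) for p in hits] or [(host, "<no minlog jar found>")]

found = (spark.sparkContext
              .parallelize(range(32), 32)     # usually enough tasks to touch every executor
              .flatMap(probe)
              .distinct()
              .collect())
for host, path in sorted(found):
    print(host, path)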


Re: sparksql 2.4.0 java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log

2020-07-09 Thread Ivan Petrov
spark/jars -> minlog-1.3.0.jar

I see that jar is there. What do I do wrong?


чт, 9 июл. 2020 г. в 20:43, Ivan Petrov :

> Hi there!
> I'm seeing this exception in Spark Driver log.
> Executor log stays empty. No exceptions, nothing.
> 8 tasks out of 402 failed with this exception.
> What is the right way to debug it?
> Thank you.
>
> java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log
> at
> com.esotericsoftware.kryo.serializers.FieldSerializerConfig.(FieldSerializerConfig.java:47)
> at com.esotericsoftware.kryo.Kryo.(Kryo.java:154)
> at com.esotericsoftware.kryo.Kryo.(Kryo.java:161)
> at com.twitter.chill.KryoBase.(KryoBase.scala:34)
> at
> com.twitter.chill.EmptyScalaKryoInstantiator.newKryo(ScalaKryoInstantiator.scala:36)
> at
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:97)
> at
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:324)
> at
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:309)
> at
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:218)
> at
> org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:305)
> at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:235)
> at scala.Option.getOrElse(Option.scala:121)
> at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1371)
> at
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
> at
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> at
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> at
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:89)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>


sparksql 2.4.0 java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log

2020-07-09 Thread Ivan Petrov
Hi there!
I'm seeing this exception in Spark Driver log.
Executor log stays empty. No exceptions, nothing.
8 tasks out of 402 failed with this exception.
What is the right way to debug it?
Thank you.

java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log
at
com.esotericsoftware.kryo.serializers.FieldSerializerConfig.(FieldSerializerConfig.java:47)
at com.esotericsoftware.kryo.Kryo.(Kryo.java:154)
at com.esotericsoftware.kryo.Kryo.(Kryo.java:161)
at com.twitter.chill.KryoBase.(KryoBase.scala:34)
at
com.twitter.chill.EmptyScalaKryoInstantiator.newKryo(ScalaKryoInstantiator.scala:36)
at
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:97)
at
org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:324)
at
org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:309)
at
org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:218)
at
org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:305)
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:235)
at scala.Option.getOrElse(Option.scala:121)
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1371)
at
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
at
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
at
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
at
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:89)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)


Re: pyspark(sparksql-v 2.4) cannot read hive table which is created

2020-03-16 Thread dominic kim
I solved the problem with the option below
spark.sql ("SET spark.hadoop.metastore.catalog.default = hive") 
spark.sql ("SET spark.sql.hive.convertMetastoreOrc = false")









pyspark(sparksql-v 2.4) cannot read hive table which is created

2020-03-16 Thread dominic kim
I use the related Spark config values, but it does not work, as below (it succeeded in Spark 2.1.1):
spark.hive.mapred.supports.subdirectories=true
spark.hive.supports.subdirectories=true
spark.mapred.input.dir.recursive=true
spark.hive.mapred.supports.subdirectories=true

And when I query, I also use the related Hive configs, but it still does not work:
mapred.input.dir.recursive=true
hive.mapred.supports.subdirectories=true

I already know that if I load a path like
'/user/test/warehouse/somedb.db/dt=20200312/*/' as a Dataframe in pyspark, it
works. But for complex business logic, I should use spark.sql().

Please give me advice.
Thanks !

* Code
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext, SparkSession
if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Sub-Directory Test") \
        .enableHiveSupport() \
        .getOrCreate()
    spark.sql("select * from somedb.table where dt = '20200301' limit 10").show()

* Hive table directory path
/user/test/warehouse/somedb.db/dt=20200312/1/00_0
/user/test/warehouse/somedb.db/dt=20200312/1/00_1
.
.
/user/test/warehouse/somedb.db/dt=20200312/2/00_0
/user/test/warehouse/somedb.db/dt=20200312/3/00_0






Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Mich Talebzadeh
Have you updated partition statistics by any chance?

I assume you can access the table and data though Hive itself?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 7 Aug 2019 at 21:07, Patrick McCarthy 
wrote:

> Do the permissions on the hive table files on HDFS correspond with what
> the spark user is able to read? This might arise from spark being run as
> different users.
>
> On Wed, Aug 7, 2019 at 3:15 PM Rishikesh Gawade 
> wrote:
>
>> Hi,
>> I did not explicitly create a Hive Context. I have been using the
>> spark.sqlContext that gets created upon launching the spark-shell.
>> Isn't this sqlContext same as the hiveContext?
>> Thanks,
>> Rishikesh
>>
>> On Wed, Aug 7, 2019 at 12:43 PM Jörn Franke  wrote:
>>
>>> Do you use the HiveContext in Spark? Do you configure the same options
>>> there? Can you share some code?
>>>
>>> Am 07.08.2019 um 08:50 schrieb Rishikesh Gawade <
>>> rishikeshg1...@gmail.com>:
>>>
>>> Hi.
>>> I am using Spark 2.3.2 and Hive 3.1.0.
>>> Even if i use parquet files the result would be same, because after all
>>> sparkSQL isn't able to descend into the subdirectories over which the table
>>> is created. Could there be any other way?
>>> Thanks,
>>> Rishikesh
>>>
>>> On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh 
>>> wrote:
>>>
>>>> which versions of Spark and Hive are you using.
>>>>
>>>> what will happen if you use parquet tables instead?
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade 
>>>> wrote:
>>>>
>>>>> Hi.
>>>>> I have built a Hive external table on top of a directory 'A' which has
>>>>> data stored in ORC format. This directory has several subdirectories 
>>>>> inside
>>>>> it, each of which contains the actual ORC files.
>>>>> These subdirectories are actually created by spark jobs which ingest
>>>>> data from other sources and write it into this directory.
>>>>> I tried creating a table and setting the table properties of the same
>>>>> as *hive.mapred.supports.subdirectories=TRUE* and
>>>>> *mapred.input.dir.recursive**=TRUE*.
>>>>> As a result of this, when i fire the simplest query of *select
>>>>> count(*) from ExtTable* via the Hive CLI, it successfully gives me
>>>>> the expected count of records in the table.
>>>>> However, when i fire the same query via sparkSQL, i get count = 0.
>>>>>
>>>>> I think the sparkSQL isn't able to descend into the subdirectories for
>>>>> getting the data while hive is able to do so.
>>>>> Are there any configurations needed to be set on the spark side so
>>>>> that this works as it does via hive cli?
>>>>> I am using Spark on YARN.
>>>>>
>>>>> Thanks,
>>>>> Rishikesh
>>>>>
>>>>> Tags: subdirectories, subdirectory, recursive, recursion, hive
>>>>> external table, orc, sparksql, yarn
>>>>>
>>>>
>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>
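A hedged workaround, independent of the replies above: bypass the table definition and point Spark at the ORC files themselves with a glob that reaches into the sub-directories (paths are placeholders for the layout described in the question; an active SparkSession named spark is assumed):

df = spark.read.orc("/path/to/A/*")          # one level of sub-directories
df.createOrReplaceTempView("ext_table_direct")
spark.sql("SELECT count(*) FROM ext_table_direct").show()

# On Spark 3.0+ (not the 2.3.2 mentioned above) arbitrarily nested layouts can be
# read with: spark.read.option("recursiveFileLookup", "true").orc("/path/to/A")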


Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Patrick McCarthy
Do the permissions on the hive table files on HDFS correspond with what the
spark user is able to read? This might arise from spark being run as
different users.

On Wed, Aug 7, 2019 at 3:15 PM Rishikesh Gawade 
wrote:

> Hi,
> I did not explicitly create a Hive Context. I have been using the
> spark.sqlContext that gets created upon launching the spark-shell.
> Isn't this sqlContext same as the hiveContext?
> Thanks,
> Rishikesh
>
> On Wed, Aug 7, 2019 at 12:43 PM Jörn Franke  wrote:
>
>> Do you use the HiveContext in Spark? Do you configure the same options
>> there? Can you share some code?
>>
>> Am 07.08.2019 um 08:50 schrieb Rishikesh Gawade > >:
>>
>> Hi.
>> I am using Spark 2.3.2 and Hive 3.1.0.
>> Even if i use parquet files the result would be same, because after all
>> sparkSQL isn't able to descend into the subdirectories over which the table
>> is created. Could there be any other way?
>> Thanks,
>> Rishikesh
>>
>> On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh 
>> wrote:
>>
>>> which versions of Spark and Hive are you using.
>>>
>>> what will happen if you use parquet tables instead?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade 
>>> wrote:
>>>
>>>> Hi.
>>>> I have built a Hive external table on top of a directory 'A' which has
>>>> data stored in ORC format. This directory has several subdirectories inside
>>>> it, each of which contains the actual ORC files.
>>>> These subdirectories are actually created by spark jobs which ingest
>>>> data from other sources and write it into this directory.
>>>> I tried creating a table and setting the table properties of the same
>>>> as *hive.mapred.supports.subdirectories=TRUE* and
>>>> *mapred.input.dir.recursive**=TRUE*.
>>>> As a result of this, when i fire the simplest query of *select
>>>> count(*) from ExtTable* via the Hive CLI, it successfully gives me the
>>>> expected count of records in the table.
>>>> However, when i fire the same query via sparkSQL, i get count = 0.
>>>>
>>>> I think the sparkSQL isn't able to descend into the subdirectories for
>>>> getting the data while hive is able to do so.
>>>> Are there any configurations needed to be set on the spark side so that
>>>> this works as it does via hive cli?
>>>> I am using Spark on YARN.
>>>>
>>>> Thanks,
>>>> Rishikesh
>>>>
>>>> Tags: subdirectories, subdirectory, recursive, recursion, hive external
>>>> table, orc, sparksql, yarn
>>>>
>>>

-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016


Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Rishikesh Gawade
Hi,
I did not explicitly create a Hive Context. I have been using the
spark.sqlContext that gets created upon launching the spark-shell.
Isn't this sqlContext same as the hiveContext?
Thanks,
Rishikesh

On Wed, Aug 7, 2019 at 12:43 PM Jörn Franke  wrote:

> Do you use the HiveContext in Spark? Do you configure the same options
> there? Can you share some code?
>
> Am 07.08.2019 um 08:50 schrieb Rishikesh Gawade  >:
>
> Hi.
> I am using Spark 2.3.2 and Hive 3.1.0.
> Even if i use parquet files the result would be same, because after all
> sparkSQL isn't able to descend into the subdirectories over which the table
> is created. Could there be any other way?
> Thanks,
> Rishikesh
>
> On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh 
> wrote:
>
>> which versions of Spark and Hive are you using.
>>
>> what will happen if you use parquet tables instead?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade 
>> wrote:
>>
>>> Hi.
>>> I have built a Hive external table on top of a directory 'A' which has
>>> data stored in ORC format. This directory has several subdirectories inside
>>> it, each of which contains the actual ORC files.
>>> These subdirectories are actually created by spark jobs which ingest
>>> data from other sources and write it into this directory.
>>> I tried creating a table and setting the table properties of the same as
>>> *hive.mapred.supports.subdirectories=TRUE* and
>>> *mapred.input.dir.recursive**=TRUE*.
>>> As a result of this, when i fire the simplest query of *select count(*)
>>> from ExtTable* via the Hive CLI, it successfully gives me the expected
>>> count of records in the table.
>>> However, when i fire the same query via sparkSQL, i get count = 0.
>>>
>>> I think the sparkSQL isn't able to descend into the subdirectories for
>>> getting the data while hive is able to do so.
>>> Are there any configurations needed to be set on the spark side so that
>>> this works as it does via hive cli?
>>> I am using Spark on YARN.
>>>
>>> Thanks,
>>> Rishikesh
>>>
>>> Tags: subdirectories, subdirectory, recursive, recursion, hive external
>>> table, orc, sparksql, yarn
>>>
>>


Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Jörn Franke
Do you use the HiveContext in Spark? Do you configure the same options there? 
Can you share some code?

> Am 07.08.2019 um 08:50 schrieb Rishikesh Gawade :
> 
> Hi.
> I am using Spark 2.3.2 and Hive 3.1.0. 
> Even if i use parquet files the result would be same, because after all 
> sparkSQL isn't able to descend into the subdirectories over which the table 
> is created. Could there be any other way?
> Thanks,
> Rishikesh
> 
>> On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh  
>> wrote:
>> which versions of Spark and Hive are you using.
>> 
>> what will happen if you use parquet tables instead?
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> 
>>> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade  
>>> wrote:
>>> Hi.
>>> I have built a Hive external table on top of a directory 'A' which has data 
>>> stored in ORC format. This directory has several subdirectories inside it, 
>>> each of which contains the actual ORC files.
>>> These subdirectories are actually created by spark jobs which ingest data 
>>> from other sources and write it into this directory.
>>> I tried creating a table and setting the table properties of the same as 
>>> hive.mapred.supports.subdirectories=TRUE and 
>>> mapred.input.dir.recursive=TRUE.
>>> As a result of this, when i fire the simplest query of select count(*) from 
>>> ExtTable via the Hive CLI, it successfully gives me the expected count of 
>>> records in the table.
>>> However, when i fire the same query via sparkSQL, i get count = 0.
>>> 
>>> I think the sparkSQL isn't able to descend into the subdirectories for 
>>> getting the data while hive is able to do so.
>>> Are there any configurations needed to be set on the spark side so that 
>>> this works as it does via hive cli? 
>>> I am using Spark on YARN.
>>> 
>>> Thanks,
>>> Rishikesh
>>> 
>>> Tags: subdirectories, subdirectory, recursive, recursion, hive external 
>>> table, orc, sparksql, yarn


Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Rishikesh Gawade
Hi.
I am using Spark 2.3.2 and Hive 3.1.0.
Even if I use parquet files the result would be the same, because
sparkSQL isn't able to descend into the subdirectories over which the table
is created. Could there be any other way?
Thanks,
Rishikesh

On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh 
wrote:

> which versions of Spark and Hive are you using.
>
> what will happen if you use parquet tables instead?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade 
> wrote:
>
>> Hi.
>> I have built a Hive external table on top of a directory 'A' which has
>> data stored in ORC format. This directory has several subdirectories inside
>> it, each of which contains the actual ORC files.
>> These subdirectories are actually created by spark jobs which ingest data
>> from other sources and write it into this directory.
>> I tried creating a table and setting the table properties of the same as
>> *hive.mapred.supports.subdirectories=TRUE* and
>> *mapred.input.dir.recursive**=TRUE*.
>> As a result of this, when i fire the simplest query of *select count(*)
>> from ExtTable* via the Hive CLI, it successfully gives me the expected
>> count of records in the table.
>> However, when i fire the same query via sparkSQL, i get count = 0.
>>
>> I think the sparkSQL isn't able to descend into the subdirectories for
>> getting the data while hive is able to do so.
>> Are there any configurations needed to be set on the spark side so that
>> this works as it does via hive cli?
>> I am using Spark on YARN.
>>
>> Thanks,
>> Rishikesh
>>
>> Tags: subdirectories, subdirectory, recursive, recursion, hive external
>> table, orc, sparksql, yarn
>>
>


Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-06 Thread Mich Talebzadeh
Which versions of Spark and Hive are you using?

What will happen if you use parquet tables instead?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade 
wrote:

> Hi.
> I have built a Hive external table on top of a directory 'A' which has
> data stored in ORC format. This directory has several subdirectories inside
> it, each of which contains the actual ORC files.
> These subdirectories are actually created by spark jobs which ingest data
> from other sources and write it into this directory.
> I tried creating a table and setting the table properties of the same as
> *hive.mapred.supports.subdirectories=TRUE* and
> *mapred.input.dir.recursive**=TRUE*.
> As a result of this, when i fire the simplest query of *select count(*)
> from ExtTable* via the Hive CLI, it successfully gives me the expected
> count of records in the table.
> However, when i fire the same query via sparkSQL, i get count = 0.
>
> I think the sparkSQL isn't able to descend into the subdirectories for
> getting the data while hive is able to do so.
> Are there any configurations needed to be set on the spark side so that
> this works as it does via hive cli?
> I am using Spark on YARN.
>
> Thanks,
> Rishikesh
>
> Tags: subdirectories, subdirectory, recursive, recursion, hive external
> table, orc, sparksql, yarn
>


Hive external table not working in sparkSQL when subdirectories are present

2019-08-06 Thread Rishikesh Gawade
Hi.
I have built a Hive external table on top of a directory 'A' which has data
stored in ORC format. This directory has several subdirectories inside it,
each of which contains the actual ORC files.
These subdirectories are actually created by Spark jobs which ingest data
from other sources and write it into this directory.
I tried creating the table and setting its table properties to
*hive.mapred.supports.subdirectories=TRUE* and *mapred.input.dir.recursive*
*=TRUE*.
As a result of this, when I fire the simplest query, *select count(*)
from ExtTable*, via the Hive CLI, it successfully gives me the expected
count of records in the table.
However, when I fire the same query via sparkSQL, I get count = 0.

I think sparkSQL isn't able to descend into the subdirectories to
get the data, while Hive is able to do so.
Are there any configurations that need to be set on the Spark side so that
this works as it does via the Hive CLI?
I am using Spark on YARN.

Thanks,
Rishikesh

Tags: subdirectories, subdirectory, recursive, recursion, hive external
table, orc, sparksql, yarn
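
A workaround worth trying while the metastore route returns 0 is to bypass the Hive table and point Spark's native ORC reader at the files directly, since a glob can cover one level of subdirectories. A minimal sketch; the path is hypothetical and stands in for directory 'A' above.

// Sketch only: read the ORC files under the subdirectories of 'A' directly.
val df = spark.read.orc("/path/to/A/*")
println(df.count())

// Compare with the metastore-backed table, which returned 0 in this report.
spark.sql("SELECT COUNT(*) FROM ExtTable").show()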


Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-12 Thread Reynold Xin
No sorry I'm not at liberty to share other people's code.

On Fri, Jul 12, 2019 at 9:33 AM, Gourav Sengupta < gourav.sengu...@gmail.com > 
wrote:

> 
> Hi Reynold,
> 
> 
> I am genuinely curious about queries which are more than 1 MB and am
> stunned by tens of MB's. Any samples to share :) 
> 
> 
> Regards,
> Gourav
> 
> On Thu, Jul 11, 2019 at 5:03 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> There is no explicit limit but a JVM string cannot be bigger than 2G. It
>> will also at some point run out of memory with too big of a query plan
>> tree or become incredibly slow due to query planning complexity. I've seen
>> queries that are tens of MBs in size.
>> 
>> 
>> 
>> 
>> 
>> 
>> On Thu, Jul 11, 2019 at 5:01 AM, 李书明 < alemmontree@ 126. com (
>> alemmont...@126.com ) > wrote:
>> 
>>> I have a question about the limit(biggest) of SQL's length that is
>>> supported in SparkSQL. I can't find the answer in the documents of Spark.
>>> 
>>> 
>>> Maybe Integer.MAX_VALUE or not?
>>> 
>> 
>> 
> 
>

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-12 Thread Gourav Sengupta
Hi Reynold,

I am genuinely curious about queries which are more than 1 MB and am
stunned by tens of MB's. Any samples to share :)

Regards,
Gourav

On Thu, Jul 11, 2019 at 5:03 PM Reynold Xin  wrote:

> There is no explicit limit but a JVM string cannot be bigger than 2G. It
> will also at some point run out of memory with too big of a query plan tree
> or become incredibly slow due to query planning complexity. I've seen
> queries that are tens of MBs in size.
>
>
>
> On Thu, Jul 11, 2019 at 5:01 AM, 李书明  wrote:
>
>> I have a question about the limit(biggest) of SQL's length that is
>> supported in SparkSQL. I can't find the answer in the documents of Spark.
>>
>> Maybe Integer.MAX_VALUE or not?
>>
>>
>


Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-11 Thread Reynold Xin
There is no explicit limit but a JVM string cannot be bigger than 2G. It will 
also at some point run out of memory with too big of a query plan tree or 
become incredibly slow due to query planning complexity. I've seen queries that 
are tens of MBs in size.

On Thu, Jul 11, 2019 at 5:01 AM, 李书明 < alemmont...@126.com > wrote:

> 
> I have a question about the limit(biggest) of SQL's length that is
> supported in SparkSQL. I can't find the answer in the documents of Spark.
> 
> 
> Maybe Integer.MAX_VALUE or not?
> 
> 
> 
>

Re: sparksql in sparkR?

2019-06-07 Thread Felix Cheung
This seems to be more a question about the spark-sql shell? I may suggest you change
the email title to get more attention.


From: ya 
Sent: Wednesday, June 5, 2019 11:48:17 PM
To: user@spark.apache.org
Subject: sparksql in sparkR?

Dear list,

I am trying to use sparksql within my R, I am having the following questions, 
could you give me some advice please? Thank you very much.

1. I connect my R and spark using the library sparkR, probably some of the 
members here also are R users? Do I understand correctly that SparkSQL can be 
connected and triggered via SparkR and used in R (not in sparkR shell of spark)?

2. I ran sparkR library in R, trying to create a new sql database and a table, 
I could not get the database and the table I want. The code looks like below:

library(SparkR)
Sys.setenv(SPARK_HOME='/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7')
sparkR.session(sparkHome=Sys.getenv('/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7'))
sql("create database learnsql; use learnsql")
sql("
create table employee_tbl
(emp_id varchar(10) not null,
emp_name char(10) not null,
emp_st_addr char(10) not null,
emp_city char(10) not null,
emp_st char(10) not null,
emp_zip integer(5) not null,
emp_phone integer(10) null,
emp_pager integer(10) null);
insert into employee_tbl values ('0001','john','yanlanjie 
1','gz','jiaoqiaojun','510006','1353');
select*from employee_tbl;
“)

I ran the following code in spark-sql shell, I get the database learnsql, 
however, I still can’t get the table.

spark-sql> create database learnsql;show databases;
19/06/06 14:42:36 INFO HiveMetaStore: 0: create_database: 
Database(name:learnsql, description:, 
locationUri:file:/Users/ya/spark-warehouse/learnsql.db, parameters:{})
19/06/06 14:42:36 INFO audit: ugi=yaip=unknown-ip-addr  
cmd=create_database: Database(name:learnsql, description:, 
locationUri:file:/Users/ya/spark-warehouse/learnsql.db, parameters:{})
Error in query: org.apache.hadoop.hive.metastore.api.AlreadyExistsException: 
Database learnsql already exists;

spark-sql> create table employee_tbl
 > (emp_id varchar(10) not null,
 > emp_name char(10) not null,
 > emp_st_addr char(10) not null,
 > emp_city char(10) not null,
 > emp_st char(10) not null,
 > emp_zip integer(5) not null,
 > emp_phone integer(10) null,
 > emp_pager integer(10) null);
Error in query:
no viable alternative at input 'create table employee_tbl\n(emp_id varchar(10) 
not'(line 2, pos 20)

== SQL ==
create table employee_tbl
(emp_id varchar(10) not null,
^^^
emp_name char(10) not null,
emp_st_addr char(10) not null,
emp_city char(10) not null,
emp_st char(10) not null,
emp_zip integer(5) not null,
emp_phone integer(10) null,
emp_pager integer(10) null)

spark-sql> insert into employee_tbl values ('0001','john','yanlanjie 
1','gz','jiaoqiaojun','510006','1353');
19/06/06 14:43:43 INFO HiveMetaStore: 0: get_table : db=default tbl=employee_tbl
19/06/06 14:43:43 INFO audit: ugi=yaip=unknown-ip-addr  cmd=get_table : 
db=default tbl=employee_tbl
Error in query: Table or view not found: employee_tbl; line 1 pos 0


Does sparkSQL has different coding grammar? What did I miss?

Thank you very much.

Best regards,

YA







sparksql in sparkR?

2019-06-06 Thread ya
Dear list,

I am trying to use sparksql within R. I have the following questions;
could you give me some advice, please? Thank you very much.

1. I connect my R and Spark using the SparkR library; probably some of the
members here are also R users? Do I understand correctly that SparkSQL can be
connected and triggered via SparkR and used from R (not from Spark's sparkR shell)?

2. I ran the SparkR library in R, trying to create a new SQL database and a table,
but I could not get the database and the table I want. The code looks like below:

library(SparkR)
Sys.setenv(SPARK_HOME='/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7') 
sparkR.session(sparkHome=Sys.getenv('/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7'))
sql("create database learnsql; use learnsql")
sql("
create table employee_tbl
(emp_id varchar(10) not null,
emp_name char(10) not null,
emp_st_addr char(10) not null,
emp_city char(10) not null,
emp_st char(10) not null,
emp_zip integer(5) not null,
emp_phone integer(10) null,
emp_pager integer(10) null);
insert into employee_tbl values ('0001','john','yanlanjie 
1','gz','jiaoqiaojun','510006','1353');
select*from employee_tbl;
“)

I ran the following code in the spark-sql shell; I get the database learnsql,
however, I still can’t get the table.

spark-sql> create database learnsql;show databases;
19/06/06 14:42:36 INFO HiveMetaStore: 0: create_database: 
Database(name:learnsql, description:, 
locationUri:file:/Users/ya/spark-warehouse/learnsql.db, parameters:{})
19/06/06 14:42:36 INFO audit: ugi=yaip=unknown-ip-addr  
cmd=create_database: Database(name:learnsql, description:, 
locationUri:file:/Users/ya/spark-warehouse/learnsql.db, parameters:{})   
Error in query: org.apache.hadoop.hive.metastore.api.AlreadyExistsException: 
Database learnsql already exists;

spark-sql> create table employee_tbl
 > (emp_id varchar(10) not null,
 > emp_name char(10) not null,
 > emp_st_addr char(10) not null,
 > emp_city char(10) not null,
 > emp_st char(10) not null,
 > emp_zip integer(5) not null,
 > emp_phone integer(10) null,
 > emp_pager integer(10) null);
Error in query: 
no viable alternative at input 'create table employee_tbl\n(emp_id varchar(10) 
not'(line 2, pos 20)

== SQL ==
create table employee_tbl
(emp_id varchar(10) not null,
^^^
emp_name char(10) not null,
emp_st_addr char(10) not null,
emp_city char(10) not null,
emp_st char(10) not null,
emp_zip integer(5) not null,
emp_phone integer(10) null,
emp_pager integer(10) null)

spark-sql> insert into employee_tbl values ('0001','john','yanlanjie 
1','gz','jiaoqiaojun','510006','1353');
19/06/06 14:43:43 INFO HiveMetaStore: 0: get_table : db=default tbl=employee_tbl
19/06/06 14:43:43 INFO audit: ugi=yaip=unknown-ip-addr  cmd=get_table : 
db=default tbl=employee_tbl 
Error in query: Table or view not found: employee_tbl; line 1 pos 0


Does sparkSQL have a different SQL syntax? What did I miss?

Thank you very much.

Best regards,

YA
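
Two details of the script trip up Spark SQL itself rather than SparkR: sql() runs a single statement per call (the semicolon-separated string is parsed as one statement), and Spark's CREATE TABLE grammar in these versions does not accept column-level NOT NULL constraints or integer(5)-style precision. Below is a minimal sketch of statements Spark does accept, shown in Scala with spark.sql(...); the same SQL strings can be passed one at a time to SparkR's sql(), and the trailing NULL fills the eighth column that the original INSERT left out.

// Sketch only: one statement per sql() call, Spark-compatible types, no NOT NULL constraints.
spark.sql("CREATE DATABASE IF NOT EXISTS learnsql")
spark.sql("USE learnsql")
spark.sql("""
  CREATE TABLE IF NOT EXISTS employee_tbl (
    emp_id      STRING,
    emp_name    STRING,
    emp_st_addr STRING,
    emp_city    STRING,
    emp_st      STRING,
    emp_zip     INT,
    emp_phone   BIGINT,
    emp_pager   BIGINT
  )
""")
spark.sql("""
  INSERT INTO employee_tbl VALUES
    ('0001', 'john', 'yanlanjie 1', 'gz', 'jiaoqiaojun', 510006, 1353, NULL)
""")
spark.sql("SELECT * FROM employee_tbl").show()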







[SparkSQL, user-defined Hadoop, K8s] Hadoop free spark on kubernetes => NoClassDefFound

2019-03-07 Thread Sommer Tobias
Hi all,

we are having problems using a custom Hadoop lib in a Spark image
when running it on a Kubernetes cluster while following the steps of the
documentation.

Details in the description below.



Has anyone else had similar problems? Is there something missing in the setup
below?

Or is this a bug?



Hadoop free spark on kubernetes





Using custom Hadoop libraries in a Spark image
does not work when following the steps of the documentation (*)
for running SparkPi on a Kubernetes cluster.





*Usage of hadoop free build:

https://spark.apache.org/docs/2.4.0/hadoop-provided.html





Steps:
1. Download Hadoop-free Spark: spark-2.4.0-bin-without-hadoop.tgz
2. Build a Spark image without Hadoop from this with docker-image-tool.sh
3. Create a Dockerfile that adds an image layer on top of the Spark-without-Hadoop image with a custom Hadoop
   (see: Dockerfile and conf/spark-enf.sh in gist) ==> custom Hadoop version 2.9.2
4. Use the custom Hadoop Spark image to run the Spark examples
   (see: k8s submit below)
5. This produces a JNI error (see message below); expected instead is the computation of pi.





See files in gist

https://gist.github.com/HectorOvid/c0bdad1b9dc8f64540b5b34e73f2a4a1





Regards,

Tobias Sommer
M.Sc. (Uni)
Team eso-IN-Swarm
Software Engineer


e.solutions GmbH
Despag-Str. 4a, 85055 Ingolstadt, Germany

Phone +49-8458-3332-1219
Fax +49-8458-3332-2219
tobias.som...@esolutions.de

Registered Office:
Despag-Str. 4a, 85055 Ingolstadt, Germany

e.solutions GmbH
Managing Directors Uwe Reder, Dr. Riclef Schmidt-Clausen
Register Court Ingolstadt HRB 5221



SparkSql query on a port and process queries

2019-01-15 Thread Soheil Pourbafrani
Hi,
In my problem, data is stored in both a database and HDFS. I created an
application in which, according to the query, Spark loads the data, processes
the query, and returns the answer.

I'm looking for a service that gets SQL queries and returns the answers
(like a database's command line). Is there a way for my application to listen on
a port, get the query, and return the answer there?
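
One built-in option that matches this description is the Spark Thrift Server, which listens on a port and serves SQL over JDBC/ODBC using the HiveServer2 protocol. Below is a minimal sketch of starting it from inside an application so that registered views are queryable remotely; it assumes the spark-hive-thriftserver module is on the classpath, and the data path and port setting are placeholders to adapt.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Sketch only: expose Spark SQL on a port via an in-process HiveServer2-compatible endpoint.
val spark = SparkSession.builder()
  .appName("sql-endpoint")
  .enableHiveSupport()
  .getOrCreate()

// Register whatever should be queryable (HDFS files, JDBC tables, ...); path is hypothetical.
spark.read.parquet("/data/events").createOrReplaceTempView("events")

// Assumed port setting; 10000 is the conventional HiveServer2 thrift port.
spark.sqlContext.setConf("hive.server2.thrift.port", "10000")

// Clients (e.g. beeline) can then connect with jdbc:hive2://<host>:10000 and query the views.
HiveThriftServer2.startWithContext(spark.sqlContext)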


Re: Need help with SparkSQL Query

2018-12-17 Thread Ramandeep Singh Nanda
You can use analytical functions in Spark SQL.

Something like: select * from (select *, row_number() over (partition by id
order by timestamp) as rn from root) where rn = 1

On Mon, Dec 17, 2018 at 4:03 PM Nikhil Goyal  wrote:

> Hi guys,
>
> I have a dataframe of type Record (id: Long, timestamp: Long, isValid:
> Boolean,  other metrics)
>
> Schema looks like this:
> root
>  |-- id: long (nullable = true)
>  |-- timestamp: long (nullable = true)
>  |-- isValid: boolean (nullable = true)
> .
>
> I need to find the earliest valid record per id. In RDD world I can do
> groupBy 'id' and find the earliest one but I am not sure how I can do it in
> SQL. Since I am doing this in PySpark I cannot really use DataSet API for
> this.
>
> One thing I can do is groupBy 'id', find the earliest timestamp available
> and then join with the original dataframe to get the right record (all the
> metrics).
>
> Or I can create a single column with all the records and then implement a
> UDAF in scala and use it in pyspark.
>
> Both solutions don't seem to be straight forward. Is there a simpler
> solution to this?
>
> Thanks
> Nikhil
>


-- 
Regards,
Ramandeep Singh
http://orastack.com
+13474792296
ramannan...@gmail.com


Re: Need help with SparkSQL Query

2018-12-17 Thread Patrick McCarthy
Untested, but something like the below should work:

from pyspark.sql import functions as F
from pyspark.sql import window as W

(record
 .filter(F.col('isValid'))  # keep only the valid records
 .withColumn('ts_rank',
             F.dense_rank().over(W.Window.partitionBy('id').orderBy('timestamp')))
 .filter(F.col('ts_rank') == 1)  # earliest timestamp per id
 .drop('ts_rank')
)


On Mon, Dec 17, 2018 at 4:04 PM Nikhil Goyal  wrote:

> Hi guys,
>
> I have a dataframe of type Record (id: Long, timestamp: Long, isValid:
> Boolean,  other metrics)
>
> Schema looks like this:
> root
>  |-- id: long (nullable = true)
>  |-- timestamp: long (nullable = true)
>  |-- isValid: boolean (nullable = true)
> .
>
> I need to find the earliest valid record per id. In RDD world I can do
> groupBy 'id' and find the earliest one but I am not sure how I can do it in
> SQL. Since I am doing this in PySpark I cannot really use DataSet API for
> this.
>
> One thing I can do is groupBy 'id', find the earliest timestamp available
> and then join with the original dataframe to get the right record (all the
> metrics).
>
> Or I can create a single column with all the records and then implement a
> UDAF in scala and use it in pyspark.
>
> Both solutions don't seem to be straight forward. Is there a simpler
> solution to this?
>
> Thanks
> Nikhil
>


Need help with SparkSQL Query

2018-12-17 Thread Nikhil Goyal
Hi guys,

I have a dataframe of type Record (id: Long, timestamp: Long, isValid:
Boolean,  other metrics)

Schema looks like this:
root
 |-- id: long (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- isValid: boolean (nullable = true)
.

I need to find the earliest valid record per id. In RDD world I can do
groupBy 'id' and find the earliest one but I am not sure how I can do it in
SQL. Since I am doing this in PySpark I cannot really use DataSet API for
this.

One thing I can do is groupBy 'id', find the earliest timestamp available
and then join with the original dataframe to get the right record (all the
metrics).

Or I can create a single column with all the records and then implement a
UDAF in scala and use it in pyspark.

Neither solution seems straightforward. Is there a simpler
solution to this?

Thanks
Nikhil
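
One more option that avoids both the join-back and a UDAF is to aggregate a struct whose first field is the timestamp, since min() on a struct compares field by field. A minimal sketch in Scala (the same functions exist in pyspark.sql.functions); metric_a is a hypothetical stand-in for the other metric columns.

import org.apache.spark.sql.functions.{col, min, struct}

// Sketch only: the struct puts timestamp first, so min() keeps the earliest row's values per id.
val earliest = df
  .filter(col("isValid"))                                          // only valid records
  .groupBy("id")
  .agg(min(struct("timestamp", "isValid", "metric_a")).as("rec"))
  .select(col("id"), col("rec.timestamp"), col("rec.metric_a"))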


Re: SparkSQL read Hive transactional table

2018-10-17 Thread Gourav Sengupta
Hi,

I think that the speed of ORC has been improved in the latest versions. Any
chance you could use the latest version?

Regards,
Gourav Sengupta

On 17 Oct 2018 6:11 am, "daily"  wrote:

Hi,

Spark version: 2.3.0
Hive   version: 2.1.0

Best regards.


-- Original Message --
*From:* "Gourav Sengupta";
*Sent:* Tuesday, 16 October 2018, 6:35 PM
*To:* "daily";
*Cc:* "user"; "dev";
*Subject:* Re: SparkSQL read Hive transactional table

Hi,

can I please ask which version of Hive and Spark are you using?

Regards,
Gourav Sengupta

On Tue, Oct 16, 2018 at 2:42 AM daily  wrote:

> Hi,
>
> I use HCatalog Streaming Mutation API to write data to hive transactional
> table, and then, I use SparkSQL to read data from the hive transactional
> table. I get the right result.
> However, SparkSQL uses more time to read hive orc bucket transactional
> table, beacause SparkSQL read all columns(not The columns involved in SQL)
> so it uses more time.
> My question is why that SparkSQL read all columns of hive orc bucket
> transactional table, but not the columns involved in SQL? Is it possible to
> control the SparkSQL read the columns involved in SQL?
>
>
>
> For example:
> Hive Table:
> create table dbtest.t_a1 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6
> int) partitioned by(sd string,st string) clustered by(t0) into 10 buckets
> stored as orc TBLPROPERTIES ('transactional'='true');
>
> create table dbtest.t_a2 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6
> int) partitioned by(sd string,st string) clustered by(t0) into 10 buckets
> stored as orc TBLPROPERTIES ('transactional'='false');
>
> SparkSQL:
> select sum(t1),sum(t2) from dbtest.t_a1 group by t0;
> select sum(t1),sum(t2) from dbtest.t_a2 group by t0;
>
> SparkSQL's stage Input size:
>
> dbtest.t_a1=113.9 GB,
>
> dbtest.t_a2=96.5 MB
>
>
>
> Best regards.
>
>
>
>


Re: SparkSQL read Hive transactional table

2018-10-16 Thread daily
Hi,


Spark version: 2.3.0
Hive   version: 2.1.0



Best regards.





-- Original Message --
From: "Gourav Sengupta";
Sent: Tuesday, 16 October 2018, 6:35 PM
To: "daily";
Cc: "user"; "dev";
Subject: Re: SparkSQL read Hive transactional table



Hi,

can I please ask which version of Hive and Spark are you using?


Regards,
Gourav Sengupta


On Tue, Oct 16, 2018 at 2:42 AM daily  wrote:

 
Hi,

I use the HCatalog Streaming Mutation API to write data to a Hive transactional
table, and then I use SparkSQL to read data from the Hive transactional table.
I get the right result.
However, SparkSQL uses more time to read the Hive ORC bucketed transactional
table, because SparkSQL reads all columns (not just the columns involved in the
SQL), so it uses more time.
My question is: why does SparkSQL read all columns of the Hive ORC bucketed
transactional table instead of only the columns involved in the SQL? Is it
possible to make SparkSQL read only the columns involved in the SQL?

For example:
Hive Table:
create table dbtest.t_a1 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6 int)
partitioned by(sd string,st string) clustered by(t0) into 10 buckets stored as
orc TBLPROPERTIES ('transactional'='true');

create table dbtest.t_a2 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6 int)
partitioned by(sd string,st string) clustered by(t0) into 10 buckets stored as
orc TBLPROPERTIES ('transactional'='false');

SparkSQL:
select sum(t1),sum(t2) from dbtest.t_a1 group by t0;
select sum(t1),sum(t2) from dbtest.t_a2 group by t0;

SparkSQL's stage Input size: dbtest.t_a1 = 113.9 GB, dbtest.t_a2 = 96.5 MB

Best regards.

Re: SparkSQL read Hive transactional table

2018-10-16 Thread Gourav Sengupta
Hi,

can I please ask which version of Hive and Spark are you using?

Regards,
Gourav Sengupta

On Tue, Oct 16, 2018 at 2:42 AM daily  wrote:

> Hi,
>
> I use HCatalog Streaming Mutation API to write data to hive transactional
> table, and then, I use SparkSQL to read data from the hive transactional
> table. I get the right result.
> However, SparkSQL uses more time to read hive orc bucket transactional
> table, beacause SparkSQL read all columns(not The columns involved in SQL)
> so it uses more time.
> My question is why that SparkSQL read all columns of hive orc bucket
> transactional table, but not the columns involved in SQL? Is it possible to
> control the SparkSQL read the columns involved in SQL?
>
>
>
> For example:
> Hive Table:
> create table dbtest.t_a1 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6
> int) partitioned by(sd string,st string) clustered by(t0) into 10 buckets
> stored as orc TBLPROPERTIES ('transactional'='true');
>
> create table dbtest.t_a2 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6
> int) partitioned by(sd string,st string) clustered by(t0) into 10 buckets
> stored as orc TBLPROPERTIES ('transactional'='false');
>
> SparkSQL:
> select sum(t1),sum(t2) from dbtest.t_a1 group by t0;
> select sum(t1),sum(t2) from dbtest.t_a2 group by t0;
>
> SparkSQL's stage Input size:
>
> dbtest.t_a1=113.9 GB,
>
> dbtest.t_a2=96.5 MB
>
>
>
> Best regards.
>
>
>
>


SparkSQL read Hive transactional table

2018-10-15 Thread daily
Hi,

I use the HCatalog Streaming Mutation API to write data to a Hive transactional
table, and then I use SparkSQL to read data from the Hive transactional table.
I get the right result.
However, SparkSQL uses more time to read the Hive ORC bucketed transactional
table, because SparkSQL reads all columns (not just the columns involved in the
SQL), so it uses more time.
My question is: why does SparkSQL read all columns of the Hive ORC bucketed
transactional table instead of only the columns involved in the SQL? Is it
possible to make SparkSQL read only the columns involved in the SQL?

For example:
Hive Table:
create table dbtest.t_a1 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6 int)
partitioned by(sd string,st string) clustered by(t0) into 10 buckets stored as
orc TBLPROPERTIES ('transactional'='true');

create table dbtest.t_a2 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6 int)
partitioned by(sd string,st string) clustered by(t0) into 10 buckets stored as
orc TBLPROPERTIES ('transactional'='false');

SparkSQL:
select sum(t1),sum(t2) from dbtest.t_a1 group by t0;
select sum(t1),sum(t2) from dbtest.t_a2 group by t0;

SparkSQL's stage Input size: dbtest.t_a1 = 113.9 GB, dbtest.t_a2 = 96.5 MB

Best regards.

SparkSQL read Hive transactional table

2018-10-13 Thread wys372b

Hi,
I use the HCatalog Streaming Mutation API to write data to a Hive transactional
table, and then I use SparkSQL to read data from the Hive transactional table.
I get the right result.
However, SparkSQL uses more time to read the Hive ORC bucketed transactional
table, because SparkSQL reads all columns (not just the columns involved in the
SQL), so it uses more time.

My question is: why does SparkSQL read all columns of the Hive ORC bucketed
transactional table instead of only the columns involved in the SQL? Is it
possible to make SparkSQL read only the columns involved in the SQL?

For example:
Hive Table:
create table dbtest.t_a1 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6 int)
partitioned by(sd string,st string) clustered by(t0) into 10 buckets stored as
orc TBLPROPERTIES ('transactional'='true');
create table dbtest.t_a2 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6 int)
partitioned by(sd string,st string) clustered by(t0) into 10 buckets stored as
orc TBLPROPERTIES ('transactional'='false');
SparkSQL:
select sum(t1),sum(t2) from dbtest.t_a1 group by t0;
select sum(t1),sum(t2) from dbtest.t_a2 group by t0;
SparkSQL's stage Input size: dbtest.t_a1 = 113.9 GB, dbtest.t_a2 = 96.5 MB

Best regards.

SparkSQL read Hive transactional table

2018-10-12 Thread daily
Hi,
I use the HCatalog Streaming Mutation API to write data to a Hive transactional
table, and then I use SparkSQL to read data from the Hive transactional table.
I get the right result.
However, SparkSQL uses more time to read the Hive ORC bucketed transactional
table, because SparkSQL reads all columns (not just the columns involved in the
SQL), so it uses more time.
My question is: why does SparkSQL read all columns of the Hive ORC bucketed
transactional table instead of only the columns involved in the SQL? Is it
possible to make SparkSQL read only the columns involved in the SQL?

For example:
Hive Table:
create table dbtest.t_a1 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6 int)
partitioned by(sd string,st string) clustered by(t0) into 10 buckets stored as
orc TBLPROPERTIES ('transactional'='true');
create table dbtest.t_a2 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6 int)
partitioned by(sd string,st string) clustered by(t0) into 10 buckets stored as
orc TBLPROPERTIES ('transactional'='false');
SparkSQL:
select sum(t1),sum(t2) from dbtest.t_a1 group by t0;
select sum(t1),sum(t2) from dbtest.t_a2 group by t0;
SparkSQL's stage Input size: dbtest.t_a1 = 113.9 GB, dbtest.t_a2 = 96.5 MB

Best regards.
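
The input-size gap reported here is consistent with the transactional table being read through the Hive SerDe path (no column pruning pushed to ORC), while the non-transactional table goes through Spark's native ORC reader with a pruned ReadSchema. A quick diagnostic sketch to check which path each query takes is to compare the physical plans:

// Sketch only: a pruned scan shows a narrow ReadSchema; a Hive SerDe scan reads the full row.
spark.sql("SELECT SUM(t1), SUM(t2) FROM dbtest.t_a1 GROUP BY t0").explain()
spark.sql("SELECT SUM(t1), SUM(t2) FROM dbtest.t_a2 GROUP BY t0").explain()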

sparksql exception when using regexp_replace

2018-10-10 Thread 付涛
Hi, Spark users:
I am using sparksql to insert some values into a directory; the SQL looks like this:

insert overwrite directory '/temp/test_spark'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
select regexp_replace('a~b~c', '~', ''), 123456

However, an exception is thrown:

Caused by: org.apache.hadoop.hive.serde2.SerDeException:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 4 elements
while columns.types has 2 elements!
at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:163)
at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:90)
at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:116)
at org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:119)
at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
... 8 more

The Hive version used is 2.0.1.

When I add an alias to regexp_replace, the SQL succeeds:

insert overwrite directory '/temp/test_spark'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
select regexp_replace('a~b~c', '~', '') as kv, 123456



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

sparksql exception when using regexp_replace

2018-10-10 Thread 付涛
Hi, Spark users:
 I am using sparksql to insert some values into a directory; the SQL looks
like this:
 
 insert overwrite directory '/temp/test_spark'
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
 select regexp_replace('a~b~c', '~', ''), 123456

 However, an exception is thrown:
 
 Caused by: org.apache.hadoop.hive.serde2.SerDeException:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 4 elements
while columns.types has 2 elements!
at
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:163)
at
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:90)
at
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:116)
at
org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:119)
at
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
at
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
... 8 more

   The Hive version used is 2.0.1.

   When I add an alias to regexp_replace, the SQL succeeds:
   
   insert overwrite directory '/temp/test_spark'
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
   select regexp_replace('a~b~c', '~', '') as kv, 123456
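
One plausible explanation for the failure is that, without the alias, Spark derives the output column name from the whole expression, and that generated name contains commas, which inflates the comma-separated columns list handed to LazySimpleSerDe (4 names vs. 2 types, exactly as the exception says). A small sketch to inspect the generated names follows; the exact rendering of the generated name may differ by version.

// Sketch only: look at the auto-generated column names that end up in the SerDe properties.
spark.sql("SELECT regexp_replace('a~b~c', '~', ''), 123456").columns
// e.g. Array("regexp_replace(a~b~c, ~, )", "123456") -- the commas split it into extra "columns"

// With aliases the writer sees clean names:
spark.sql("SELECT regexp_replace('a~b~c', '~', '') AS kv, 123456 AS n").columns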



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/




Re: [SparkSQL] Count Distinct issue

2018-09-17 Thread kathleen li
Hi,
I can't reproduce your issue:

scala> spark.sql("select distinct * from dfv").show()
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+---+
|   a|   b|   c|   d|   e|   f|   g|   h|   i|   j|   k|   l|   m|   n|   o|  p|
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+---+
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  9|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null| 13|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  2|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  7|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  8|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  3|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  5|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null| 15|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null| 12|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null| 16|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null| 14|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  4|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  6|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null| 10|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null| 11|
|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|  1|
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+---+


scala> spark.sql("select count(distinct *) from dfv").show()
+--------------------------------------------------------------+
|count(DISTINCT a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p)|
+--------------------------------------------------------------+
|                                                            16|
+--------------------------------------------------------------+
Kathleen

On Fri, Sep 14, 2018 at 11:54 AM Daniele Foroni 
wrote:

> Hi all,
>
> I am having some troubles in doing a count distinct over multiple columns.
> This is an example of my data:
> ++++---+
> |a   |b   |c   |d  |
> ++++---+
> |null|null|null|1  |
> |null|null|null|2  |
> |null|null|null|3  |
> |null|null|null|4  |
> |null|null|null|5  |
> |null|null|null|6  |
> |null|null|null|7  |
> ++++---+
> And my code:
> val df: Dataset[Row] = …
> val cols: List[Column] = df.columns.map(col).toList
> df.agg(countDistinct(cols.head, cols.tail: _*))
>
> So, in the example above, if I count the distinct “rows” I obtain 7 as
> result as expected (since the “d" column changes for every row).
> However, with more columns (16) in EXACTLY the same situation (one
> incremental column and 15 columns filled with nulls) the result is 0.
>
> I don’t understand why I am experiencing this problem.
> Any solution?
>
> Thanks,
> ---
> Daniele
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Is there any open source framework that converts Cypher to SparkSQL?

2018-09-16 Thread Matei Zaharia
GraphFrames (https://graphframes.github.io) offers a Cypher-like syntax that 
then executes on Spark SQL.
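
For reference, the Cypher-like part of GraphFrames is its motif-finding API, whose patterns compile down to Spark SQL joins. A minimal sketch, assuming the graphframes package is on the classpath and that vertices (with an "id" column) and edges (with "src"/"dst" columns) are existing DataFrames; the "relationship" edge attribute is hypothetical.

import org.graphframes.GraphFrame

// Sketch only: a Cypher-style pattern expressed as a GraphFrames motif.
val g = GraphFrame(vertices, edges)
val paths = g.find("(a)-[e]->(b)")
  .filter("e.relationship = 'follows'")
paths.show()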

> On Sep 14, 2018, at 2:42 AM, kant kodali  wrote:
> 
> Hi All,
> 
> Is there any open source framework that converts Cypher to SparkSQL?
> 
> Thanks!





[SparkSQL] Count Distinct issue

2018-09-14 Thread Daniele Foroni
Hi all,

I am having some trouble doing a count distinct over multiple columns.
This is an example of my data:
++++---+
|a   |b   |c   |d  |
++++---+
|null|null|null|1  |
|null|null|null|2  |
|null|null|null|3  |
|null|null|null|4  |
|null|null|null|5  |
|null|null|null|6  |
|null|null|null|7  |
++++---+
And my code:
val df: Dataset[Row] = …
val cols: List[Column] = df.columns.map(col).toList
df.agg(countDistinct(cols.head, cols.tail: _*))

So, in the example above, if I count the distinct “rows” I obtain 7 as the result,
as expected (since the “d” column changes for every row).
However, with more columns (16) in EXACTLY the same situation (one incremental 
column and 15 columns filled with nulls) the result is 0.

I don’t understand why I am experiencing this problem.
Any solution?

Thanks,
---
Daniele
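
If the goal is simply the number of distinct rows, df.distinct().count() sidesteps count(DISTINCT ...) and its null handling altogether, which can help narrow down whether the aggregate or the data is the problem. A minimal sketch:

// Sketch only: count distinct rows (NULLs compare equal inside distinct()).
val nDistinct = df.distinct().count()

// Or restrict to a subset of columns first (names taken from the small example above):
val nDistinctSubset = df.select("a", "b", "c", "d").distinct().count()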





Is there any open source framework that converts Cypher to SparkSQL?

2018-09-14 Thread kant kodali
Hi All,

Is there any open source framework that converts Cypher to SparkSQL?

Thanks!


Re: Where can I read the Kafka offsets in SparkSQL application

2018-07-24 Thread Gourav Sengupta
Hi,


can you see whether using the checkpointLocation option would work in
case you are using Structured Streaming?

Regards,
Gourav Sengupta

On Tue, Jul 24, 2018 at 12:30 PM, John, Vishal (Agoda) <
vishal.j...@agoda.com.invalid> wrote:

>
> Hello all,
>
>
> I have to read data from Kafka topic at regular intervals. I create the
> dataframe as shown below.  I don’t want to start reading from the beginning
> on each run. At the same time, I don’t want to miss the messages between
> run intervals.
>
> val queryDf = sqlContext
>   .read
>   .format("kafka")
>   .option("kafka.bootstrap.servers", hosts)
>   .option("enable.auto.commit", true)
>   .option("subscribe", topicName)
>   .option("auto.commit.interval.ms", 1000)
>   .option("startingOffsets", " latest")  //??  earliest OR latest
>   .load()
>   .selectExpr("CAST(value AS STRING) as message")
>
> I would like to understand where the offsets will be stored, so that I can
> supply it each time the application starts. Or is there a way to supply a
> custom location where to store the offsets.
> This is not a Steaming application. So, I am not sure if checkpoint
> directory is valid in this case.
>
> Any pointers would be highly helpful.
>
>
> thanks,
> Vishal
>


Where can I read the Kafka offsets in SparkSQL application

2018-07-24 Thread John, Vishal (Agoda)

Hello all,


I have to read data from a Kafka topic at regular intervals. I create the
dataframe as shown below. I don’t want to start reading from the beginning on
each run. At the same time, I don’t want to miss any messages between run
intervals.

val queryDf = sqlContext
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", hosts)
  .option("enable.auto.commit", true)
  .option("subscribe", topicName)
  .option("auto.commit.interval.ms", 1000)
  .option("startingOffsets", " latest")  //??  earliest OR latest
  .load()
  .selectExpr("CAST(value AS STRING) as message")

I would like to understand where the offsets will be stored, so that I can 
supply it each time the application starts. Or is there a way to supply a 
custom location where to store the offsets.
This is not a Streaming application, so I am not sure if a checkpoint directory
is valid in this case.

Any pointers would be highly helpful.


thanks,
Vishal
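
For a plain batch read, the Kafka source takes explicit startingOffsets/endingOffsets as JSON and does not track progress between runs for you, so a common pattern is to persist the last offsets yourself and feed them back in on the next run. A minimal sketch; the offsets file location is hypothetical, and serialising the next offsets back to JSON is left as a comment.

import org.apache.spark.sql.functions.{col, max}

// Offsets JSON saved by the previous run, e.g. {"myTopic":{"0":42,"1":17}};
// fall back to "earliest" on the very first run.
val startingOffsets = scala.util.Try(
  scala.io.Source.fromFile("/tmp/kafka-offsets.json").mkString).getOrElse("earliest")

val raw = sqlContext.read
  .format("kafka")
  .option("kafka.bootstrap.servers", hosts)
  .option("subscribe", topicName)
  .option("startingOffsets", startingOffsets)
  .option("endingOffsets", "latest")
  .load()

val queryDf = raw.selectExpr("CAST(value AS STRING) as message")

// Record where this run stopped: max offset per partition, +1 so the next run resumes after it,
// then serialise it into the JSON shape above and overwrite /tmp/kafka-offsets.json.
val nextOffsets = raw.groupBy(col("topic"), col("partition"))
  .agg((max(col("offset")) + 1).as("nextOffset"))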




Error on fetching mass data from Cassandra using SparkSQL

2018-05-28 Thread Soheil Pourbafrani
I tried to fetch some data from Cassandra using SparkSql. For small tables,
all goes well, but when trying to fetch data from big tables I get the
following error:

java.lang.NoSuchMethodError:
com.datastax.driver.core.ResultSet.fetchMoreResults()Lshade/com/datastax/spark/connector/google/common/util/concurrent/ListenableFuture;

I tried many versions of the Spark Cassandra connector (2.0.5, 2.0.8, 2.3.0) and
even the unshaded version of it, but no difference!

In my project, I have the Flink-cassandra-connector-1.4.2 Maven dependency, and
it uses the Cassandra DataStax core driver, too. I removed that dependency, but
no difference!

I also read this
<https://stackoverflow.com/questions/39034538/what-happens-nosuchmethoderror-com-datastax-driver-core-resultset-fetchmorere/40539460#40539460>
post
in StackOverflow but their solution didn't work for me.

My Cassandra version is 3.11 and I use Spark 2.2.1 in local mode.

How can I solve the problem?


Re: How to use disk instead of just InMemoryRelation when using the JDBC datasource in SparkSQL?

2018-04-12 Thread Takeshi Yamamuro
You want to use `Dataset.persist(StorageLevel.MEMORY_AND_DISK)`?
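
A sketch of that suggestion, combined with a partitioned JDBC read, since the plan above shows numPartitions=1 and a single task pulling the whole table is itself a common OOM source. The URL, partition column and bounds below are assumptions to adapt.

import org.apache.spark.storage.StorageLevel

// Sketch only: split the JDBC read across tasks and let cached partitions spill to disk.
val lineitem = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/tpch")   // hypothetical connection URL
  .option("dbtable", "lineitem")
  .option("partitionColumn", "L_ORDERKEY")          // assumed roughly uniform numeric key
  .option("lowerBound", "1")
  .option("upperBound", "6000000")                  // assumed key range
  .option("numPartitions", "48")
  .load()
  .persist(StorageLevel.MEMORY_AND_DISK)            // partitions that don't fit in memory go to disk

lineitem.createOrReplaceTempView("lineitem")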

On Thu, Apr 12, 2018 at 1:12 PM, Louis Hust <louis.h...@gmail.com> wrote:

> We  want to extract data from mysql, and calculate in sparksql.
> The sql explain like below.
>
>
> REGIONKEY#177,N_COMMENT#178] PushedFilters: [], ReadSchema:
> struct<N_NATIONKEY:int,N_NAME:string,N_REGIONKEY:int,N_COMMENT:string>
>   +- *(20) Sort [r_regionkey#203 ASC NULLS FIRST], false, 0
>  +- Exchange(coordinator id: 266374831)
> hashpartitioning(r_regionkey#203, 200), coordinator[target post-shuffle
> partition size: 67108864]
> +- *(19) Project [R_REGIONKEY#203]
>+- *(19) Filter ((isnotnull(r_name#204) &&
> (r_name#204 = AFRICA)) && isnotnull(r_regionkey#203))
>   +- InMemoryTableScan [R_REGIONKEY#203,
> r_name#204], [isnotnull(r_name#204), (r_name#204 = AFRICA),
> isnotnull(r_regionkey#203)]
> +- InMemoryRelation [R_REGIONKEY#203,
> R_NAME#204, R_COMMENT#205], true, 1, StorageLevel(disk, memory, 1
> replicas)
>   +- *(1) Scan
> JDBCRelation(region) [numPartitions=1] 
> [R_REGIONKEY#203,R_NAME#204,R_COMMENT#205]
> PushedFilters: [], ReadSchema: struct<R_REGIONKEY:int,R_NAME:
> string,R_COMMENT:string>
>
>
> As you see, all JDBCRelation convert to InMemoryRelation. Cause the JDBC
> table is so big, the all data can not be filled into memory,  OOM occurs.
> If there is some option to make SparkSQL use Disk if memory not enough?
>



-- 
---
Takeshi Yamamuro


How to use disk instead of just InMemoryRelation when using the JDBC datasource in SparkSQL?

2018-04-11 Thread Louis Hust
We want to extract data from MySQL and do the calculation in sparksql.
The SQL explain output looks like below.


REGIONKEY#177,N_COMMENT#178] PushedFilters: [], ReadSchema:
struct<N_NATIONKEY:int,N_NAME:string,N_REGIONKEY:int,N_COMMENT:string>
  +- *(20) Sort [r_regionkey#203 ASC NULLS FIRST], false, 0
 +- Exchange(coordinator id: 266374831)
hashpartitioning(r_regionkey#203, 200), coordinator[target post-shuffle
partition size: 67108864]
+- *(19) Project [R_REGIONKEY#203]
   +- *(19) Filter ((isnotnull(r_name#204) &&
(r_name#204 = AFRICA)) && isnotnull(r_regionkey#203))
  +- InMemoryTableScan [R_REGIONKEY#203,
r_name#204], [isnotnull(r_name#204), (r_name#204 = AFRICA),
isnotnull(r_regionkey#203)]
+- InMemoryRelation [R_REGIONKEY#203,
R_NAME#204, R_COMMENT#205], true, 1, StorageLevel(disk, memory, 1
replicas)
  +- *(1) Scan JDBCRelation(region)
[numPartitions=1] [R_REGIONKEY#203,R_NAME#204,R_COMMENT#205] PushedFilters:
[], ReadSchema: struct<R_REGIONKEY:int,R_NAME:string,R_COMMENT:string>


As you can see, every JDBCRelation is converted to an InMemoryRelation. Because the JDBC
table is so big, all the data cannot fit into memory and an OOM occurs.
Is there some option to make SparkSQL spill to disk if memory is not enough?


How to use disk instead of just InMemoryRelation when using the JDBC datasource in SparkSQL?

2018-04-10 Thread Louis Hust
We want to extract data from MySQL and do the calculation in sparksql.
The SQL explain output looks like below.

== Parsed Logical Plan ==
> 'Sort ['revenue DESC NULLS LAST], true
> +- 'Aggregate ['n_name], ['n_name, 'SUM(('l_extendedprice * (1 -
> 'l_discount))) AS revenue#329]
>+- 'Filter ('c_custkey = 'o_custkey) && ('l_orderkey =
> 'o_orderkey)) && ('l_suppkey = 's_suppkey)) && (('c_nationkey =
> 's_nationkey) && ('s_nationkey = 'n_nationkey))) && ((('n_regionkey =
> 'r_regionkey) && ('r_name = AFRICA)) && (('o_orderdate >= 1993-01-01) &&
> ('o_orderdate < 1994-01-01
>   +- 'Join Inner
>  :- 'Join Inner
>  :  :- 'Join Inner
>  :  :  :- 'Join Inner
>  :  :  :  :- 'Join Inner
>  :  :  :  :  :- 'UnresolvedRelation `customer`
>  :  :  :  :  +- 'UnresolvedRelation `orders`
>  :  :  :  +- 'UnresolvedRelation `lineitem`
>  :  :  +- 'UnresolvedRelation `supplier`
>  :  +- 'UnresolvedRelation `nation`
>  +- 'UnresolvedRelation `region`
> == Analyzed Logical Plan ==
> n_name: string, revenue: decimal(38,4)
> Sort [revenue#329 DESC NULLS LAST], true
> +- Aggregate [n_name#176], [n_name#176,
> sum(CheckOverflow((promote_precision(cast(l_extendedprice#68 as
> decimal(16,2))) *
> promote_precision(cast(CheckOverflow((promote_precision(cast(cast(1 as
> decimal(1,0)) as decimal(16,2))) - promote_precision(cast(l_discount#69 as
> decimal(16,2, DecimalType(16,2)) as decimal(16,2,
> DecimalType(32,4))) AS revenue#329]
>+- Filter (c_custkey#273 = o_custkey#1) && (l_orderkey#63 =
> o_orderkey#0)) && (l_suppkey#65 = s_suppkey#224)) && ((c_nationkey#276 =
> s_nationkey#227) && (s_nationkey#227 = n_nationkey#175))) &&
> (((n_regionkey#177 = r_regionkey#203) && (r_name#204 = AFRICA)) &&
> ((cast(o_orderdate#4 as string) >= 1993-01-01) && (cast(o_orderdate#4 as
> string) < 1994-01-01
>   +- Join Inner
>  :- Join Inner
>  :  :- Join Inner
>  :  :  :- Join Inner
>  :  :  :  :- Join Inner
>  :  :  :  :  :- SubqueryAlias customer
>  :  :  :  :  :  +-
> Relation[C_CUSTKEY#273,C_NAME#274,C_ADDRESS#275,C_NATIONKEY#276,C_PHONE#277,C_ACCTBAL#278,C_MKTSEGMENT#279,C_COMMENT#280]
> JDBCRelation(customer) [numPartitions=1]
>  :  :  :  :  +- SubqueryAlias orders
>  :  :  :  : +-
> Relation[O_ORDERKEY#0,O_CUSTKEY#1,O_ORDERSTATUS#2,O_TOTALPRICE#3,O_ORDERDATE#4,O_ORDERPRIORITY#5,O_CLERK#6,O_SHIPPRIORITY#7,O_COMMENT#8]
> JDBCRelation(orders) [numPartitions=1]
>  :  :  :  +- SubqueryAlias lineitem
>  :  :  : +-
> Relation[L_ORDERKEY#63,L_PARTKEY#64,L_SUPPKEY#65,L_LINENUMBER#66,L_QUANTITY#67,L_EXTENDEDPRICE#68,L_DISCOUNT#69,L_TAX#70,L_RETURNFLAG#71,L_LINESTATUS#72,L_SHIPDATE#73,L_COMMITDATE#74,L_RECEIPTDATE#75,L_SHIPINSTRUCT#76,L_SHIPMODE#77,L_COMMENT#78]
> JDBCRelation(lineitem) [numPartitions=1]
>  :  :  +- SubqueryAlias supplier
>  :  : +-
> Relation[S_SUPPKEY#224,S_NAME#225,S_ADDRESS#226,S_NATIONKEY#227,S_PHONE#228,S_ACCTBAL#229,S_COMMENT#230]
> JDBCRelation(supplier) [numPartitions=1]
>  :  +- SubqueryAlias nation
>  : +-
> Relation[N_NATIONKEY#175,N_NAME#176,N_REGIONKEY#177,N_COMMENT#178]
> JDBCRelation(nation) [numPartitions=1]
>  +- SubqueryAlias region
> +- Relation[R_REGIONKEY#203,R_NAME#204,R_COMMENT#205]
> JDBCRelation(region) [numPartitions=1]
> == Optimized Logical Plan ==
> Sort [revenue#329 DESC NULLS LAST], true
> +- Aggregate [n_name#176], [n_name#176,
> sum(CheckOverflow((promote_precision(cast(l_extendedprice#68 as
> decimal(16,2))) * promote_precision(CheckOverflow((1.00 -
> promote_precision(cast(l_discount#69 as decimal(16,2,
> DecimalType(16,2, DecimalType(32,4))) AS revenue#329]
>+- Project [L_EXTENDEDPRICE#68, L_DISCOUNT#69, N_NAME#176]
>   +- Join Inner, (n_regionkey#177 = r_regionkey#203)
>  :- Project [L_EXTENDEDPRICE#68, L_DISCOUNT#69, N_NAME#176,
> N_REGIONKEY#177]
>  :  +- Join Inner, (s_nationkey#227 = n_nationkey#175)
>  : :- Project [L_EXTENDEDPRICE#68, L_DISCOUNT#69,
> S_NATIONKEY#227]
>  : :  +- Join Inner, ((l_suppkey#65 = s_suppkey#224) &&
> (c_nationkey#276 = s_nationkey#227))
>  : : :- Project [C_NATIONKEY#276, L_SUPPKEY#65,
> L_EXTENDEDPRICE#68, L_DISCOUNT#69]
>  : : :  +- Join Inner, (l_orderkey#63 = o_orderkey#0)
>  : : : :- Project [C_NATIONKEY#276, O_ORDERKEY#0]
>  : : : :  +- Join Inner, (c_custkey#273 = o_custkey#1)
>  : 

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-31 Thread Tin Vu
Hi Gaurav,

Thank you for your response. Here are the answers to your questions:
1. Spark 2.3.0
2. I was using the 'spark-sql' command, for example: 'spark-sql --master
spark:/*:7077 --database tpcds_bin_partitioned_orc_100 -f $file_name', where
file_name is the file that contains the SQL script ("select * from table_name").
3. Hadoop 2.9.0

I am using the JDBC connector to Drill from the Hive Metastore. SparkSQL is also
connecting to the ORC database through Hive.

Thanks so much!

Tin

On Sat, Mar 31, 2018 at 11:41 AM, Gourav Sengupta <gourav.sengu...@gmail.com
> wrote:

> Hi Tin,
>
> This sounds interesting. While I would prefer to think that Presto and
> Drill have
>
> can you please provide the following details:
> 1. SPARK version
> 2. The exact code used in SPARK (the full code that was used)
> 3. HADOOP version
>
> I do think that SPARK and DRILL have complementary and different used
> cases. Have you tried using JDBC connector to Drill from within SPARKSQL?
>
> Regards,
> Gourav Sengupta
>
>
> On Thu, Mar 29, 2018 at 1:03 AM, Tin Vu <tvu...@ucr.edu> wrote:
>
>> Hi,
>>
>> I am executing a benchmark to compare performance of SparkSQL, Apache
>> Drill and Presto. My experimental setup:
>>
>>- TPCDS dataset with scale factor 100 (size 100GB).
>>- Spark, Drill, Presto have a same number of workers: 12.
>>- Each worked has same allocated amount of memory: 4GB.
>>- Data is stored by Hive with ORC format.
>>
>> I executed a very simple SQL query: "SELECT * from table_name"
>> The issue is that for some small size tables (even table with few dozen
>> of records), SparkSQL still required about 7-8 seconds to finish, while
>> Drill and Presto only needed less than 1 second.
>> For other large tables with billions records, SparkSQL performance was
>> reasonable when it required 20-30 seconds to scan the whole table.
>> Do you have any idea or reasonable explanation for this issue?
>>
>> Thanks,
>>
>>
>


Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-31 Thread Gourav Sengupta
Hi Tin,

This sounds interesting. While I would prefer to think that Presto and
Drill have

can you please provide the following details:
1. SPARK version
2. The exact code used in SPARK (the full code that was used)
3. HADOOP version

I do think that SPARK and DRILL have complementary and different use
cases. Have you tried using the JDBC connector to Drill from within SPARKSQL?

Regards,
Gourav Sengupta


On Thu, Mar 29, 2018 at 1:03 AM, Tin Vu <tvu...@ucr.edu> wrote:

> Hi,
>
> I am executing a benchmark to compare performance of SparkSQL, Apache
> Drill and Presto. My experimental setup:
>
>- TPCDS dataset with scale factor 100 (size 100GB).
>- Spark, Drill, Presto have a same number of workers: 12.
>- Each worked has same allocated amount of memory: 4GB.
>- Data is stored by Hive with ORC format.
>
> I executed a very simple SQL query: "SELECT * from table_name"
> The issue is that for some small size tables (even table with few dozen of
> records), SparkSQL still required about 7-8 seconds to finish, while Drill
> and Presto only needed less than 1 second.
> For other large tables with billions records, SparkSQL performance was
> reasonable when it required 20-30 seconds to scan the whole table.
> Do you have any idea or reasonable explanation for this issue?
>
> Thanks,
>
>


Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Lalwani, Jayesh
It depends on how you have loaded the data. Ideally, if you have dozens of 
records, your input data should have them in one partition. If the input has 1 
partition, and the data is small enough, Spark will keep it in one partition (as 
far as possible).

If you cannot control your data, you need to repartition the data when you load 
it. This will (eventually) cause a shuffle, and all the data will be moved into 
the number of partitions that you specify. Subsequent operations will run on the 
repartitioned dataframe and should use that number of tasks. A shuffle has costs 
associated with it. You will need to make a call whether you want to take the 
upfront cost of a shuffle, or you want to live with a large number of tasks.
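
For example, a minimal spark-shell sketch of the trade-off (table and database 
names are only illustrative, borrowed from the TPCDS setup in this thread):

// shrink the partition count of a small table so a full scan runs as a
// handful of tasks instead of hundreds
val small = spark.table("tpcds_bin_partitioned_orc_100.call_center")
println(small.rdd.getNumPartitions)   // often far larger than the data warrants

// coalesce avoids a shuffle (narrow dependency); repartition(n) forces one
val compact = small.coalesce(1)
compact.createOrReplaceTempView("call_center_compact")

spark.sql("SELECT * FROM call_center_compact").show()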

From: Tin Vu <tvu...@ucr.edu>
Date: Thursday, March 29, 2018 at 10:47 AM
To: "Lalwani, Jayesh" <jayesh.lalw...@capitalone.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low 
when compared to Drill or Presto

 You are right. There are too many tasks being created. How can we reduce the 
number of tasks?

On Thu, Mar 29, 2018, 7:44 AM Lalwani, Jayesh 
<jayesh.lalw...@capitalone.com<mailto:jayesh.lalw...@capitalone.com>> wrote:
Without knowing too many details, I can only guess. It could be that Spark is 
creating a lot of tasks even though there are less records. Creation and 
distribution of tasks has a noticeable overhead on smaller datasets.

You might want to look at the driver logs, or the Spark Application Detail UI.

From: Tin Vu <tvu...@ucr.edu<mailto:tvu...@ucr.edu>>
Date: Wednesday, March 28, 2018 at 8:04 PM
To: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when 
compared to Drill or Presto

Hi,

I am executing a benchmark to compare performance of SparkSQL, Apache Drill and 
Presto. My experimental setup:
• TPCDS dataset with scale factor 100 (size 100GB).
• Spark, Drill, Presto have a same number of workers: 12.
• Each worked has same allocated amount of memory: 4GB.
• Data is stored by Hive with ORC format.

I executed a very simple SQL query: "SELECT * from table_name"
The issue is that for some small size tables (even table with few dozen of 
records), SparkSQL still required about 7-8 seconds to finish, while Drill and 
Presto only needed less than 1 second.
For other large tables with billions records, SparkSQL performance was 
reasonable when it required 20-30 seconds to scan the whole table.
Do you have any idea or reasonable explanation for this issue?

Thanks,




The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates and may only be used solely in performance of 
work or services for Capital One. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed. If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.




Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Tin Vu
 You are right. There are too many tasks being created. How can we reduce the
number of tasks?

On Thu, Mar 29, 2018, 7:44 AM Lalwani, Jayesh <jayesh.lalw...@capitalone.com>
wrote:

> Without knowing too many details, I can only guess. It could be that Spark
> is creating a lot of tasks even though there are less records. Creation and
> distribution of tasks has a noticeable overhead on smaller datasets.
>
>
>
> You might want to look at the driver logs, or the Spark Application Detail
> UI.
>
>
>
> *From: *Tin Vu <tvu...@ucr.edu>
> *Date: *Wednesday, March 28, 2018 at 8:04 PM
> *To: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *[SparkSQL] SparkSQL performance on small TPCDS tables is very
> low when compared to Drill or Presto
>
>
>
> Hi,
>
>
>
> I am executing a benchmark to compare performance of SparkSQL, Apache
> Drill and Presto. My experimental setup:
>
> · TPCDS dataset with scale factor 100 (size 100GB).
>
> · Spark, Drill, Presto have a same number of workers: 12.
>
> · Each worked has same allocated amount of memory: 4GB.
>
> · Data is stored by Hive with ORC format.
>
> I executed a very simple SQL query: "SELECT * from table_name"
> The issue is that for some small size tables (even table with few dozen of
> records), SparkSQL still required about 7-8 seconds to finish, while Drill
> and Presto only needed less than 1 second.
> For other large tables with billions records, SparkSQL performance was
> reasonable when it required 20-30 seconds to scan the whole table.
> Do you have any idea or reasonable explanation for this issue?
>
> Thanks,
>
>
>
> --
>
> The information contained in this e-mail is confidential and/or
> proprietary to Capital One and/or its affiliates and may only be used
> solely in performance of work or services for Capital One. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed. If the reader of this message is not the intended
> recipient, you are hereby notified that any review, retransmission,
> dissemination, distribution, copying or other use of, or taking of any
> action in reliance upon this information is strictly prohibited. If you
> have received this communication in error, please contact the sender and
> delete the material from your computer.
>


Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Lalwani, Jayesh
Without knowing too many details, I can only guess. It could be that Spark is 
creating a lot of tasks even though there are few records. Creation and 
distribution of tasks has a noticeable overhead on smaller datasets.

You might want to look at the driver logs, or the Spark Application Detail UI.

From: Tin Vu <tvu...@ucr.edu>
Date: Wednesday, March 28, 2018 at 8:04 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when 
compared to Drill or Presto

Hi,

I am executing a benchmark to compare performance of SparkSQL, Apache Drill and 
Presto. My experimental setup:
· TPCDS dataset with scale factor 100 (size 100GB).
· Spark, Drill, Presto have a same number of workers: 12.
· Each worked has same allocated amount of memory: 4GB.
· Data is stored by Hive with ORC format.

I executed a very simple SQL query: "SELECT * from table_name"
The issue is that for some small size tables (even table with few dozen of 
records), SparkSQL still required about 7-8 seconds to finish, while Drill and 
Presto only needed less than 1 second.
For other large tables with billions records, SparkSQL performance was 
reasonable when it required 20-30 seconds to scan the whole table.
Do you have any idea or reasonable explanation for this issue?

Thanks,



The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates and may only be used solely in performance of 
work or services for Capital One. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed. If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Tin Vu
Thanks for your response.  What do you mean when you said "immediately
return"?

On Wed, Mar 28, 2018, 10:33 PM Jörn Franke <jornfra...@gmail.com> wrote:

> I don’t think select * is a good benchmark. You should do a more complex
> operation, otherwise the optimizer might see that you don’t do anything in the
> query and immediately return (similarly, count might return immediately by
> using some statistics).
>
> On 29. Mar 2018, at 02:03, Tin Vu <tvu...@ucr.edu> wrote:
>
> Hi,
>
> I am executing a benchmark to compare performance of SparkSQL, Apache
> Drill and Presto. My experimental setup:
>
>- TPCDS dataset with scale factor 100 (size 100GB).
>- Spark, Drill, Presto have a same number of workers: 12.
>- Each worked has same allocated amount of memory: 4GB.
>- Data is stored by Hive with ORC format.
>
> I executed a very simple SQL query: "SELECT * from table_name"
> The issue is that for some small size tables (even table with few dozen of
> records), SparkSQL still required about 7-8 seconds to finish, while Drill
> and Presto only needed less than 1 second.
> For other large tables with billions records, SparkSQL performance was
> reasonable when it required 20-30 seconds to scan the whole table.
> Do you have any idea or reasonable explanation for this issue?
>
> Thanks,
>
>


Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Jörn Franke
I don’t think select * is a good benchmark. You should do a more complex 
operation, otherwise the optimizer might see that you don’t do anything in the 
query and immediately return (similarly, count might return immediately by using 
some statistics).
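
For example, a probe along these lines forces real scan, shuffle and aggregation 
work (the store_sales columns are from the standard TPCDS schema and are only 
illustrative):

// an aggregation with a filter and a sort cannot be answered from metadata alone
val q = spark.sql("""
  SELECT ss_store_sk, count(*) AS cnt, sum(ss_net_paid) AS revenue
  FROM store_sales
  WHERE ss_quantity > 10
  GROUP BY ss_store_sk
  ORDER BY revenue DESC
""")
q.show(20)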

> On 29. Mar 2018, at 02:03, Tin Vu <tvu...@ucr.edu> wrote:
> 
> Hi,
> 
> I am executing a benchmark to compare performance of SparkSQL, Apache Drill 
> and Presto. My experimental setup:
> TPCDS dataset with scale factor 100 (size 100GB).
> Spark, Drill, Presto have a same number of workers: 12.
> Each worked has same allocated amount of memory: 4GB.
> Data is stored by Hive with ORC format.
> I executed a very simple SQL query: "SELECT * from table_name"
> The issue is that for some small size tables (even table with few dozen of 
> records), SparkSQL still required about 7-8 seconds to finish, while Drill 
> and Presto only needed less than 1 second.
> For other large tables with billions records, SparkSQL performance was 
> reasonable when it required 20-30 seconds to scan the whole table.
> Do you have any idea or reasonable explanation for this issue?
> Thanks,
> 


[SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Tin Vu
Hi,

I am executing a benchmark to compare performance of SparkSQL, Apache Drill
and Presto. My experimental setup:

   - TPCDS dataset with scale factor 100 (size 100GB).
   - Spark, Drill, and Presto have the same number of workers: 12.
   - Each worker has the same allocated amount of memory: 4GB.
   - Data is stored by Hive with ORC format.

I executed a very simple SQL query: "SELECT * from table_name"
The issue is that for some small tables (even tables with a few dozen
records), SparkSQL still required about 7-8 seconds to finish, while Drill
and Presto needed less than 1 second.
For other large tables with billions of records, SparkSQL performance was
reasonable, requiring 20-30 seconds to scan the whole table.
Do you have any idea or reasonable explanation for this issue?

Thanks,


Why SparkSQL changes the table owner when performing alter table opertations?

2018-03-12 Thread 张万新
Hi,

When using spark.sql() to perform alter table operations, I found that Spark
changes the table owner property to the execution user. I then dug into
the source code and found that in HiveClientImpl, the alterTable function
sets the owner of the table to the current execution user. Besides, some
other operations like alter partition, getPartitionOption and so on do the
same. Can some experts explain why we should do this? What if it just behaved
like Hive, which does not change the owner when doing these kinds of
operations?


AM restart in a other node make SparkSQL job into a state of feign death

2017-12-20 Thread Bang Xiao
I run "spark-sql  --master yarn --deploy-mode client -f 'SQLs' " in shell, 
The application  is stuck when the AM is down and restart in other nodes. It
seems the driver wait for the next sql. Is this a bug?In my opinion,Either
the application execute the failed sql or exit with a failure when the AM
restart。



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



AM restart in a other node makes SparkSQL jobs into a state of feign death

2017-12-20 Thread Bang Xiao
I run "spark-sql  --master yarn --deploy-mode client -f 'SQLs' " in shell, 
The application  is stuck when the AM is down and restart in other nodes. It
seems the driver wait for the next sql. Is this a bug?In my opinion,Either
the application execute the failed sql or exit with a failure when the AM
restart。



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SparkSQL not support CharType

2017-11-23 Thread Jörn Franke
Or ByteType, depending on the use case.

> On 23. Nov 2017, at 10:18, Herman van Hövell tot Westerflier 
>  wrote:
> 
> You need to use a StringType. The CharType and VarCharType are there to 
> ensure compatibility with Hive and ORC; they should not be used anywhere else.
> 
>> On Thu, Nov 23, 2017 at 4:09 AM, 163  wrote:
>> Hi,
>>  when I use Dataframe with table schema, It goes wrong:
>> 
>> val test_schema = StructType(Array(
>>   StructField("id", IntegerType, false),
>>   StructField("flag", CharType(1), false),
>>   StructField("time", DateType, false)));
>> 
>> val df = spark.read.format("com.databricks.spark.csv")
>>   .schema(test_schema)
>>   .option("header", "false")
>>   .option("inferSchema", "false")
>>   .option("delimiter", ",")
>>   .load("file:///Users/name/b")
>> 
>> The log is below:
>> Exception in thread "main" scala.MatchError: CharType(1) (of class 
>> org.apache.spark.sql.types.CharType)
>>  at 
>> org.apache.spark.sql.catalyst.encoders.RowEncoder$.org$apache$spark$sql$catalyst$encoders$RowEncoder$$serializerFor(RowEncoder.scala:73)
>>  at 
>> org.apache.spark.sql.catalyst.encoders.RowEncoder$$anonfun$2.apply(RowEncoder.scala:158)
>>  at 
>> org.apache.spark.sql.catalyst.encoders.RowEncoder$$anonfun$2.apply(RowEncoder.scala:157)
>> 
>> Why? Is this a bug?
>> 
>>  But I found spark will translate char type to string when using create 
>> table command:
>> 
>>   create table test(flag char(1));
>>  desc test:flag string;
>> 
>> 
>> 
>> 
>> Regards
>> Wendy He
> 
> 
> 
> -- 
> Herman van Hövell
> Software Engineer
> Databricks Inc.
> hvanhov...@databricks.com
> +31 6 420 590 27
> databricks.com
> 
> 
> 
> 


Re: SparkSQL not support CharType

2017-11-23 Thread Herman van Hövell tot Westerflier
You need to use a StringType. The CharType and VarCharType are there to
ensure compatibility with Hive and ORC; they should not be used anywhere
else.
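
For instance, the schema from your mail works once CharType(1) is replaced by
StringType (a minimal sketch, reusing the paths and options from the original
code):

import org.apache.spark.sql.types._

val test_schema = StructType(Array(
  StructField("id", IntegerType, false),
  StructField("flag", StringType, false),   // StringType instead of CharType(1)
  StructField("time", DateType, false)))

val df = spark.read.format("com.databricks.spark.csv")   // or the built-in "csv" source in Spark 2.x
  .schema(test_schema)
  .option("header", "false")
  .option("inferSchema", "false")
  .option("delimiter", ",")
  .load("file:///Users/name/b")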

On Thu, Nov 23, 2017 at 4:09 AM, 163  wrote:

> Hi,
>  when I use Dataframe with table schema, It goes wrong:
>
> val test_schema = StructType(Array(
>
>   StructField("id", IntegerType, false),
>   StructField("flag", CharType(1), false),
>   StructField("time", DateType, false)));
>
> val df = spark.read.format("com.databricks.spark.csv")
>   .schema(test_schema)
>   .option("header", "false")
>   .option("inferSchema", "false")
>   .option("delimiter", ",")
>   .load("file:///Users/name/b")
>
>
> The log is below:
> Exception in thread "main" scala.MatchError: CharType(1) (of class
> org.apache.spark.sql.types.CharType)
> at org.apache.spark.sql.catalyst.encoders.RowEncoder$.org$
> apache$spark$sql$catalyst$encoders$RowEncoder$$serializerFor(RowEncoder.
> scala:73)
> at org.apache.spark.sql.catalyst.encoders.RowEncoder$$anonfun$
> 2.apply(RowEncoder.scala:158)
> at org.apache.spark.sql.catalyst.encoders.RowEncoder$$anonfun$
> 2.apply(RowEncoder.scala:157)
>
> Why? Is this a bug?
>
> But I found spark will translate char type to string when using create
> table command:
>
>   create table test(flag char(1));
>   desc test:flag string;
>
>
>
>
> Regards
> Wendy He
>



-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com






SparkSQL not support CharType

2017-11-22 Thread 163
Hi,
 when I use a DataFrame with the table schema below, it goes wrong:

val test_schema = StructType(Array(
  StructField("id", IntegerType, false),
  StructField("flag", CharType(1), false),
  StructField("time", DateType, false)));

val df = spark.read.format("com.databricks.spark.csv")
  .schema(test_schema)
  .option("header", "false")
  .option("inferSchema", "false")
  .option("delimiter", ",")
  .load("file:///Users/name/b")

The log is below:
Exception in thread "main" scala.MatchError: CharType(1) (of class 
org.apache.spark.sql.types.CharType)
at 
org.apache.spark.sql.catalyst.encoders.RowEncoder$.org$apache$spark$sql$catalyst$encoders$RowEncoder$$serializerFor(RowEncoder.scala:73)
at 
org.apache.spark.sql.catalyst.encoders.RowEncoder$$anonfun$2.apply(RowEncoder.scala:158)
at 
org.apache.spark.sql.catalyst.encoders.RowEncoder$$anonfun$2.apply(RowEncoder.scala:157)

Why? Is this a bug?

But I found that Spark will translate the char type to string when using the
create table command:

 create table test(flag char(1));
desc test:flag string;




Regards
Wendy He

Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-21 Thread Aakash Basu
Yes, I did the same. It's working. Thanks!

On 21-Nov-2017 4:04 PM, "Fernando Pereira" <ferdonl...@gmail.com> wrote:

> Did you consider do string processing to build the SQL expression which
> you can execute with spark.sql(...)?
> Some examples: https://spark.apache.org/docs/latest/sql-
> programming-guide.html#hive-tables
>
> Cheers
>
> On 21 November 2017 at 03:27, Aakash Basu <aakash.spark@gmail.com>
> wrote:
>
>> Hi all,
>>
>> Any help? PFB.
>>
>> Thanks,
>> Aakash.
>>
>> On 20-Nov-2017 6:58 PM, "Aakash Basu" <aakash.spark@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I have a table which will have 4 columns -
>>>
>>> |  Expression|filter_condition| from_clause|
>>> group_by_columns|
>>>
>>>
>>> This file may have variable number of rows depending on the no. of KPIs
>>> I need to calculate.
>>>
>>> I need to write a SparkSQL program which will have to read this file and
>>> run each line of queries dynamically by fetching each column value for a
>>> particular row and create a select query out of it and run inside a
>>> dataframe, later saving it as a temporary table.
>>>
>>> Did anyone do this kind of exercise? If yes, can I get some help on it
>>> pls?
>>>
>>> Thanks,
>>> Aakash.
>>>
>>
>


Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-21 Thread Fernando Pereira
Did you consider doing string processing to build the SQL expression, which you
can execute with spark.sql(...)?
Some examples:
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
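
A minimal sketch of that idea (assuming the four columns are named expression,
filter_condition, from_clause and group_by_columns, and that the driving table
is small enough to collect to the driver; all names here are illustrative):

// build one SELECT per KPI row and register each result as a temp view
val kpis = spark.table("kpi_definitions").collect()   // hypothetical name for the 4-column table

kpis.zipWithIndex.foreach { case (row, i) =>
  val expr    = row.getAs[String]("expression")
  val from    = row.getAs[String]("from_clause")
  val where   = row.getAs[String]("filter_condition")
  val groupBy = row.getAs[String]("group_by_columns")
  val sql     = s"SELECT $expr FROM $from WHERE $where GROUP BY $groupBy"
  spark.sql(sql).createOrReplaceTempView(s"kpi_$i")
}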

Cheers

On 21 November 2017 at 03:27, Aakash Basu <aakash.spark@gmail.com>
wrote:

> Hi all,
>
> Any help? PFB.
>
> Thanks,
> Aakash.
>
> On 20-Nov-2017 6:58 PM, "Aakash Basu" <aakash.spark@gmail.com> wrote:
>
>> Hi all,
>>
>> I have a table which will have 4 columns -
>>
>> |  Expression|filter_condition| from_clause|
>> group_by_columns|
>>
>>
>> This file may have variable number of rows depending on the no. of KPIs I
>> need to calculate.
>>
>> I need to write a SparkSQL program which will have to read this file and
>> run each line of queries dynamically by fetching each column value for a
>> particular row and create a select query out of it and run inside a
>> dataframe, later saving it as a temporary table.
>>
>> Did anyone do this kind of exercise? If yes, can I get some help on it
>> pls?
>>
>> Thanks,
>> Aakash.
>>
>


Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-20 Thread Aakash Basu
Hi all,

Any help? PFB.

Thanks,
Aakash.

On 20-Nov-2017 6:58 PM, "Aakash Basu" <aakash.spark@gmail.com> wrote:

> Hi all,
>
> I have a table which will have 4 columns -
>
> |  Expression|filter_condition| from_clause|
> group_by_columns|
>
>
> This file may have variable number of rows depending on the no. of KPIs I
> need to calculate.
>
> I need to write a SparkSQL program which will have to read this file and
> run each line of queries dynamically by fetching each column value for a
> particular row and create a select query out of it and run inside a
> dataframe, later saving it as a temporary table.
>
> Did anyone do this kind of exercise? If yes, can I get some help on it pls?
>
> Thanks,
> Aakash.
>


Dynamic data ingestion into SparkSQL - Interesting question

2017-11-20 Thread Aakash Basu
Hi all,

I have a table which will have 4 columns -

|  Expression|filter_condition| from_clause|
group_by_columns|


This file may have a variable number of rows, depending on the number of KPIs I
need to calculate.

I need to write a SparkSQL program which reads this file and runs each line of
queries dynamically: fetch each column value for a particular row, build a
select query out of it, run it as a dataframe, and save the result as a
temporary table.

Has anyone done this kind of exercise? If yes, can I get some help on it, please?

Thanks,
Aakash.


Re: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-11-03 Thread ayan guha
you can use 10 passes over the same dataset and build the data


On Fri, Nov 3, 2017 at 9:48 PM, Jean Georges Perrin <jper...@lumeris.com>
wrote:

> Write a UDF?
>
> On Oct 31, 2017, at 11:48, Aakash Basu <aakash.spark@gmail.com> wrote:
>
> Hey all,
>
> Any help in the below please?
>
> Thanks,
> Aakash.
>
>
> -- Forwarded message --
> From: Aakash Basu <aakash.spark@gmail.com>
> Date: Tue, Oct 31, 2017 at 9:17 PM
> Subject: Regarding column partitioning IDs and names as per hierarchical
> level SparkSQL
> To: user <user@spark.apache.org>
>
>
> Hi all,
>
> I have to generate a table with Spark-SQL with the following columns -
>
>
> Level One Id: VARCHAR(20) NULL
> Level One Name: VARCHAR( 50) NOT NULL
> Level Two Id: VARCHAR( 20) NULL
> Level Two Name: VARCHAR(50) NULL
> Level Thr ee Id: VARCHAR(20) NULL
> Level Thr ee Name: VARCHAR(50) NULL
> Level Four Id: VARCHAR(20) NULL
> Level Four Name: VARCHAR( 50) NULL
> Level Five Id: VARCHAR(20) NULL
> Level Five Name: VARCHAR(50) NULL
> Level Six Id: VARCHAR(20) NULL
> Level Six Name: VARCHAR(50) NULL
> Level Seven Id: VARCHAR( 20) NULL
> Level Seven Name: VARCHAR(50) NULL
> Level Eight Id: VARCHAR( 20) NULL
> Level Eight Name: VARCHAR(50) NULL
> Level Nine Id: VARCHAR(20) NULL
> Level Nine Name: VARCHAR( 50) NULL
> Level Ten Id: VARCHAR(20) NULL
> Level Ten Name: VARCHAR(50) NULL
>
> My input source has these columns -
>
>
> ID Description ParentID
> 10 Great-Grandfather
> 1010 Grandfather 10
> 101010 1. Father A 1010
> 101011 2. Father B 1010
> 101012 4. Father C 1010
> 101013 5. Father D 1010
> 101015 3. Father E 1010
> 101018 Father F 1010
> 101019 6. Father G 1010
> 101020 Father H 1010
> 101021 Father I 1010
> 101022 2A. Father J 1010
> 10101010 2. Father K 101010
> Like the above, I have ID till 20 digits, which means, I have 10 levels.
>
> I want to populate the ID and name itself along with all the parents till
> the root for any particular level, which I am unable to create a concrete
> logic for.
>
> Am using this way to fetch respecting levels and populate them in the
> respective columns but not their parents -
>
> Present Logic ->
>
> FinalJoin_DF = spark.sql("select "
>   + "case when length(a.id)/2 = '1' then a.id
> else ' ' end as level_one_id, "
>   + "case when length(a.id)/2 = '1' then a.desc else ' ' end as
> level_one_name, "
>   + "case when length(a.id)/2 = '2' then a.id else ' ' end as
> level_two_id, "
>   + "case when length(a.id)/2 = '2' then a.desc else ' ' end as
> level_two_name, "
>   + "case when length(a.id)/2 = '3' then a.id
> else ' ' end as level_three_id, "
>   + "case when length(a.id)/2 = '3' then a.desc
> else ' ' end as level_three_name, "
>   + "case when length(a.id)/2 = '4' then a.id
> else ' ' end as level_four_id, "
>   + "case when length(a.id)/2 = '4' then a.desc
> else ' ' end as level_four_name, "
>   + "case when length(a.id)/2 = '5' then a.id
> else ' ' end as level_five_id, "
>   + "case when length(a.id)/2 = '5' then a.desc
> else ' ' end as level_five_name, "
>   + "case when length(a.id)/2 = '6' then a.id
> else ' ' end as level_six_id, "
>   + "case when length(a.id)/2 = '6' then a.desc else ' ' end as
> level_six_name, "
>   + "case when length(a.id)/2 = '7' then a.id else ' ' end as
> level_seven_id, "
>   + "case when length(a.id)/2 = '7' then a.desc
> else ' ' end as level_seven_name, "
>   + "case when length(a.id)/2 = '8' then a.id
> else ' ' end as level_eight_id, "
>   + "case when length(a.id)/2 = '8' then a.desc else ' ' end as
> level_eight_name, "
>   + "case when length(a.id)/2 = '9' then a.id
> else ' ' end as level_nine_id, "
>   + "case when length(a.id)/2 = '9' then a.desc else ' ' end as
> level_nine_name, "
>   + "case when length(a.id)/2 = '10' then a.id else ' ' end as
> level_ten_id, "
>   + "case when length(a.id)/2 = '10' then a.desc
> else ' ' end as level_ten_name "
>   + "from CategoryTempTable a")
>
>
> Can someone help me in also populating all the parents levels in the
> respective level ID and level name, please?
>
>
> Thanks,
> Aakash.
>
>
>


-- 
Best Regards,
Ayan Guha


Re: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-11-03 Thread Jean Georges Perrin
Write a UDF?
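
For example, a minimal sketch of such a UDF, assuming each level contributes
exactly two digits to the id (so the level-k ancestor id is just the first 2*k
characters; table and column names follow the query quoted below):

import org.apache.spark.sql.functions._

// ancestor id at the requested level, or null if the row sits above that level
val ancestorId = udf { (id: String, level: Int) =>
  if (id != null && id.length >= 2 * level) id.substring(0, 2 * level) else null
}

val src = spark.table("CategoryTempTable")            // columns: id, desc, parentid
val withLevelIds = (1 to 10).foldLeft(src) { (df, k) =>
  df.withColumn(s"level_${k}_id", ancestorId(col("id"), lit(k)))
}
// the level names can then be filled in by joining withLevelIds back to src
// on each level_k_id column (one self-join per level)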

> On Oct 31, 2017, at 11:48, Aakash Basu <aakash.spark@gmail.com 
> <mailto:aakash.spark@gmail.com>> wrote:
> 
> Hey all,
> 
> Any help in the below please?
> 
> Thanks,
> Aakash.
> 
> 
> -- Forwarded message --
> From: Aakash Basu <aakash.spark@gmail.com 
> <mailto:aakash.spark@gmail.com>>
> Date: Tue, Oct 31, 2017 at 9:17 PM
> Subject: Regarding column partitioning IDs and names as per hierarchical 
> level SparkSQL
> To: user <user@spark.apache.org <mailto:user@spark.apache.org>>
> 
> 
> Hi all,
> 
> I have to generate a table with Spark-SQL with the following columns -
> 
> 
> Level One Id: VARCHAR(20) NULL
> Level One Name: VARCHAR( 50) NOT NULL
> Level Two Id: VARCHAR( 20) NULL
> Level Two Name: VARCHAR(50) NULL
> Level Thr ee Id: VARCHAR(20) NULL
> Level Thr ee Name: VARCHAR(50) NULL
> Level Four Id: VARCHAR(20) NULL
> Level Four Name: VARCHAR( 50) NULL
> Level Five Id: VARCHAR(20) NULL
> Level Five Name: VARCHAR(50) NULL
> Level Six Id: VARCHAR(20) NULL
> Level Six Name: VARCHAR(50) NULL
> Level Seven Id: VARCHAR( 20) NULL
> Level Seven Name: VARCHAR(50) NULL
> Level Eight Id: VARCHAR( 20) NULL
> Level Eight Name: VARCHAR(50) NULL
> Level Nine Id: VARCHAR(20) NULL
> Level Nine Name: VARCHAR( 50) NULL
> Level Ten Id: VARCHAR(20) NULL
> Level Ten Name: VARCHAR(50) NULL
> 
> My input source has these columns -
> 
> 
> IDDescription ParentID
> 10Great-Grandfather
> 1010  Grandfather 10
> 1010101. Father A 1010
> 1010112. Father B 1010
> 1010124. Father C 1010
> 1010135. Father D 1010
> 1010153. Father E 1010
> 101018Father F1010
> 1010196. Father G 1010
> 101020Father H1010
> 101021Father I1010
> 1010222A. Father J1010
> 10101010  2. Father K 101010
> 
> Like the above, I have ID till 20 digits, which means, I have 10 levels.
> 
> I want to populate the ID and name itself along with all the parents till the 
> root for any particular level, which I am unable to create a concrete logic 
> for.
> 
> Am using this way to fetch respecting levels and populate them in the 
> respective columns but not their parents -
> 
> Present Logic ->
> 
> FinalJoin_DF = spark.sql("select "
>   + "case when length(a.id <http://a.id/>)/2 = '1' 
> then a.id <http://a.id/> else ' ' end as level_one_id, "
> + "case when length(a.id <http://a.id/>)/2 = '1' then 
> a.desc else ' ' end as level_one_name, "
> + "case when length(a.id <http://a.id/>)/2 = '2' then 
> a.id <http://a.id/> else ' ' end as level_two_id, "
> + "case when length(a.id <http://a.id/>)/2 = '2' then 
> a.desc else ' ' end as level_two_name, "
>   + "case when length(a.id <http://a.id/>)/2 = '3' 
> then a.id <http://a.id/> else ' ' end as level_three_id, "
>   + "case when length(a.id <http://a.id/>)/2 = '3' 
> then a.desc else ' ' end as level_three_name, "
>   + "case when length(a.id <http://a.id/>)/2 = '4' 
> then a.id <http://a.id/> else ' ' end as level_four_id, "
>   + "case when length(a.id <http://a.id/>)/2 = '4' 
> then a.desc else ' ' end as level_four_name, "
>   + "case when length(a.id <http://a.id/>)/2 = '5' 
> then a.id <http://a.id/> else ' ' end as level_five_id, "
>   + "case when length(a.id <http://a.id/>)/2 = '5' 
> then a.desc else ' ' end as level_five_name, "
>   + "case when length(a.id <http://a.id/>)/2 = '6' 
> then a.id <http://a.id/> else ' ' end as level_six_id, "
> + "case when length(a.id <http://a.id/>)/2 = '6' then 
> a.desc else ' ' end as level_six_name, "
> + "case when length(a.id <http://a.id/>)/2 = '7' then 
> a.id <http://a.id/> else ' ' end as level_seven_id, "
>   + "case when length(a.id <http://a.id/>)/2 = '7' 
> then a.desc else ' ' end as level_seven_name, "
>   + "case when length(a.id <http://a.id/>)/2 = '8' 
> then a.id <http://a.id/> else ' ' end as level_eight_id, "
> 

Fwd: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-10-31 Thread Aakash Basu
Hey all,

Any help in the below please?

Thanks,
Aakash.


-- Forwarded message --
From: Aakash Basu <aakash.spark@gmail.com>
Date: Tue, Oct 31, 2017 at 9:17 PM
Subject: Regarding column partitioning IDs and names as per hierarchical
level SparkSQL
To: user <user@spark.apache.org>


Hi all,

I have to generate a table with Spark-SQL with the following columns -


Level One Id: VARCHAR(20) NULL
Level One Name: VARCHAR( 50) NOT NULL
Level Two Id: VARCHAR( 20) NULL
Level Two Name: VARCHAR(50) NULL
Level Thr ee Id: VARCHAR(20) NULL
Level Thr ee Name: VARCHAR(50) NULL
Level Four Id: VARCHAR(20) NULL
Level Four Name: VARCHAR( 50) NULL
Level Five Id: VARCHAR(20) NULL
Level Five Name: VARCHAR(50) NULL
Level Six Id: VARCHAR(20) NULL
Level Six Name: VARCHAR(50) NULL
Level Seven Id: VARCHAR( 20) NULL
Level Seven Name: VARCHAR(50) NULL
Level Eight Id: VARCHAR( 20) NULL
Level Eight Name: VARCHAR(50) NULL
Level Nine Id: VARCHAR(20) NULL
Level Nine Name: VARCHAR( 50) NULL
Level Ten Id: VARCHAR(20) NULL
Level Ten Name: VARCHAR(50) NULL

My input source has these columns -


ID Description ParentID
10 Great-Grandfather
1010 Grandfather 10
101010 1. Father A 1010
101011 2. Father B 1010
101012 4. Father C 1010
101013 5. Father D 1010
101015 3. Father E 1010
101018 Father F 1010
101019 6. Father G 1010
101020 Father H 1010
101021 Father I 1010
101022 2A. Father J 1010
10101010 2. Father K 101010
Like the above, I have ID till 20 digits, which means, I have 10 levels.

I want to populate the ID and name itself along with all the parents till
the root for any particular level, which I am unable to create a concrete
logic for.

Am using this way to fetch respecting levels and populate them in the
respective columns but not their parents -

Present Logic ->

FinalJoin_DF = spark.sql("select "
  + "case when length(a.id)/2 = '1' then a.id else
' ' end as level_one_id, "
  + "case when length(a.id)/2 = '1' then a.desc else ' ' end as
level_one_name, "
  + "case when length(a.id)/2 = '2' then a.id else ' ' end as level_two_id,
"
  + "case when length(a.id)/2 = '2' then a.desc else ' ' end as
level_two_name, "
  + "case when length(a.id)/2 = '3' then a.id else
' ' end as level_three_id, "
  + "case when length(a.id)/2 = '3' then a.desc
else ' ' end as level_three_name, "
  + "case when length(a.id)/2 = '4' then a.id else
' ' end as level_four_id, "
  + "case when length(a.id)/2 = '4' then a.desc
else ' ' end as level_four_name, "
  + "case when length(a.id)/2 = '5' then a.id else
' ' end as level_five_id, "
  + "case when length(a.id)/2 = '5' then a.desc
else ' ' end as level_five_name, "
  + "case when length(a.id)/2 = '6' then a.id else
' ' end as level_six_id, "
  + "case when length(a.id)/2 = '6' then a.desc else ' ' end as
level_six_name, "
  + "case when length(a.id)/2 = '7' then a.id else ' ' end as
level_seven_id, "
  + "case when length(a.id)/2 = '7' then a.desc
else ' ' end as level_seven_name, "
  + "case when length(a.id)/2 = '8' then a.id else
' ' end as level_eight_id, "
  + "case when length(a.id)/2 = '8' then a.desc else ' ' end as
level_eight_name, "
  + "case when length(a.id)/2 = '9' then a.id else
' ' end as level_nine_id, "
  + "case when length(a.id)/2 = '9' then a.desc else ' ' end as
level_nine_name, "
  + "case when length(a.id)/2 = '10' then a.id else ' ' end as
level_ten_id, "
  + "case when length(a.id)/2 = '10' then a.desc
else ' ' end as level_ten_name "
  + "from CategoryTempTable a")


Can someone help me in also populating all the parents levels in the
respective level ID and level name, please?


Thanks,
Aakash.


Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-10-31 Thread Aakash Basu
Hi all,

I have to generate a table with Spark-SQL with the following columns -


Level One Id: VARCHAR(20) NULL
Level One Name: VARCHAR( 50) NOT NULL
Level Two Id: VARCHAR( 20) NULL
Level Two Name: VARCHAR(50) NULL
Level Thr ee Id: VARCHAR(20) NULL
Level Thr ee Name: VARCHAR(50) NULL
Level Four Id: VARCHAR(20) NULL
Level Four Name: VARCHAR( 50) NULL
Level Five Id: VARCHAR(20) NULL
Level Five Name: VARCHAR(50) NULL
Level Six Id: VARCHAR(20) NULL
Level Six Name: VARCHAR(50) NULL
Level Seven Id: VARCHAR( 20) NULL
Level Seven Name: VARCHAR(50) NULL
Level Eight Id: VARCHAR( 20) NULL
Level Eight Name: VARCHAR(50) NULL
Level Nine Id: VARCHAR(20) NULL
Level Nine Name: VARCHAR( 50) NULL
Level Ten Id: VARCHAR(20) NULL
Level Ten Name: VARCHAR(50) NULL

My input source has these columns -


ID Description ParentID
10 Great-Grandfather
1010 Grandfather 10
101010 1. Father A 1010
101011 2. Father B 1010
101012 4. Father C 1010
101013 5. Father D 1010
101015 3. Father E 1010
101018 Father F 1010
101019 6. Father G 1010
101020 Father H 1010
101021 Father I 1010
101022 2A. Father J 1010
10101010 2. Father K 101010
Like the above, I have IDs of up to 20 digits, which means I have 10 levels.

For any particular level, I want to populate the ID and name themselves along
with all the parents up to the root, and I am unable to come up with a concrete
logic for this.

I am using the following to fetch the respective levels and populate them in
the respective columns, but not their parents -

Present Logic ->

FinalJoin_DF = spark.sql("select "
  + "case when length(a.id)/2 = '1' then a.id else
' ' end as level_one_id, "
  + "case when length(a.id)/2 = '1' then a.desc else ' ' end as
level_one_name, "
  + "case when length(a.id)/2 = '2' then a.id else ' ' end as level_two_id,
"
  + "case when length(a.id)/2 = '2' then a.desc else ' ' end as
level_two_name, "
  + "case when length(a.id)/2 = '3' then a.id else
' ' end as level_three_id, "
  + "case when length(a.id)/2 = '3' then a.desc
else ' ' end as level_three_name, "
  + "case when length(a.id)/2 = '4' then a.id else
' ' end as level_four_id, "
  + "case when length(a.id)/2 = '4' then a.desc
else ' ' end as level_four_name, "
  + "case when length(a.id)/2 = '5' then a.id else
' ' end as level_five_id, "
  + "case when length(a.id)/2 = '5' then a.desc
else ' ' end as level_five_name, "
  + "case when length(a.id)/2 = '6' then a.id else
' ' end as level_six_id, "
  + "case when length(a.id)/2 = '6' then a.desc else ' ' end as
level_six_name, "
  + "case when length(a.id)/2 = '7' then a.id else ' ' end as
level_seven_id, "
  + "case when length(a.id)/2 = '7' then a.desc
else ' ' end as level_seven_name, "
  + "case when length(a.id)/2 = '8' then a.id else
' ' end as level_eight_id, "
  + "case when length(a.id)/2 = '8' then a.desc else ' ' end as
level_eight_name, "
  + "case when length(a.id)/2 = '9' then a.id else
' ' end as level_nine_id, "
  + "case when length(a.id)/2 = '9' then a.desc else ' ' end as
level_nine_name, "
  + "case when length(a.id)/2 = '10' then a.id else ' ' end as
level_ten_id, "
  + "case when length(a.id)/2 = '10' then a.desc
else ' ' end as level_ten_name "
  + "from CategoryTempTable a")


Can someone help me in also populating all the parents levels in the
respective level ID and level name, please?


Thanks,
Aakash.


Re: Spark Streaming - Multiple Spark Contexts (SparkSQL) Performance

2017-10-01 Thread Gerard Maas
Hammad,

The recommended way to implement this logic would be to:

- Create a SparkSession.
- Create a StreamingContext using the SparkContext embedded in the
  SparkSession.
- Use the single SparkSession instance for the SQL operations within the
  foreachRDD.

It's important to note that Spark operations can process the complete
dataset. In this case, there's no need to do a per-partition or per-element
operation (that would be the case if we were directly using the driver's
API and DB connections).

Reorganizing the code in the question a bit, we should have:

 SparkSession sparkSession = SparkSession
.builder()
.setMaster("local[2]").setAppName("TransformerStreamPOC")

.config("spark.some.config.option", "some-value")
.getOrCreate();

JavaStreamingContext jssc = new
JavaStreamingContext(sparkSession.sparkContext,
Durations.seconds(60));

// this dataset doesn't seem to depend on the received data, so we can
load it once.

Dataset baselineData =
sparkSession.read().jdbc(MYSQL_CONNECTION_URL, "table_name",
connectionProperties);

// create dstream

DStream dstream = ...

... operations on dstream...

dstream.foreachRDD { rdd =>

Dataset incomingData = sparkSession.createDataset(rdd)

   ... do something the incoming dataset, eg. join with the baseline ...

   DataFrame joined =  incomingData.join(baselineData, ...)

   ... do something with joined ...

  }
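
A compact Scala sketch of the same pattern (the socket source, JDBC url, table
name and join column below are stand-ins, not the original Kafka/MySQL setup):

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("TransformerStreamPOC")
  .getOrCreate()
import spark.implicits._

// one streaming context, built on the SparkContext already inside the session
val ssc = new StreamingContext(spark.sparkContext, Seconds(60))

// the baseline dataset is loaded once and reused in every micro-batch
val baselineData = spark.read.jdbc(
  "jdbc:mysql://host:3306/db", "table_name", new java.util.Properties())

val lines = ssc.socketTextStream("localhost", 9999)   // stand-in for the Kafka stream

lines.foreachRDD { rdd =>
  val incoming = rdd.toDF("key")                      // this closure runs on the driver
  val joined = incoming.join(baselineData, incoming("key") === baselineData("id"))
  joined.show()
}

ssc.start()
ssc.awaitTermination()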


kr, Gerard.

On Sun, Oct 1, 2017 at 7:55 PM, Hammad <ham...@flexilogix.com> wrote:

> Hello,
>
> *Background:*
>
> I have Spark Streaming context;
>
> SparkConf conf = new 
> SparkConf().setMaster("local[2]").setAppName("TransformerStreamPOC");
> conf.set("spark.driver.allowMultipleContexts", "true");   *<== this*
> JavaStreamingContext jssc = new JavaStreamingContext(conf, 
> Durations.seconds(60));
>
>
> that subscribes to certain kafka *topics*;
>
> JavaInputDStream<ConsumerRecord<String, String>> stream =
> KafkaUtils.createDirectStream(
> jssc,
> LocationStrategies.PreferConsistent(),
> ConsumerStrategies.<String, String>Subscribe(*topics*, 
> kafkaParams)
> );
>
> when messages arrive in queue, I recursively process them as follows (below 
> code section will repeat in Question statement)
>
> stream.foreachRDD(rdd -> {
> //process here - below two scenarions code is inserted here
>
> });
>
>
> *Question starts here:*
>
> Since I need to apply SparkSQL to received events in Queue - I create 
> SparkSession with two scenarios;
>
> *1) Per partition one sparkSession (after 
> "spark.driver.allowMultipleContexts" set to true); so all events under this 
> partition are handled by same sparkSession*
>
> rdd.foreachPartition(partition -> {
> SparkSession sparkSession = SparkSession
> .builder()
> .appName("Java Spark SQL basic example")
> .config("spark.some.config.option", "some-value")
> .getOrCreate();
>
> while (partition.hasNext()) {
>   Dataset df = sparkSession.read().jdbc(MYSQL_CONNECTION_URL, 
> "table_name", connectionProperties);
>
> }}
>
> *2) Per event under each session; so each event under each queue under each 
> stream has one sparkSession;*
>
> rdd.foreachPartition(partition -> {while (partition.hasNext()) {
> SparkSession sparkSession = SparkSession.builder().appName("Java Spark SQL 
> basic example").config("spark.some.config.option", 
> "some-value").getOrCreate();
>
> Dataset df = sparkSession.read().jdbc(MYSQL_CONNECTION_URL, 
> "table_name", connectionProperties);
>
> }}
>
>
> Is it good practice to create multiple contexts (lets say 10 or 100)?
> How does number of sparkContext to be allowed vs number of worker nodes
> relate?
> What are performance considerations with respect to scenario1 and
> scenario2?
>
> I am looking for these answers as I feel there is more to what I
> understand of performance w.r.t sparkContexts created by a streaming
> application.
> Really appreciate your support in anticipation.
>
> Hammad
>
>


Fwd: Spark Streaming - Multiple Spark Contexts (SparkSQL) Performance

2017-10-01 Thread Hammad
Hello,

*Background:*

I have Spark Streaming context;

SparkConf conf = new
SparkConf().setMaster("local[2]").setAppName("TransformerStreamPOC");
conf.set("spark.driver.allowMultipleContexts", "true");   *<== this*
JavaStreamingContext jssc = new JavaStreamingContext(conf,
Durations.seconds(60));


that subscribes to certain kafka *topics*;

JavaInputDStream<ConsumerRecord<String, String>> stream =
KafkaUtils.createDirectStream(
jssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(*topics*,
kafkaParams)
);

when messages arrive in queue, I recursively process them as follows
(below code section will repeat in Question statement)

stream.foreachRDD(rdd -> {
//process here - below two scenarions code is inserted here

});


*Question starts here:*

Since I need to apply SparkSQL to received events in Queue - I create
SparkSession with two scenarios;

*1) Per partition one sparkSession (after
"spark.driver.allowMultipleContexts" set to true); so all events under
this partition are handled by same sparkSession*

rdd.foreachPartition(partition -> {
SparkSession sparkSession = SparkSession
.builder()
.appName("Java Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate();

while (partition.hasNext()) {
  Dataset<Row> df = sparkSession.read().jdbc(MYSQL_CONNECTION_URL,
"table_name", connectionProperties);

}}

*2) Per event under each session; so each event under each queue under
each stream has one sparkSession;*

rdd.foreachPartition(partition -> {while (partition.hasNext()) {
 SparkSession sparkSession = SparkSession.builder().appName("Java
Spark SQL basic example").config("spark.some.config.option",
"some-value").getOrCreate();

Dataset<Row> df = sparkSession.read().jdbc(MYSQL_CONNECTION_URL,
"table_name", connectionProperties);

}}


Is it good practice to create multiple contexts (let's say 10 or 100)?
How does the number of SparkContexts allowed relate to the number of worker
nodes?
What are performance considerations with respect to scenario1 and scenario2?

I am looking for these answers as I feel there is more to what I understand
of performance w.r.t sparkContexts created by a streaming application.
Really appreciate your support in anticipation.

Hammad


hive2 query using SparkSQL seems wrong

2017-09-25 Thread Cinyoung Hur
Hi,
I'm using hive 2.3.0, spark 2.1.1, and zeppelin 0.7.2.

When I submit the query in the Hive interpreter, it works fine.
I can see exactly the same query in the Zeppelin notebook and the HiveServer2 web UI.

However, when I submitted the query using SparkSQL, the query seemed wrong.
For example, every column is wrapped in double quotes, like this:

SELECT
"component_2015.spec_id_sno","component_2015.jid","component_2015.fom_tp_cd","component_2015.dif",...
FROM component_2015


And the query just finished without any results.
Is this a problem with Spark, or with Hive?
Please help me.

Regards,
Cinyoung


transactional data in sparksql

2017-07-31 Thread luohui20001
Hello guys: I have some transactional data, as in the attached file 1.txt. A
sequence of a single operation 1 followed by a few operations 0 is a
transaction here. I need to find the transactions whose sum(Amount) for
operation 0 is less than the sum(Amount) for operation 1.
     There are several questions here:
1. What is the most sensible way to deal with this kind of transaction? Does a
UDAF help? Or does SparkSQL provide transactional support? I remember that Hive
has some support for transactions, like
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-GrammarChanges.
2. The data has been sorted by timestamp. How about getting those transactions
within a time period, like 24 hours?
     Thank you.



 

Thanks & best regards!
San.Luo
|Account|Operation| Timestamp| Amount|
+---+-+--+---+
| 13|1|1400017208| 674.33|
| 13|0|1400102650|  73.86|
| 13|1|1400130576|1155.48|
| 13|1|1400165378|  96.04|
| 13|0|1400245724| 173.84|
| 13|0|1400258007| 852.29|
| 13|1|1400265065|2085.32|
| 13|0|1400329127|  429.3|
| 13|0|1400383007|  611.2|
| 13|1|1400428342|1629.76|
| 13|0|1400457645| 490.55|
| 13|1|1400516552| 369.54|
| 13|1|1400618678|1316.05|
| 13|0|1400655615| 573.71|
| 13|0|1400696930| 877.16|
| 13|0|1400732011| 105.51|
| 13|0|1400751612|1512.23|
| 13|0|1400761888| 414.36|
| 13|0|1400814042|  36.52|
| 13|0|1400831895| 611.15|
+---+-+--+---+
only showing top 20 rows

SQL: select * from r where Account=13 limit 20
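
For question 1, one possible sketch (not from this thread): tag each row with a
transaction id via a running sum of the Operation column per Account ordered by
Timestamp, then compare the per-transaction sums. Here "r" is the table queried
above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = spark.table("r")

// every operation-1 row starts a new transaction, so a running sum of
// Operation per Account gives a transaction id
val byTime = Window.partitionBy("Account").orderBy("Timestamp")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val withTxn = df.withColumn("txn_id", sum(col("Operation")).over(byTime))

val suspicious = withTxn.groupBy("Account", "txn_id")
  .agg(
    sum(when(col("Operation") === 1, col("Amount"))).as("amt_1"),
    sum(when(col("Operation") === 0, col("Amount"))).as("amt_0"))
  .filter(col("amt_0") < col("amt_1"))
suspicious.show()
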
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: SparkSQL to read XML Blob data to create multiple rows

2017-07-08 Thread Amol Talap
Hi Zhang & All,

Thanks so much for your earlier response. I am trying to get to the final
solution. We could parse the data successfully; however, I am getting a
NullPointerException while mapping it back. Can you please advise on the
findings below?

spark-shell --packages com.databricks:spark-xml_2.11:0.4.1
//--
scala> spark.version
res19: String = 2.0.0

scala >
def xml2DF (xml:String) = {
 | val xRDD = sc.parallelize(Seq(xml))
 | val df = new XmlReader().xmlRdd(spark.sqlContext,xRDD)
 | val new_df = df.withColumn("comment",
explode(df("Comments.Comment"))).select($"comment.Description",$"comment.Title")
 | new_df.collectAsList}
xml2DF: (xml: String)java.util.List[org.apache.spark.sql.Row]
//---
scala> val xml3="<Comments><Comment><Title>Title3.1</Title><Description>Descript3.1</Description></Comment><Comment><Title>Title.1</Title><Description>Descript.1</Description></Comment></Comments>"

scala> xml2DF(xml3)
res18: java.util.List[org.apache.spark.sql.Row] =
[[Descript3.1,Title3.1], [Descript.1,Title.1]]
//xml2DF works as expected
//---
val rdd = sc.textFile("file:///home/spark/XML_Project/data.txt")
val xml_pRDDs = rdd.map(x=>(x.split(',')(0).toInt, x.split(',')(3)))
scala> xml_pRDDs.map(x=>(x._1,"call xml2DF "+x._2)).collect
res17: Array[(Int, String)] = Array((1,call xml2DF <Comments><Comment><Title>Title1.1</Title><Description>Descript1.1</Description></Comment>...), ...)

scala> xml_pRDDs.map(x=>(x._1,xml2DF(x._2))).collect
//This gives a NullPointerException; I was expecting a result as below
//Array[(Int, java.util.List[org.apache.spark.sql.Row])] =
Array((1,[[Descript3.1,Title3.1], [Descript.1,Title.1]]),

Below is the content of the data.txt file.
1,Amol,Kolhapur,<Comments><Comment><Title>Title1.1</Title><Description>Descript1.1</Description></Comment><Comment><Title>Title1.2</Title><Description>Descript1.2</Description></Comment><Comment><Title>Title1.3</Title><Description>Descript1.3</Description></Comment></Comments>
2,Ameet,Bangalore,<Comments><Comment><Title>Title2.1</Title><Description>Descript2.1</Description></Comment><Comment><Title>Title2.2</Title><Description>Descript2.2</Description></Comment></Comments>
3,Rajesh,Jaipur,<Comments><Comment><Title>Title3.1</Title><Description>Descript3.1</Description></Comment><Comment><Title>Title3.2</Title><Description>Descript3.2</Description></Comment><Comment><Title>Title3.3</Title><Description>Descript3.3</Description></Comment><Comment><Title>Title3.4</Title><Description>Descript3.4</Description></Comment></Comments>

Regards,
Amol



On 6/29/17, Yong Zhang <java8...@hotmail.com> wrote:
> scala>spark.version
> res6: String = 2.1.1
>
> scala>val rdd  =
> sc.parallelize(Seq("""Title1.1Description_1.1
> Title1.2Description_1.2
> Title1.3Description_1.3
> """))
> rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at
> parallelize at :24
>
> scala>import com.databricks.spark.xml.XmlReader
>
> scala>val df = new XmlReader().xmlRdd(spark.sqlContext, rdd)
> df: org.apache.spark.sql.DataFrame = [Comments: struct<Comment: array<struct<Description:string,Title:string>>>]
>
> scala>df.printSchema
> root
>  |-- Comments: struct (nullable = true)
>  ||-- Comment: array (nullable = true)
>  |||-- element: struct (containsNull = true)
>  ||||-- Description: string (nullable = true)
>  ||||-- Title: string (nullable = true)
>
> scala>df.show(false)
> +--+
> |Comments
>   |
> +--+
> |[WrappedArray([Description_1.1,Title1.1], [Description_1.2,Title1.2],
> [Description_1.3,Title1.3])]|
> +--+
>
>
> scala>df.withColumn("comment",
> explode(df("Comments.Comment"))).select($"comment.Description",
> $"comment.Title").show
> +---++
> |Description|   Title|
> +---++
> |Description_1.1|Title1.1|
> |Description_1.2|Title1.2|
> |Description_1.3|Title1.3|
> +---++
>
>
>
> 
> From: Talap, Amol <amol.ta...@capgemini.com>
> Sent: Thursday, June 29, 2017 9:38 AM
> To: Judit Planas; user@spark.apache.org
> Subject: RE: SparkSQL to read XML Blob data to create multiple rows
>
>
> Thanks Judit, Ayan
>
> Judit,
>
> You almost got it. The explode might help here.
>
> But when I tried I see load() doesn’t like to read from xmlcomment column on
> oracle_data.
>
>
>
> scala> val xmlDF = sqlContext.sql("SELECT * FROM oracle_data")
>
> 17/06/29 18:31:58 INFO parse.ParseDriver: Parsing command: SELECT * FROM
> oracle_data
>
> 17/06/29 18:31:58 INFO parse.ParseDriver: Parse Completed
>
> …
>
> scala> val xmlDF_flattened =
> xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))
>
> :22: error: overloaded method value load with alternatives:
>
>   ()org.apache.spark.sql.DataFrame 
>
>   (path: String)org.apache.spark.sql.DataFrame
>
> cannot be applied to (org.a

RE: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Talap, Amol
Thanks so much Zhang. This definitely helps.

From: Yong Zhang [mailto:java8...@hotmail.com]
Sent: Thursday, June 29, 2017 4:59 PM
To: Talap, Amol; Judit Planas; user@spark.apache.org
Subject: Re: SparkSQL to read XML Blob data to create multiple rows


scala>spark.version

res6: String = 2.1.1



scala>val rdd  = sc.parallelize(Seq("""<Comments><Comment><Title>Title1.1</Title><Description>Description_1.1</Description></Comment>
<Comment><Title>Title1.2</Title><Description>Description_1.2</Description></Comment>
<Comment><Title>Title1.3</Title><Description>Description_1.3</Description></Comment></Comments>
"""))

rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize 
at :24



scala>import com.databricks.spark.xml.XmlReader



scala>val df = new XmlReader().xmlRdd(spark.sqlContext, rdd)

df: org.apache.spark.sql.DataFrame = [Comments: struct<Comment: array<struct<Description:string,Title:string>>>]



scala>df.printSchema

root

 |-- Comments: struct (nullable = true)

 ||-- Comment: array (nullable = true)

 |||-- element: struct (containsNull = true)

 ||||-- Description: string (nullable = true)

 ||||-- Title: string (nullable = true)



scala>df.show(false)

+--+

|Comments   
   |

+--+

|[WrappedArray([Description_1.1,Title1.1], [Description_1.2,Title1.2], 
[Description_1.3,Title1.3])]|

+--+





scala>df.withColumn("comment", 
explode(df("Comments.Comment"))).select($"comment.Description", 
$"comment.Title").show

+---++

|Description|   Title|

+---++

|Description_1.1|Title1.1|

|Description_1.2|Title1.2|

|Description_1.3|Title1.3|

+---++



From: Talap, Amol <amol.ta...@capgemini.com<mailto:amol.ta...@capgemini.com>>
Sent: Thursday, June 29, 2017 9:38 AM
To: Judit Planas; user@spark.apache.org<mailto:user@spark.apache.org>
Subject: RE: SparkSQL to read XML Blob data to create multiple rows


Thanks Judit, Ayan

Judit,

You almost got it. The explode might help here.

But when I tried I see load() doesn't like to read from xmlcomment column on 
oracle_data.



scala> val xmlDF = sqlContext.sql("SELECT * FROM oracle_data")

17/06/29 18:31:58 INFO parse.ParseDriver: Parsing command: SELECT * FROM 
oracle_data

17/06/29 18:31:58 INFO parse.ParseDriver: Parse Completed

...

scala> val xmlDF_flattened = 
xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))

:22: error: overloaded method value load with alternatives:

  ()org.apache.spark.sql.DataFrame 

  (path: String)org.apache.spark.sql.DataFrame

cannot be applied to (org.apache.spark.sql.ColumnName)

   val xmlDF_flattened = 
xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))



Ayan,

Output of books_inexp.show was as below
title, author
Midnight Rain,    Ralls, Kim
Maeve Ascendant,  Corets, Eva



Regards,

Amol

From: Judit Planas [mailto:judit.pla...@epfl.ch]
Sent: Thursday, June 29, 2017 3:46 AM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: SparkSQL to read XML Blob data to create multiple rows



Hi Amol,

Not sure I understand completely your question, but the SQL function "explode" 
may help you:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode




Here you can find a nice example:
https://stackoverflow.com/questions/38210507/explode-in-pyspark




HTH,
Judit

On 29/06/17 09:05, ayan guha wrote:

Hi



Not sure if I follow your issue. Can you please post output of 
books_inexp.show()?



On Thu, Jun 29, 2017 at 2:30 PM, Talap, Amol 
<amol.ta...@capgemini.com<m

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Yong Zhang
scala>spark.version
res6: String = 2.1.1

scala>val rdd  = 
sc.parallelize(Seq("""Title1.1Description_1.1
Title1.2Description_1.2
Title1.3Description_1.3
"""))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize 
at :24

scala>import com.databricks.spark.xml.XmlReader

scala>val df = new XmlReader().xmlRdd(spark.sqlContext, rdd)
df: org.apache.spark.sql.DataFrame = [Comments: struct<Comment: array<struct<Description:string,Title:string>>>]

scala>df.printSchema
root
 |-- Comments: struct (nullable = true)
 ||-- Comment: array (nullable = true)
 |||-- element: struct (containsNull = true)
 ||||-- Description: string (nullable = true)
 ||||-- Title: string (nullable = true)

scala>df.show(false)
+--+
|Comments   
   |
+--+
|[WrappedArray([Description_1.1,Title1.1], [Description_1.2,Title1.2], 
[Description_1.3,Title1.3])]|
+--+


scala>df.withColumn("comment", 
explode(df("Comments.Comment"))).select($"comment.Description", 
$"comment.Title").show
+---++
|Description|   Title|
+---++
|Description_1.1|Title1.1|
|Description_1.2|Title1.2|
|Description_1.3|Title1.3|
+---++




From: Talap, Amol <amol.ta...@capgemini.com>
Sent: Thursday, June 29, 2017 9:38 AM
To: Judit Planas; user@spark.apache.org
Subject: RE: SparkSQL to read XML Blob data to create multiple rows


Thanks Judit, Ayan

Judit,

You almost got it. The explode might help here.

But when I tried I see load() doesn’t like to read from xmlcomment column on 
oracle_data.



scala> val xmlDF = sqlContext.sql("SELECT * FROM oracle_data")

17/06/29 18:31:58 INFO parse.ParseDriver: Parsing command: SELECT * FROM 
oracle_data

17/06/29 18:31:58 INFO parse.ParseDriver: Parse Completed

…

scala> val xmlDF_flattened = xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))

<console>:22: error: overloaded method value load with alternatives:
  ()org.apache.spark.sql.DataFrame <and>
  (path: String)org.apache.spark.sql.DataFrame
 cannot be applied to (org.apache.spark.sql.ColumnName)
       val xmlDF_flattened = xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))



Ayan,

Output of books_inexp.show was as below
title, author
Midnight Rain,Ralls, Kim
Maeve Ascendant,  Corets, Eva



Regards,

Amol

From: Judit Planas [mailto:judit.pla...@epfl.ch]
Sent: Thursday, June 29, 2017 3:46 AM
To: user@spark.apache.org
Subject: Re: SparkSQL to read XML Blob data to create multiple rows



Hi Amol,

Not sure I understand completely your question, but the SQL function "explode" 
may help you:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode





Here you can find a nice example:
https://stackoverflow.com/questions/38210507/explode-in-pyspark




HTH,
Judit

On 29/06/17 09:05, ayan guha wrote:

Hi



Not sure if I follow your issue. Can you please post output of 
books_inexp.show()?



On Thu, Jun 29, 2017 at 2:30 PM, Talap, Amol 
<amol.ta...@capgemini.com<mailto:amol.ta...@capgemini.com>> wrote:

Hi:



We are trying to parse XML data to get below output from given input sample.

Can someone suggest a way to pass one DFrames output into load() function or 
any other alternative to get this output.



Input Data from Oracle Table XMLBlob:

SequenceID | Name | City     | XMLComment
1          | Amol | Kolhapur | Title1.1Description_1.1Title1.2Description_1.2

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Judit Planas

Hi Ayan,

I'm afraid I can't help you here. I'm a beginner in Spark with 0 
experience in Scala (always used PySpark).


In my case, I had a list of numbers, and the explode function was 
exactly what I was looking for.


Sorry about that, hope you can solve it.

Best,
Judit

On 29/06/17 15:38, Talap, Amol wrote:


Thanks Judit, Ayan

Judit,

You almost got it. The explode might help here.

But when I tried I see load() doesn’t like to read from xmlcomment 
column on oracle_data.


scala> val xmlDF = sqlContext.sql("SELECT * FROM oracle_data")

17/06/29 18:31:58 INFO parse.ParseDriver: Parsing command: SELECT * 
FROM oracle_data


17/06/29 18:31:58 INFO parse.ParseDriver: Parse Completed

…

scala> val xmlDF_flattened = xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))

<console>:22: error: overloaded method value load with alternatives:
  ()org.apache.spark.sql.DataFrame <and>
  (path: String)org.apache.spark.sql.DataFrame
 cannot be applied to (org.apache.spark.sql.ColumnName)
       val xmlDF_flattened = xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))


Ayan,

Output of books_inexp.show was as below
title, author
Midnight Rain,Ralls, Kim
Maeve Ascendant,  Corets, Eva

Regards,

Amol

*From:*Judit Planas [mailto:judit.pla...@epfl.ch]
*Sent:* Thursday, June 29, 2017 3:46 AM
*To:* user@spark.apache.org
*Subject:* Re: SparkSQL to read XML Blob data to create multiple rows

Hi Amol,

Not sure I understand completely your question, but the SQL function 
"explode" may help you:

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode

Here you can find a nice example:
https://stackoverflow.com/questions/38210507/explode-in-pyspark

HTH,
Judit

On 29/06/17 09:05, ayan guha wrote:

Hi

Not sure if I follow your issue. Can you please post output of
books_inexp.show()?

On Thu, Jun 29, 2017 at 2:30 PM, Talap, Amol
<amol.ta...@capgemini.com <mailto:amol.ta...@capgemini.com>> wrote:

Hi:

We are trying to parse XML data to get below output from given
input sample.

Can someone suggest a way to pass one DFrames output into
load() function or any other alternative to get this output.

Input Data from Oracle Table XMLBlob:

SequenceID | Name    | City      | XMLComment
1          | Amol    | Kolhapur  | Title1.1Description_1.1Title1.2Description_1.2Title1.3Description_1.3
2          | Suresh  | Mumbai    | Title2Description_2
3          | Vishal  | Delhi     | Title3Description_3
4          | Swastik | Bangalore | Title4Description_4

Output Data Expected using Spark SQL:

SequenceID | Name    | City      | Title    | Description
1          | Amol    | Kolhapur  | Title1.1 | Description_1.1
1          | Amol    | Kolhapur  | Title1.2 | Description_1.2
1          | Amol    | Kolhapur  | Title1.3 | Description_1.3
2          | Suresh  | Mumbai    | Title2   | Description_2
3          | Vishal  | Delhi     | Title3.1 | Description_3.1
4          | Swastik | Bangalore | Title4   | Description_4

I am able to parse single XML using below approach in
spark-shell using example below but how do we apply the same
recursively for all rows ?


https://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html.



val dfX = sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load("books.xml")

val xData = dfX.registerTempTable("books")

dfX.printSchema()

val books_inexp =sqlCont

RE: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Talap, Amol
Thanks Judit, Ayan
Judit,
You almost got it. The explode might help here.
But when I tried I see load() doesn’t like to read from xmlcomment column on 
oracle_data.

scala> val xmlDF = sqlContext.sql("SELECT * FROM oracle_data")
17/06/29 18:31:58 INFO parse.ParseDriver: Parsing command: SELECT * FROM 
oracle_data
17/06/29 18:31:58 INFO parse.ParseDriver: Parse Completed
…
scala> val xmlDF_flattened = xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))

<console>:22: error: overloaded method value load with alternatives:
  ()org.apache.spark.sql.DataFrame <and>
  (path: String)org.apache.spark.sql.DataFrame
 cannot be applied to (org.apache.spark.sql.ColumnName)
       val xmlDF_flattened = xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))

Ayan,
Output of books_inexp.show was as below
title, author
Midnight Rain,Ralls, Kim
Maeve Ascendant,  Corets, Eva

Regards,
Amol
From: Judit Planas [mailto:judit.pla...@epfl.ch]
Sent: Thursday, June 29, 2017 3:46 AM
To: user@spark.apache.org
Subject: Re: SparkSQL to read XML Blob data to create multiple rows

Hi Amol,

Not sure I understand completely your question, but the SQL function "explode" 
may help you:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode

Here you can find a nice example:
https://stackoverflow.com/questions/38210507/explode-in-pyspark

HTH,
Judit
On 29/06/17 09:05, ayan guha wrote:
Hi

Not sure if I follow your issue. Can you please post output of 
books_inexp.show()?

On Thu, Jun 29, 2017 at 2:30 PM, Talap, Amol 
<amol.ta...@capgemini.com<mailto:amol.ta...@capgemini.com>> wrote:
Hi:

We are trying to parse XML data to get below output from given input sample.
Can someone suggest a way to pass one DFrames output into load() function or 
any other alternative to get this output.

Input Data from Oracle Table XMLBlob:
SequenceID | Name    | City      | XMLComment
1          | Amol    | Kolhapur  | Title1.1Description_1.1Title1.2Description_1.2Title1.3Description_1.3
2          | Suresh  | Mumbai    | Title2Description_2
3          | Vishal  | Delhi     | Title3Description_3
4          | Swastik | Bangalore | Title4Description_4


Output Data Expected using Spark SQL:
SequenceID | Name    | City      | Title    | Description
1          | Amol    | Kolhapur  | Title1.1 | Description_1.1
1          | Amol    | Kolhapur  | Title1.2 | Description_1.2
1          | Amol    | Kolhapur  | Title1.3 | Description_1.3
2          | Suresh  | Mumbai    | Title2   | Description_2
3          | Vishal  | Delhi     | Title3.1 | Description_3.1
4          | Swastik | Bangalore | Title4   | Description_4


I am able to parse single XML using below approach in spark-shell using example 
below but how do we apply the same recursively for all rows ?
https://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html.

val dfX = 
sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load("books.xml")

val xData = dfX.registerTempTable("books")

dfX.printSchema()

val books_inexp =sqlContext.sql("select title,author from books where price<10")

books_inexp.show


Regards,
Amol
This message contains information that may be privileged or confidential and is 
the property of the Capgemini Group. It is intended only for the person to whom 
it is addressed. If you are not the intended recipient, you are not authorized 
to read, print, retain, copy, disseminate, distribute, or use this message or 
any part thereof. If you receive this message in error, please notify the 
sender immediately and delete all copies of this message.



--
Best Regards,
Ayan Guha

This message contains information that may be privileged or confidential and is 
the property of the Capgemini Group. It is intended only for the person to whom 
it is addressed. If you are not the intended recipient, you are not authorized 
to read, print, retain, copy, disseminate, distribute, or use this message or 
any part thereof. If you receive this message in error, please notify the 
sender immediately and delete all copies of this message.


Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Judit Planas

Hi Amol,

Not sure I understand completely your question, but the SQL function 
"explode" may help you:

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode

Here you can find a nice example:
https://stackoverflow.com/questions/38210507/explode-in-pyspark

HTH,
Judit

On 29/06/17 09:05, ayan guha wrote:

Hi

Not sure if I follow your issue. Can you please post output of 
books_inexp.show()?


On Thu, Jun 29, 2017 at 2:30 PM, Talap, Amol > wrote:


Hi:

We are trying to parse XML data to get below output from given
input sample.

Can someone suggest a way to pass one DFrames output into load()
function or any other alternative to get this output.

Input Data from Oracle Table XMLBlob:

SequenceID | Name    | City      | XMLComment
1          | Amol    | Kolhapur  | Title1.1Description_1.1Title1.2Description_1.2Title1.3Description_1.3
2          | Suresh  | Mumbai    | Title2Description_2
3          | Vishal  | Delhi     | Title3Description_3
4          | Swastik | Bangalore | Title4Description_4

Output Data Expected using Spark SQL:

SequenceID | Name    | City      | Title    | Description
1          | Amol    | Kolhapur  | Title1.1 | Description_1.1
1          | Amol    | Kolhapur  | Title1.2 | Description_1.2
1          | Amol    | Kolhapur  | Title1.3 | Description_1.3
2          | Suresh  | Mumbai    | Title2   | Description_2
3          | Vishal  | Delhi     | Title3.1 | Description_3.1
4          | Swastik | Bangalore | Title4   | Description_4

I am able to parse single XML using below approach in spark-shell
using example below but how do we apply the same recursively for
all rows ?


https://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html

.



val dfX =

sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load("books.xml")

val xData = dfX.registerTempTable("books")

dfX.printSchema()

val books_inexp =sqlContext.sql("select title,author from books
where price<10")

books_inexp.show

Regards,

Amol

This message contains information that may be privileged or
confidential and is the property of the Capgemini Group. It is
intended only for the person to whom it is addressed. If you are
not the intended recipient, you are not authorized to read, print,
retain, copy, disseminate, distribute, or use this message or any
part thereof. If you receive this message in error, please notify
the sender immediately and delete all copies of this message.




--
Best Regards,
Ayan Guha




Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread ayan guha
Hi

Not sure if I follow your issue. Can you please post output of
books_inexp.show()?

On Thu, Jun 29, 2017 at 2:30 PM, Talap, Amol 
wrote:

> Hi:
>
>
>
> We are trying to parse XML data to get below output from given input
> sample.
>
> Can someone suggest a way to pass one DFrames output into load() function
> or any other alternative to get this output.
>
>
>
> Input Data from Oracle Table XMLBlob:
>
> SequenceID | Name    | City      | XMLComment
> 1          | Amol    | Kolhapur  | Title1.1Description_1.1Title1.2Description_1.2Title1.3Description_1.3
> 2          | Suresh  | Mumbai    | Title2Description_2
> 3          | Vishal  | Delhi     | Title3Description_3
> 4          | Swastik | Bangalore | Title4Description_4
>
>
>
> Output Data Expected using Spark SQL:
>
> SequenceID | Name    | City      | Title    | Description
> 1          | Amol    | Kolhapur  | Title1.1 | Description_1.1
> 1          | Amol    | Kolhapur  | Title1.2 | Description_1.2
> 1          | Amol    | Kolhapur  | Title1.3 | Description_1.3
> 2          | Suresh  | Mumbai    | Title2   | Description_2
> 3          | Vishal  | Delhi     | Title3.1 | Description_3.1
> 4          | Swastik | Bangalore | Title4   | Description_4
>
>
>
> I am able to parse single XML using below approach in spark-shell using
> example below but how do we apply the same recursively for all rows ?
>
> https://community.hortonworks.com/questions/71538/parsing-
> xml-in-spark-rdd.html.
>
>
> val dfX = sqlContext.read.format("com.databricks.spark.xml").option(
> "rowTag","book").load("books.xml")
>
> val xData = dfX.registerTempTable("books")
>
> dfX.printSchema()
>
> val books_inexp =sqlContext.sql("select title,author from books where
> price<10")
>
> books_inexp.show
>
>
>
> Regards,
>
> Amol
> This message contains information that may be privileged or confidential
> and is the property of the Capgemini Group. It is intended only for the
> person to whom it is addressed. If you are not the intended recipient, you
> are not authorized to read, print, retain, copy, disseminate, distribute,
> or use this message or any part thereof. If you receive this message in
> error, please notify the sender immediately and delete all copies of this
> message.
>



-- 
Best Regards,
Ayan Guha


SparkSQL to read XML Blob data to create multiple rows

2017-06-28 Thread Talap, Amol
Hi:

We are trying to parse XML data to get the output below from the given input sample.
Can someone suggest a way to pass one DataFrame's output into the load() function, or any other alternative to get this output?

Input Data from Oracle Table XMLBlob:
SequenceID | Name    | City      | XMLComment
1          | Amol    | Kolhapur  | Title1.1Description_1.1Title1.2Description_1.2Title1.3Description_1.3
2          | Suresh  | Mumbai    | Title2Description_2
3          | Vishal  | Delhi     | Title3Description_3
4          | Swastik | Bangalore | Title4Description_4


Output Data Expected using Spark SQL:
SequenceID | Name    | City      | Title    | Description
1          | Amol    | Kolhapur  | Title1.1 | Description_1.1
1          | Amol    | Kolhapur  | Title1.2 | Description_1.2
1          | Amol    | Kolhapur  | Title1.3 | Description_1.3
2          | Suresh  | Mumbai    | Title2   | Description_2
3          | Vishal  | Delhi     | Title3.1 | Description_3.1
4          | Swastik | Bangalore | Title4   | Description_4


I am able to parse single XML using below approach in spark-shell using example 
below but how do we apply the same recursively for all rows ?
https://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html.

val dfX = 
sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load("books.xml")

val xData = dfX.registerTempTable("books")

dfX.printSchema()

val books_inexp =sqlContext.sql("select title,author from books where price<10")

books_inexp.show


Regards,
Amol
This message contains information that may be privileged or confidential and is 
the property of the Capgemini Group. It is intended only for the person to whom 
it is addressed. If you are not the intended recipient, you are not authorized 
to read, print, retain, copy, disseminate, distribute, or use this message or 
any part thereof. If you receive this message in error, please notify the 
sender immediately and delete all copies of this message.


Re: Bizzare diff in behavior between scala REPL and sparkSQL UDF

2017-06-20 Thread jeff saremi
never mind!

I had a space at the end of my data which was not showing up in manual testing.

thanks


From: jeff saremi <jeffsar...@hotmail.com>
Sent: Tuesday, June 20, 2017 2:48:06 PM
To: user@spark.apache.org
Subject: Bizzare diff in behavior between scala REPL and sparkSQL UDF


I have this function which does regex matching in Scala. When I test it in the
REPL, I get the expected results.

When I use it as a UDF in SparkSQL, I get completely incorrect results.


Function:

class UrlFilter(filters: Seq[String]) extends Serializable {
  val regexFilters = filters.map(new Regex(_))
  regexFilters.foreach(println)

  def matches(s: String): Boolean = {
    if (s == null || s.isEmpty) return false
    regexFilters.exists(f => {
      print("matching " + f + " against " + s)
      s match {
        case f() => { println("; matched! returning true"); true }
        case _   => { println("; did NOT match. returning false"); false }
      }
    })
  }
}

Instantiating it with a pattern like:
^[^:]+://[^.]*\.company[0-9]*9\.com$

(matches a url that has company in the name and a number that ends in digit 9)
Test it in Scala REPL:

scala> val filters = Source.fromFile("D:\\cosmos-modules\\testdata\\fakefilters.txt").getLines.toList

scala> val urlFilter = new UrlFilter(filters)

scala> urlFilter.matches("ftp://ftp.company9.com")
matching ^[^:]+://[^.]*\.company[0-9]*9\.com$ against ftp://ftp.company9.com; matched! returning true
res2: Boolean = true


Use it in SparkSQL:

val urlFilter = new UrlFilter(filters)
sqlContext.udf.register("filterListMatch", (url: String) => urlFilter.matches(url))

val nonMatchingUrlsDf = sqlContext.sql("SELECT url FROM distinctUrls WHERE NOT filterListMatch(url)")

Look at the debug prints in the console:
matching ^[^:]+://[^.]*\.company[0-9]*9\.com$ against ftp://ftp.company9.com ; did NOT match. returning false

I have repeated this several times to make sure I'm comparing apples to apples.
I am using Spark 1.6 and Scala 2.10.5 with Java 1.8.
thanks
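
Since the root cause turned out to be a stray trailing space in the data, a defensive variant of matches (just a sketch, not the original code) would normalise the input before matching, so an anchored pattern ending in \.com$ cannot be defeated by whitespace:

def matches(s: String): Boolean = {
  // trim guards against leading/trailing whitespace coming in from the SQL data
  val url = Option(s).map(_.trim).getOrElse("")
  url.nonEmpty && regexFilters.exists(f => f.pattern.matcher(url).matches())
}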



